unit is less efficient than an ASIC implementation of the same compute unit. However, the ASIC implementation is static, and cannot adapt after design tape out. We will examine more coarse-grain alternatives for reconfigurable compute units in Section 4.4.
Examples
• Accelerators—early GPUs, mpeg/media decoders, crypto accelerators
• Programmable Cores—modern GPUs, general purpose cores, ASIPs
• Future designs may feature accelerators in primary computational role
• Some programmable cores and or programmable fabric are still included for generality/longevity
Section 3 covers the customization of processor cores and Section 4 covers coprocessors and accelerators. We split compute components into two sections to better manage the diversity of the design space for these components.
2.1.2 ON-CHIP MEMORY HIERARCHY
Chips are fundamentally pin-limited, which impacts the amount of bandwidth that can be supplied to the compute units described in the previous section. This is further exacerbated by limitations in DRAM scaling. On-chip memory is one technique to mitigate this. On-chip memory can be used in a variety of ways, from providing data buffering for streaming data from off-chip to providing a place to store data that will be reused multiple times for computation. Once again, different applications will have different on-chip memory requirements. As with compute units, there are a wide array of design choices for hardware architects to consider in the design of the memory hierarchy.
Transparency to Software
A cache is relatively small, but fast memory which leverages the principle of locality to reduce the latency to access memory. Data which will be reused in the near future is kept in the cache to avoid accesses to longer latency memory. There are two primary approaches to managing a cache (i.e., orchestrating what data comes into the cache and what data leaves the cache): purely hardware approaches and software managed caches. In this book, we will use the term scratchpad to refer to software-based caches, where the application writer or the compiler will be responsible for explicitly bringing data into and out of the cache through special instructions. Hardware caches where actual control circuits orchestrate data movement without software intervention will just be referred to as caches in this book. Scratchpads have tremendous potential for application-specific customization since cache management can be tuned to a particular application, but they also come with coding overhead as the programmer or compiler writer must explicitly map out this orchestration. Conventional caches are more flexible, as they can handle a wider array of applications without requiring explicit management, and may be preferable in cases where the access pattern is unpredictable and therefore requires dynamic adaptation.
Sharing
On-chip memory may be kept private to a particular compute unit or may be shared among multiple compute units. Private on-chip memory means that the application will not need to contend for space with another application, and will get the full benefit of the on-chip memory. Shared on-chip memory can amortize the cost of on-chip memory over several compute units, providing a potentially larger pool of space that can be leveraged by these compute units than if the space was partitioned among the units as private memory. For example, four compute units can each have 1MB of on-chip memory dedicated to them. Each compute unit will always have 1MB regardless of the demand from other compute units. However, if four compute units share 4MB of on-chip memory, and if the different compute units use different amounts of memory, one compute unit may, for example, use more than 1MB of space at a particular time since there is a large pool of memory available. Sharing works particularly well when compute units use different amounts of memory at different times. Sharing is also extremely effective when compute units make use of the same memory locations. For example, if compute units are all working on an image in parallel, storing the image in a single memory shared among the units allows the compute units to more effectively cooperate on the shared data.
2.1.3 NETWORK-ON-CHIP
On-chip memory stores the data needed by the compute units, but an important part of the overall CSoC is the communication infrastructure that allows this stored data to be distributed to the compute units, that allows the data to be delivered to/from the on-chip memory from/to the memory interfaces that communicate off-chip, and that allows compute units to synchronize and communicate with one another. In many applications there is a considerable amount of data that must be communicated to the compute units used to accelerate application performance. And with multiple compute units often employed to maximize data level parallelism, there are often multiple data streams being communicated around the CSoC. These requirements transcend the conventional bus-based design of older multicore designs, with designers instead choosing network-on-chip (NoC) designs. NoC designs enable the communication of more data between more CSoC components.
Components interfacing with an NoC typically bundle transmitted data into packets, which contain at least address information as to the desired communication destination and the payload itself, which is some portion of the data to be transmitted to a particular destination. NoCs transmit messages via packets to enable flexible and reliable data transport—packets may be buffered at intermediate nodes within the network or reordered in some situations. Packet-based communication also avoids long latency arbitration that is associated with communication in a single hop over an entire chip. Each hop through a packet-based NoC performs local arbitration instead.
The creation of an NoC involves a rich set of design decisions that may be highly customized for a set of applications in a particular domain. Most design decisions impact the latency or bandwidth of the NoC. In simple terms, the latency of the NoC is how long it takes a given piece of data to pass through the NoC. The bandwidth of the NoC is how much data can be communicated in the NoC at a particular time. Lower latency may be more important for synchronizing communication, like locks or barriers that impact multiple computational threads in an application. Higher bandwidth is more important for applications with streaming computation (i.e., low data locality) for example.
One example of a design decision is the topology of an NoC. This refers to the pattern of links that connect particular components of the NoC. A simple topology is a ring, where each component in the NoC is connected to two neighboring components, forming a chain of components. More complex communication patterns may be realized by more highly connected topologies that allow more simultaneous communication or a shorter communication distance.
Another example is the bandwidth of an individual link in the topology—wire that is traversed in one cycle of the network’s clock. Larger links can improve bandwidth but require more buffering space at intermediate network nodes, which can increase power cost.
An NoC is typically designed with a particular level of utilization in mind, where decisions like topology or link bandwidth are chosen based on an expected level of service. For example, NoCs may be designed for worst case behavior, where the bandwidth of individual links is sized for peak traffic requirements, and every path in the network is capable of sustaining that peak bandwidth requirement. This is a flexible design in that the worst case behavior can manifest on any particular communication path in the NoC, and there will be sufficient bandwidth to handle it. But it can mean overprovisioning the NoC if worst case behavior is infrequent or sparsely exhibited in the NoC. In other words, the larger bandwidth components can mean wasted power (if only static power) or area in most cases. NoCs may also be designed for average case behavior, where the bandwidth is sized according to the average traffic requirement, but in such cases performance can suffer when worst case behavior is exhibited.
Topological Customization
Customized designs can specialize different parts of the NoC for different communication patterns seen in applications within a domain. For example, an architecture may specialize the NoC such that there is a high bandwidth connection between a memory interface and a particular compute unit that performs workload balancing and sorting for particular tasks, and then there is a lower bandwidth connection between that compute unit for workload balancing and