POWER9 has two types of cores: SMT4 and SMT8, where the latter has twice the fetch/decode capacity of the former. The L2 cache is private to an SMT8 core, but if we use SMT4 cores, it is shared between two cores. The level 3 cache is shared, banked, and built out of eDRAM. Because eDRAM has high density, as we said earlier, L3 is a massive 120 MB and has nonuniform cache access (NUCA). This cache is divided into 12 regions with 20-way set associativity per region. A region is local to one SMT8 core, or to two SMT4 cores, but can be accessed by the other cores with higher latency (hence NUCA). The on-chip bandwidth is 7 TB/s (terabytes per second). If we leave the chip to access the main memory, POWER9 has a bandwidth of up to 120 GB/s to DDR4 memory. These numbers are important because they give you an indication of how slow or fast getting your data from memory is, and how crucial it is to have a cache-friendly memory access pattern.
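As a rough illustration of what these numbers imply, the following sketch estimates the ideal time to move a fixed amount of data at each of the two bandwidths quoted above, and the size of one L3 region. The 1 GB working set and the helper name `transfer_seconds` are assumptions chosen for illustration; the estimate uses peak bandwidth only and ignores latency and contention.

```python
# Back-of-the-envelope transfer times using the bandwidths quoted in the text.
# Peak figures only: latency, contention, and access pattern are ignored.

ON_CHIP_BW = 7e12         # 7 TB/s on-chip bandwidth (from the text)
DRAM_BW = 120e9           # 120 GB/s to DDR4 (from the text)
L3_BYTES = 120 * 2**20    # 120 MB L3 cache (from the text)
L3_REGIONS = 12           # number of L3 regions (from the text)


def transfer_seconds(num_bytes, bandwidth):
    """Ideal time to move num_bytes at the given bandwidth (bytes/second)."""
    return num_bytes / bandwidth


working_set = 1e9  # hypothetical 1 GB working set (an assumption)
t_chip = transfer_seconds(working_set, ON_CHIP_BW)
t_dram = transfer_seconds(working_set, DRAM_BW)
region_mb = L3_BYTES / L3_REGIONS / 2**20

print(f"on-chip: {t_chip * 1e3:.3f} ms")
print(f"DRAM:    {t_dram * 1e3:.3f} ms")
print(f"DRAM/on-chip time ratio: {t_dram / t_chip:.1f}")
print(f"L3 region size: {region_mb:.0f} MB")
```

Under these assumptions, going off-chip costs well over an order of magnitude more time per byte than staying on-chip, which is the quantitative reason a cache-friendly access pattern matters.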
For big problem sizes, you will use a machine with several multicore processors and accelerators (such as GPUs). Therefore, it is important to know the bandwidth available to you from the processor to the accelerator because it affects your decision to outsource the problem to the accelerator or do it in-house in the multicore itself. POWER9 is equipped with PCIe (PCI Express) generation 4 with 48 lanes (a single lane gives about 1.9 GB/s), a 16 GB/s interface for connecting neighboring sockets, and a 25 GB/s interface that can be used by externally connected accelerators or I/O devices.
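The outsourcing decision can be sketched as a simple comparison: offloading pays off only if the transfer time plus the accelerator's compute time is shorter than the time the multicore would need on its own. In the sketch below, the function name `offload_pays_off`, the 16-lane link, and all timing values are hypothetical; only the per-lane bandwidth and lane count come from the text.

```python
# Rough offload decision: is shipping the data to an accelerator worth it?

PCIE_LANE_BW = 1.9e9   # about 1.9 GB/s per PCIe Gen4 lane (from the text)
PCIE_LANES = 48        # PCIe lanes on POWER9 (from the text)


def offload_pays_off(num_bytes, cpu_seconds, accel_seconds, link_bw):
    """Return True if transfer time plus accelerator time beats the CPU time.

    num_bytes counts all data moved over the link, in both directions.
    """
    transfer = num_bytes / link_bw
    return transfer + accel_seconds < cpu_seconds


# Peak bandwidth if all 48 lanes were used together.
aggregate_bw = PCIE_LANES * PCIE_LANE_BW

# Assumption: the accelerator is attached through 16 of those lanes.
x16_bw = 16 * PCIE_LANE_BW

# Hypothetical workload: 4 GB moved in total, accelerator needs 0.2 s.
big_win = offload_pays_off(4e9, cpu_seconds=2.0, accel_seconds=0.2, link_bw=x16_bw)
small_win = offload_pays_off(4e9, cpu_seconds=0.25, accel_seconds=0.2, link_bw=x16_bw)

print(f"aggregate PCIe bandwidth: {aggregate_bw / 1e9:.1f} GB/s")
print(f"offload when CPU takes 2.0 s:  {big_win}")
print(f"offload when CPU takes 0.25 s: {small_win}")
```

In the second case the accelerator is faster at the computation itself, yet the transfer over the link erases the gain. This is why the link bandwidth belongs in the decision.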
Multicore processors represent one of the pieces of the puzzle of heterogeneous computing. But there are other chips that are much better than multicore processors for certain types of applications. Here, “much better” means that they deliver better performance per watt. One of these well-known chips, which is playing a big role in our current era of artificial intelligence and big data, is the graphics processing unit (GPU).
2.2 GPUs
Multicore processors are MIMD in Flynn’s classification. MIMD is very generic and can implement all other types. But if we have an application that is single instruction (or program or thread)–multiple data, then a multicore processor may not be the best choice [Kang et al. 2011]. Why is that? Let’s explain the reason with an example. Suppose we have the matrix-vector multiplication operation that we saw in the previous chapter (repeated here in Algorithm 2.1 for convenience). If we write this program in a multithreaded way and execute it on a multicore processor, where each thread is responsible for calculating a subset of the vector Y, then each core must fetch, decode, and issue the instructions for its own threads, even though all the threads execute the same instructions. This does not affect the correctness of the execution but is a waste of time and energy.
Algorithm 2.1 AX = Y: Matrix-Vector Multiplication
for i = 0 to m - 1 do
    y[i] = 0;
    for j = 0 to n - 1 do
        y[i] += A[i][j] * X[j];
    end for
end for
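The multithreaded version described above, in which each thread computes a subset of Y, can be sketched as follows. This is a minimal illustration of the row partitioning only, not a tuned implementation: the function names and the default thread count are assumptions, and in standard Python the threads will not give a real speedup for this loop, so the sketch shows how the work is divided rather than how fast it runs.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec_rows(A, X, row_start, row_end):
    """Compute y[row_start:row_end] of Y = A X (the loops of Algorithm 2.1)."""
    partial = []
    for i in range(row_start, row_end):
        acc = 0
        for j in range(len(X)):
            acc += A[i][j] * X[j]
        partial.append(acc)
    return partial


def matvec_threaded(A, X, num_threads=4):
    """Split the rows of A into contiguous chunks, one chunk per thread."""
    m = len(A)
    # Ceiling division; at least 1 so that an empty matrix is handled.
    chunk = max(1, (m + num_threads - 1) // num_threads)
    ranges = [(s, min(s + chunk, m)) for s in range(0, m, chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Every thread runs the same code on a different range of rows.
        parts = pool.map(lambda r: matvec_rows(A, X, r[0], r[1]), ranges)
        # map() yields results in submission order, so Y stays ordered.
        return [y for part in parts for y in part]


A = [[1, 2], [3, 4], [5, 6]]
X = [10, 1]
Y = matvec_threaded(A, X, num_threads=2)
print(Y)
```

Note that every thread executes exactly the same instructions in `matvec_rows`; only the row range differs. On a multicore processor, each core still fetches and decodes those identical instructions separately.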
If we now try to execute the same program on a GPU, the situation will be different. SIMD architectures such as GPUs have several execution units (named differently by different companies) that share the same front end for fetching, decoding, and issuing instructions, thus amortizing the overhead of that part. This also saves a lot of chip real estate, which can be used for more execution units, resulting in much better performance.
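The amortization argument can be made concrete with a toy count of front-end work. The model below is a deliberate simplification and not a description of any real chip: it assumes that in the MIMD case every thread's instructions are fetched and decoded individually, while in the SIMD case a group of threads, one per execution unit, shares a single fetch/decode per instruction. The function names and the numbers in the example are hypothetical.

```python
import math


def frontend_ops_mimd(num_threads, instrs_per_thread):
    """Toy model: each thread's instructions are fetched/decoded separately."""
    return num_threads * instrs_per_thread


def frontend_ops_simd(num_threads, instrs_per_thread, lanes):
    """Toy model: `lanes` execution units share one front end.

    Threads are grouped `lanes` at a time, and each group needs only one
    fetch/decode per instruction.
    """
    groups = math.ceil(num_threads / lanes)
    return groups * instrs_per_thread


# Hypothetical workload: 1024 threads, 100 instructions each, 32 lanes.
mimd = frontend_ops_mimd(1024, 100)
simd = frontend_ops_simd(1024, 100, lanes=32)
print(f"MIMD front-end operations: {mimd}")
print(f"SIMD front-end operations: {simd}")
print(f"reduction factor: {mimd / simd:.0f}")
```

In this simplified model, the front-end work shrinks by a factor equal to the number of execution units that share one front end, provided the thread count is a multiple of that number.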
Figure 2.3 shows a generic GPU. Each block that is labeled lower level scheduling can be seen as a front end and many execution units. Each execution unit is responsible for calculating one or more elements of vector Y in the example of Algorithm 2.1. Why do we have several blocks then? There are several reasons. First, threads assigned to the different execution units within the same block can exchange data and synchronize with each other. It would be extremely expensive to do that among all the execution units of the whole chip, as there are hundreds in small GPUs and thousands in high-end GPUs. So this distributed design makes the cost manageable. Second, it gives some flexibility. You can execute different SIMD-friendly applications on different blocks. This is why we have the high-level scheduling shown in the figure. Execution units of different blocks can communicate, albeit slowly, through the memory shared among all the blocks. This memory is labeled “memory hierarchy” in the figure because in some designs there are some cache levels above the global memory, as well as specialized memories like texture memory.
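To illustrate how an execution unit can end up responsible for one or more elements of Y, the sketch below distributes the indices of Y over a two-level hierarchy of blocks and execution units. The mapping shown (flatten the hierarchy, then wrap around when there are more elements than execution units) is only one possible assignment, chosen here as an assumption; the function name and the sizes in the example are hypothetical.

```python
def assign_elements(m, num_blocks, units_per_block):
    """Map each index i of Y to a (block, unit) pair.

    Indices wrap around when m exceeds the total number of execution
    units, so a unit may be responsible for several elements of Y.
    """
    total_units = num_blocks * units_per_block
    work = {}
    for i in range(m):
        flat = i % total_units                      # position in the flat list of units
        block, unit = divmod(flat, units_per_block)  # split into (block, unit)
        work.setdefault((block, unit), []).append(i)
    return work


# Hypothetical tiny GPU: 2 blocks of 3 execution units, vector Y of length 10.
work = assign_elements(10, num_blocks=2, units_per_block=3)
for (block, unit), indices in sorted(work.items()):
    print(f"block {block}, unit {unit}: y{indices}")
```

Consecutive indices land in the same block first, so neighboring elements of Y are handled by execution units that can exchange data and synchronize cheaply.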
Figure 2.3 Generic GPU Design
The confusing thing about GPUs is that each brand has its own naming convention. In NVIDIA parlance, those blocks are called streaming multiprocessors (SM, or SMX in later versions) and the execution units are called streaming processors (SPs) or CUDA cores. In AMD parlance, those blocks are called shader engines and the execution units are called compute units. In Intel parlance, the blocks are called slices (or sub-slices) and the execution units are called just that: execution units. There are slight differences among the designs, but the main idea is almost the same.
GPUs can be discrete, that is, stand-alone chips connected to the other processors using connections like PCIe or NVLink, or they can be embedded with the multicore processor in the same chip. On the one hand, the discrete ones are of course more powerful because they have more real estate. But they suffer from the communication overhead of sending the data back and forth between the GPU’s memory and the system’s memory [Jablin et al. 2011], even if the programmer sees a single virtual address space. On the other hand, the embedded GPUs, like Intel GPUs and many AMD APUs, are tightly coupled with the multicore and do not suffer from communication overhead. However, embedded GPUs have limited area because they share the chip with the processor and hence are weaker in terms of performance.
If you have a discrete GPU in your system, there is a high chance you also have an embedded GPU in your multicore chip, which means you can make use of a multicore processor, an embedded GPU, and a discrete GPU, which is a nice exercise of heterogeneous programming!
Let’s see an example of a recent GPU: the Volta architecture V100 from NVIDIA [2017]. Figure 2.4 shows the block diagram of the V100. The GigaThread engine at the top of the figure is what we called high-level scheduling in our generic GPU of Figure 2.3. Its main purpose is to schedule blocks to SMs. A block, in NVIDIA parlance, is a group of threads, doing the same operations on different data, assigned to the same SM so that they can share data more easily and synchronize. There is an L2 cache shared by all the SMs, and it is the last-level cache (LLC) before going off-chip to the GPU global memory, which is not shown in the figure. NVIDIA packs several SMs together in what are called GPU processing clusters (GPCs). In Volta there are six GPCs; each one has 14 SMs. You can think of a GPC as a small full-fledged GPU, with its SMs, raster engines, etc. The main players, which actually do the computations, are the SMs.
Figure 2.4 NVIDIA V100 GPU block diagram. (Based on NVIDIA, 2017. NVIDIA Tesla V100 GPU architecture)
Figure 2.5 NVIDIA V100 GPU streaming multiprocessor. (Based on NVIDIA, 2017. NVIDIA Tesla V100 GPU architecture)
Figure 2.5 shows the internal configuration of a single SM. Each SM is equipped with an L1 data cache and a shared memory. The main difference is that the cache is totally transparent to the programmer. The shared memory is controllable by the programmer and can be used