operations.
Figure 2.11. Tensor coprocessor data path
The coprocessor data path is designed by assuming that the activations and weights, respectively, have row-major and column-major layout in memory, in order to avoid the complexities of Morton memory indexing (Rovder et al. 2019). Due to the mixed-precision arithmetic, matrix operands may take one, two or four consecutive registers, with element sizes of one, two, four and eight bytes. In all cases, the coprocessor operations interpret matrix operands as having four rows and a variable number of columns, depending on the number of consecutive registers and the element size. In order to support this invariant, four 32-byte “load-scatter” instructions are provided to coprocessor registers. A load-scatter instruction loads 32 consecutive bytes from memory, interprets these as four 64-bit (8 bytes) blocks and writes each block into a specified quarter of each register that composes the destination operand (Figure 2.12). After executing the four load-scatter variants, a 4×P submatrix of a matrix with row-major order in memory is loaded into a coprocessor register quadruple.
The coprocessor implements matrix multiply-accumulate operations on INT8.32, INT16.64 and FP16.32 arithmetic1. The coprocessor is able to multiply-accumulate 4 × 8 by 8 × 4 8-bit matrices into a 4 × 4 32-bit matrix (128 MAC operations per clock cycle), held in two consecutive registers (Figure 2.13). The 8 × 4 8-bit matrix operand is actually a 4 × 8 operand that is transposed at the input port of the multiply-accumulate operation. The coprocessor may also perform multiply-accumulate operations of two 4 × 4 16-bit matrices into a 4 × 4 64-bit matrix (64 MAC operations per clock cycle), held in four consecutive registers. Finally, the coprocessor supports multiply-accumulate of two 4 × 4 FP16 matrices into a 4 × 4 FP32 matrix, but performed by four successive operations2 (16 FMA operations per clock cycle). The FP16.32 matrix operations actually compute exact four-deep dot products with accumulation, by applying Kulisch’s principles on an 80+ϵ accumulator (Brunie 2017).
Figure 2.12. Load-scatter to a quadruple register operand
Figure 2.13. INT8.32 matrix multiply-accumulate operation
2.4. The MPPA3 software environments
2.4.1. High-performance computing
The programming environment used for high-performance computing on the MPPA3 processor is derived from the Portable Computing Language (PoCL) project3, which proposes an open-source implementation of the OpenCL 1.2 standard4 with support for some of the OpenCL 2.0 features. The OpenCL-C kernels are compiled with LLVM, which has been retargeted for this purpose to the Kalray VLIW core. In OpenCL, a host application offloads computations to an abstract machine:
– an OpenCL device is an offloading target where computations are sent using a command queue;
– an OpenCL device has a global memory allocated and managed by the host application, and shared by the multiple compute units of the OpenCL device;
– an OpenCL compute unit comprises several processing elements (PEs) that share the compute unit local memory;
– each OpenCL PE also has a private memory, and shared access to the device’s global memory without cache coherence across compute units.
The OpenCL sub-devices are defined as non-intersecting sets of compute units inside a device, which have dedicated command queues while sharing the global memory.
On the MPPA3 processor, high-performance computing functions are dispatched to partitions composed of one or more compute clusters, each of which is exposed as an OpenCL sub-device. In the port of the PoCL environment, support for OpenCL sub-devices has been developed, while two offloading modes are provided:
LWI (Linearized Work Items): all the work items of a work group are executed within a loop on a single PE. This is the default execution mode of PoCL;
SPMD (Single Program Multiple Data): the work items of a work group are executed concurrently on the PEs of a compute cluster, with the _ _local
OpenCL memory space shared by the PEs and located into the SMEM (Figure 2.14).
These mappings of the abstract OpenCL machine elements onto the MPPA3 architecture components are presented in Table 2.4. Although the LWI mode appears better suited to the OpenCL-C kernel code written for GPGPU processors, the SPMD mode is preferred for optimizing performance, as it allows the configuration of most of the compute cluster SMEM as OpenCL local memory shared by the work group.
Figure 2.14. OpenCL NDRange execution using the SPMD mode
OpenCL | Device | Global memory | Sub-device | Compute unit |
MPPA3 | MPPA processor or | External DDR | Group of | Compute cluster (SPMD) |
Component | MPPA domain | memory | compute cluster(s) | Processing element (LWI) |
Table 2.4. OpenCL machine elements and MPPA architecture components
Most often, there is a need to port C/C++ code and to access the high-performance features implemented in the GCC compiler for the Kalray VLIW core. Among these, the C named address space extension defined by ISO/IEC TR 18037:2008 is used to annotate objects and addresses that are accessed using non-temporal (L1D cache bypass) and/or non-trapping loads. In order to call the code compiled by GCC and the MPPA communication libraries (Hascoët et al. 2017) from OpenCL-C kernels, the LLVM OpenCL-C compiler and PoCL have been extended to understand function declarations annotated with _ _attribute_ _ ((mppa_native))
. Whenever such reference is seen in OpenCL-C source code, the PoCL linking stages assumes that the symbol is resolved, and the MPPA3 compute cluster run-time environment dynamically loads and links the native function, before starting the execution of the kernel.
This native function extension also enables kernels to access other services such as a lightweight lock-free POSIX multi-threading environment, fast inter-PE hardware synchronizations, dynamic local memory allocation and remoting of system calls to the host OS, including FILE I/O.
2.4.2. KaNN code generator