The KaNN (Kalray Neural Network) code generator is a deep learning inference compiler targeting the MPPA3 platform. It takes as input a trained neural network model, described within a standard framework such as Caffe, TensorFlow or ONNX, and produces executable code for a set of compute clusters exposed as an OpenCL sub-device (Figure 2.15). Targeting OpenCL sub-devices allows several model inferences to execute concurrently on a single MPPA3 processor. The KaNN code generator optimizes for batch-1 inference, with the primary objective of reducing latency. At the user’s option, FP32 operators in the original network can be converted to FP16 operators. Integer quantization, such as the one used by TensorFlow Lite, is also supported; however, it must be expressed in the input model. Indeed, such models are assumed to be trained with fake quantization (Jacob et al. 2018), which must match the actual quantization applied during inference.
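The requirement that fake quantization match the quantization applied at inference can be illustrated with an 8-bit affine scheme in the style of Jacob et al. (2018), where a real value is approximated by scale * (q - zero_point). The following is a minimal sketch under that assumption; the type and function names are hypothetical and are not part of KaNN.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical parameters of an 8-bit affine quantization:
   real_value ~= scale * (q - zero_point). */
typedef struct {
    float scale;
    int32_t zero_point;
} quant_params;

/* Quantize a real value to uint8: saturate to [0, 255], then round to nearest. */
static uint8_t quantize_u8(float r, quant_params p)
{
    assert(p.scale > 0.0f);
    float x = r / p.scale + (float)p.zero_point;
    if (x < 0.0f) x = 0.0f;
    if (x > 255.0f) x = 255.0f;
    return (uint8_t)(x + 0.5f);   /* x is non-negative here */
}

/* Map a quantized value back to the real domain. */
static float dequantize_u8(uint8_t q, quant_params p)
{
    return p.scale * (float)((int32_t)q - p.zero_point);
}

/* Fake quantization, as inserted in the training graph: the value stays
   in floating point but only takes values representable after quantization. */
static float fake_quantize_u8(float r, quant_params p)
{
    return dequantize_u8(quantize_u8(r, p), p);
}
```

If training used `fake_quantize_u8` with one set of parameters and inference used `quantize_u8` with another, the activations seen at inference would differ from those the network was trained on, which is why both must agree.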
Figure 2.15. KaNN inference code generator workflow
Following the import of the input model into an intermediate representation, optimizations are applied to the compute graph:
– elimination of channel concatenation and slicing copies;
– padding of input activations of convolutional layers;
– folding of batch normalizations, scalings and additions into a single pointwise fused multiply-add operator;
– fusion of convolutions with ReLU activation functions;
– adaptation of arithmetic representations.
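The folding of pointwise operators listed above rests on the fact that batch normalization, scaling and addition are all per-channel affine maps, so any chain of them composes into a single y = a * x + b. The sketch below shows this composition for one channel. It assumes the inverse standard deviation 1 / sqrt(variance + epsilon) has been precomputed offline, and all names are hypothetical rather than taken from KaNN.

```c
#include <assert.h>
#include <stddef.h>

/* Per-channel affine operator y = a * x + b. */
typedef struct { float a, b; } affine;

/* Batch normalization y = gamma * (x - mean) * inv_std + beta, rewritten
   as an affine operator. inv_std is assumed to be precomputed. */
static affine affine_from_batchnorm(float gamma, float beta,
                                    float mean, float inv_std)
{
    affine f = { gamma * inv_std, beta - gamma * inv_std * mean };
    return f;
}

/* Compose two affine operators: apply `first`, then `second`. */
static affine affine_compose(affine first, affine second)
{
    affine f = { second.a * first.a, second.a * first.b + second.b };
    return f;
}

/* Apply the folded operator to n values of one channel:
   one multiply-add per element. */
static void affine_apply(affine f, const float *x, float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] = f.a * x[i] + f.b;
}
```

For instance, a batch normalization followed by a scaling by 3 and an addition of 5 folds into one `affine` value, so the activations are traversed once instead of three times.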
The KaNN code generation scheme performs inference in topological sort order of the (optimized) compute graph, parallelizing the execution of each operator over all the compute clusters of the target sub-device. When executing an operator, its input and output activations are distributed across the target local memories configured as SPM, while the network parameters are read from the (external) DDR memory. Depending on the type of operator (convolutional or fully connected), the spatial dimension sizes and the channel depth, input and output activations are distributed over the compute cluster local memories by splitting either along the spatial dimensions or along the channel dimension (Figure 2.16):
– In case of spatial splitting of the output activations, each compute cluster only accesses an input activation tile and its shadow region, while all the operator parameters are required; these are read once from the DDR memory and multicast to all the target compute clusters.
– In case of channel splitting of the output activations, the full input layer must be replicated in the local memory of each compute cluster, but only the corresponding slice of parameters is read from the DDR memory.
In all cases, activations are computed once, laid out sequentially along the channel dimension and possibly copied to other local memories.
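For the spatial splitting case, the input tile and shadow region needed by a cluster follow from the rows of output it produces and from the convolution geometry. The sketch below assumes a split along rows only, an even distribution of output rows and a standard convolution described by kernel size, stride and padding; these choices and all names are illustrative assumptions, not the partitioning KaNN actually computes.

```c
#include <assert.h>

/* Half-open row interval [begin, end). */
typedef struct { int begin, end; } range;

/* Output rows assigned to `cluster` when out_rows rows are split
   as evenly as possible among num_clusters clusters. */
static range output_rows(int out_rows, int num_clusters, int cluster)
{
    assert(num_clusters > 0 && cluster >= 0 && cluster < num_clusters);
    range r;
    r.begin = (int)((long)out_rows * cluster / num_clusters);
    r.end = (int)((long)out_rows * (cluster + 1) / num_clusters);
    return r;
}

/* Input rows read to produce the output rows `out`: the cluster's own
   tile plus the shadow rows shared with its neighbours, clamped to the
   actual input (rows outside it are padding). */
static range input_rows(range out, int in_rows, int kernel, int stride, int pad)
{
    range r = { 0, 0 };
    if (out.begin >= out.end)
        return r;                       /* no output rows, no input needed */
    r.begin = out.begin * stride - pad;
    r.end = (out.end - 1) * stride - pad + kernel;
    if (r.begin < 0) r.begin = 0;
    if (r.end > in_rows) r.end = in_rows;
    return r;
}
```

With 16 rows, four clusters and a 3x3 convolution of stride 1 and padding 1, the second cluster produces output rows 4 to 7 and reads input rows 3 to 8, that is, one shadow row on each side of its tile.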
Figure 2.16. Activation splitting across MPPA3 compute clusters
For any compute cluster in the target sub-device, the code generation process defines and implements a local schedule for:
– local memory buffer allocations/deallocations;
– DDR memory read/multicast of parameters;
– execution of operator computations;
– inter-cluster activation exchanges;
– inter-cluster synchronizations.
This process is backed by the computation graph (Figure 2.17) augmented with parameter read tasks (yellow) and activation production tasks (blue).
The result of KaNN code generation is a collection of OpenCL binary kernels, where each kernel interprets the contents of a static data block composed of a sequence of records. Each record contains its length, a native compute function pointer and a structure containing arguments for the compute function. For each record, the OpenCL kernel calls the native compute function with the pointer to the structure. The kernel ends after the interpretation of the last record.
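The record interpretation described above can be sketched in plain C as follows. The record layout, the two toy compute functions and the use of a block size to detect the last record are assumptions made for illustration; the actual KaNN data block format is not given here.

```c
#include <assert.h>
#include <stddef.h>

/* Native compute function: receives a pointer to its argument structure. */
typedef void (*compute_fn)(const void *args);

/* Hypothetical record header; the argument structure follows it. */
typedef struct {
    size_t length;     /* size of the whole record in bytes */
    compute_fn fn;     /* native compute function to call */
} record_header;

/* Two toy "operators" acting on an integer standing in for activations. */
typedef struct { int *target; int value; } add_args;
typedef struct { int *target; int factor; } mul_args;

static void op_add(const void *p) { const add_args *a = p; *a->target += a->value; }
static void op_mul(const void *p) { const mul_args *a = p; *a->target *= a->factor; }

typedef struct { record_header hdr; add_args args; } add_record;
typedef struct { record_header hdr; mul_args args; } mul_record;

/* The interpreter assumes arguments start right after the header. */
_Static_assert(offsetof(add_record, args) == sizeof(record_header),
               "arguments must follow the header without padding");
_Static_assert(offsetof(mul_record, args) == sizeof(record_header),
               "arguments must follow the header without padding");

/* Walk the block record by record, calling each compute function with
   a pointer to its arguments. Returns the number of records executed. */
static size_t interpret(const void *block, size_t size)
{
    const unsigned char *p = block;
    const unsigned char *end = p + size;
    size_t executed = 0;
    while (p < end) {
        const record_header *hdr = (const record_header *)p;
        assert(hdr->length >= sizeof *hdr);
        assert(hdr->length <= (size_t)(end - p));
        hdr->fn(p + sizeof *hdr);
        p += hdr->length;       /* the length field locates the next record */
        executed++;
    }
    return executed;
}

/* Static data block with two records: add 5, then multiply by 3. */
static int accumulator = 1;
static const struct { add_record r0; mul_record r1; } demo_block = {
    { { sizeof(add_record), op_add }, { &accumulator, 5 } },
    { { sizeof(mul_record), op_mul }, { &accumulator, 3 } },
};
```

Calling `interpret(&demo_block, sizeof demo_block)` executes both records in order. In this scheme the kernel itself contains no network-specific control flow: the schedule is entirely encoded in the data block.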
Figure 2.17. KaNN augmented computation graph
2.4.3. High-integrity computing
High-integrity computing on the MPPA3 processor refers to applications that execute in a physically isolated domain of the processor, whose functions are developed under model-based design and must meet hard real-time constraints. The Research Open-Source Avionics and Control Engineering (ROSACE) case study introduced the model-based design of avionics applications that targeted multi-core platforms (Pagetti et al. 2014). The model-based design for the MPPA processor focuses on mono-periodic and multi-periodic harmonic applications (Figure 2.18) that are described using the Lustre (Halbwachs et al. 1991) or the SCADE Suite synchronous dataflow languages (Graillat et al. 2018, 2019). The execution environment is composed of one or more clusters configured for asymmetric multi-processing (section 2.3.2), where each core is logically associated with one SMEM bank, and where tasks run to completion.
Figure 2.18. ROSACE harmonic multi-periodic case study (Graillat et al. 2018)
The code generation workflow assumes that some nodes of the synchronous dataflow program are identified by the programmer as concurrent tasks, and defines the implicit top-level “root” task. A Lustre or SCADE Suite high-level compiler generates C-code for this set of tasks, communicating and synchronizing through one-to-one channels. Channels correspond to single-producer, single-consumer FIFOs of depth one, whose implementation is abstracted from the task C-code under SEND and RECV macros. The rest of the code generation workflow involves:
– providing workers, each able to execute a set of tasks sequentially;
– scheduling and mapping the set of tasks on the workers;
– implementing the communication channels and their SEND/RECV methods;
– compiling C-code with the CompCert formally verified compiler.
In the MPPA workflow, the workers are the PEs associated with a memory bank.
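A single-producer, single-consumer channel of depth one, hidden behind SEND and RECV macros, could look like the following sketch. It uses C11 atomics and busy waiting on a generic shared-memory machine; the channel layout, the `channel_*` helper names and the integer payload are assumptions, and the actual MPPA implementation may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Depth-one channel: a single payload slot guarded by a full/empty flag. */
typedef struct {
    int value;           /* payload slot */
    atomic_bool full;    /* true while `value` holds unread data */
} channel;

static void channel_init(channel *ch)
{
    ch->value = 0;
    atomic_init(&ch->full, false);
}

/* Producer side: fails if the previous value has not been consumed yet. */
static bool channel_try_send(channel *ch, int v)
{
    if (atomic_load_explicit(&ch->full, memory_order_acquire))
        return false;
    ch->value = v;
    /* Release: the payload write becomes visible before the flag. */
    atomic_store_explicit(&ch->full, true, memory_order_release);
    return true;
}

/* Consumer side: fails if no data is available. */
static bool channel_try_recv(channel *ch, int *out)
{
    if (!atomic_load_explicit(&ch->full, memory_order_acquire))
        return false;
    *out = ch->value;
    /* Release: the payload read completes before the slot is freed. */
    atomic_store_explicit(&ch->full, false, memory_order_release);
    return true;
}

/* Blocking macros seen by the task code; they spin until success. */
#define SEND(ch, v)   do { while (!channel_try_send((ch), (v))) { } } while (0)
#define RECV(ch, out) do { while (!channel_try_recv((ch), (out))) { } } while (0)
```

Because each channel has exactly one producer and one consumer, a single flag is enough: only the producer sets it and only the consumer clears it, so no lock or compare-and-swap is needed.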
Timing verification follows the principles of the multi-core response time analysis (MRTA) framework (Davis et al. 2018). Starting from the task graph, its mapping to PEs, and given the worst-case execution time (WCET) of each task in isolation, the multi-core interference analysis (MIA) tool (Rihani et al. 2016; Dupont de Dinechin et al. 2020) refines the execution intervals of each task while updating its WCET for interference on the shared resources. The MIA tool relies on the property that the PEs, the memory hierarchy and the interconnects are timing-compositional. The refined release dates are used to activate a fast hardware release mechanism for each task. A task then executes when its input channels are data-ready (Figure 2.19).
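The idea of refining execution intervals until they stabilize can be conveyed by a toy fixed-point iteration. The sketch below is not the MIA algorithm: it assumes at most one predecessor per task, tasks given in topological order, and a deliberately crude interference model in which two tasks on different PEs whose intervals overlap delay each other by a fixed cost per potentially conflicting access. All names and the model itself are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int core;        /* PE the task is mapped to */
    int wcet;        /* WCET in isolation */
    int accesses;    /* shared-memory accesses issued by the task */
    int pred;        /* index of the single predecessor, or -1 */
    int release;     /* refined release date (output) */
    int response;    /* refined response time (output) */
} task;

static int min_int(int a, int b) { return a < b ? a : b; }

/* True if the execution intervals of a and b intersect. */
static bool overlap(const task *a, const task *b)
{
    return a->release < b->release + b->response &&
           b->release < a->release + a->response;
}

/* Refine release dates and response times until nothing changes.
   Values never decrease, so the result is a safe over-approximation.
   Returns false if no fixed point is reached within max_iter passes. */
static bool refine(task *t, int n, int delay, int max_iter)
{
    /* Start from the schedule without interference. */
    for (int i = 0; i < n; i++) {
        assert(t[i].pred < i);   /* topological order */
        t[i].response = t[i].wcet;
        t[i].release = t[i].pred < 0 ? 0
                     : t[t[i].pred].release + t[t[i].pred].response;
    }
    for (int iter = 0; iter < max_iter; iter++) {
        bool changed = false;
        for (int i = 0; i < n; i++) {
            task cand = t[i];
            if (cand.pred >= 0) {
                int ready = t[cand.pred].release + t[cand.pred].response;
                if (ready > cand.release) cand.release = ready;
            }
            int response = cand.wcet;
            for (int j = 0; j < n; j++)
                if (t[j].core != cand.core && overlap(&cand, &t[j]))
                    response += delay * min_int(cand.accesses, t[j].accesses);
            if (response > cand.response) cand.response = response;
            if (cand.release != t[i].release || cand.response != t[i].response) {
                t[i] = cand;
                changed = true;
            }
        }
        if (!changed) return true;
    }
    return false;
}
```

Such an iteration only yields sound bounds if the cost of each interference can be added to the isolated WCET, which is where the timing-compositionality property mentioned above comes in.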
Figure 2.19. MCG code generation of the MPPA processor
2.5.