internal boot ROM
through a private bus. The purpose of the secure zone is to host a trusted execution environment (TEE), and a run-time system that performs on-demand code decryption for the applications that require it (Table 2.1). The non-secure zone contains 16 application cores (PE) that share a 16-banked local memory (SMEM), a DMA engine, a debug support unit (DSU), a cryptographic accelerator and cluster-local peripheral control registers. The SMEM can be configured to split its 4 MB capacity between scratch-pad memory (SPM) and level-2 cache (L2$). The non-secure zone of the MPPA3 compute cluster supports two types of execution environments:
– A symmetric multi-processing (SMP) environment, exposed through the standard POSIX multi-threading (supporting the OpenMP run-time of C/C++ compilers) and file system APIs. In this environment targeting high-performance computing under soft real-time constraints, all the core L1 data caches are kept coherent, and the local memory is interleaved across the banks at 64-byte granularity for the SPM and at 256-byte granularity for the L2$.
– An asymmetric multi-processing (AMP) environment, seen as a collection of 16 cores where each executes under an RTOS and is associated through linker maps with one particular bank (256KB) of the local memory. In this environment targeting high-integrity computing under hard real-time constraints, L1 cache coherence is disabled, and the local memory is configured as scratch-pad memory only.
By default, the L2 caches of the compute clusters are not kept coherent, although this can be enabled by running cache controller firmware in the RM cores and maintaining a distributed directory in the cluster SPMs. The third level of the memory hierarchy is composed of the external DDR memory (2x DDR4/LPDDR4 64-bit channels), and of the SPM of other compute clusters. The DDR memory channels can be interleaved or separated in machine address space, in the latter case operating independently. The standard C11 atomic operations are available in all memory spaces.
Figure 2.9. Local interconnects of the MPPA3 processor
2.3.3. VLIW core
The MPPA cores implement a 64-bit VLIW architecture, which is an effective way to design instruction-level parallel cores targeting numerical, signal and image processing (Fisher et al. 2005). The VLIW core has six issue lanes (Figure 2.10) that, respectively, feed a branch and control unit (BCU), two 64-bit ALUs, a 128-bit FPU, a 256-bit load–store unit (LSU) and a deep learning coprocessor. Each VLIW core has private L1 instruction and data caches, both 16 KB and four-way set associated with LRU replacement policy. All load instructions also have an L1 cache-bypass variant for direct access to the cluster SPM or L2$. These instructions improve the performance of codes with non-temporal memory access patterns, and also increase the accuracy of static analysis for computing worst-case execution time (WCET) bounds. The implementation of this VLIW core and its caches ensure that the resulting processing element is timing-compositional, a critical property with regard to computing accurate bounds on worst-case response times (WCRT) (Kästner et al. 2013).
Figure 2.10. VLIW core instruction pipeline
Based on previous compiler design experience with different types of VLIW architectures (Dupont de Dinechin et al. 2000, 2004), a Fisher-style VLIW architecture has been selected, rather than an EPIC-style VLIW architecture (Table 2.3). The main features of the Kalray VLIW architecture are as follows: – Partial predication: fully predicated architectures are expensive with regard to instruction encoding, while control speculation of arithmetic instructions performs better than if-conversion when applicable. Moreover, conditional SELECT operations are equivalent to CMOV operations with operand renaming constraints (Dupont de Dinechin 2014). Then, if-conversion only needs to be supported by conditional load/store and CMOV instructions on scalar and vector operands.
Table 2.3. Types of VLIW architectures
Classical VLIW architecture | EPIC VLIW architecture |
SELECT operations on Boolean operand | Fully predicated ISA |
Conditional load/store/floating-point operations | Advanced loads (data speculation) |
Dismissible loads (control speculation) | Speculative loads (control speculation) |
Clustered register files and function units | Polycyclic/multiconnect register files |
Multi-way conditional branches | Rotating registers |
Compiler techniques | |
Trace scheduling | Modulo scheduling |
Partial predication | Full predication |
Main examples | |
Multiflow TRACE processors | Cydrome Cydra-5 |
HP Labs Lx / STMicroelectronics ST200 | HP-Intel IA64 |
Philips TriMedia | Texas Instruments VelociTI |
– Dismissible loads: these instructions enable control speculation of load instructions by suppressing exceptions on address errors, and by ensuring that no side-effects occur in the I/O areas. Additional configuration in the MMU refine their behavior on protection and no-mapping exceptions.
– No rotating registers: rotating registers rename temporary variables defined inside software pipelines, whose schedule is built while ignoring register antidependences. However, rotating registers add significant ISA and implementation complexity, while temporary variable renaming can be done by the compiler.
– Widened memory access: widening the memory accesses on a single port is simpler to implement than multiple memory ports, especially when memory address translation is implied. This simplification enables, in turn, the support of misaligned memory accesses, which significantly improves compiler vectorization opportunities.
– Unification of the scalar and SIMD data paths around a main register file of 64×64-bit registers, for the same motivations as the POWER vector-scalar architecture (Gschwind 2016). Operands for the SIMD instructions map to register pairs (16 bytes) or to register quadruples (32 bytes).
2.3.4. Coprocessor
On the MPPA3 processor, each VLIW core is paired with a tightly coupled coprocessor for the mixed-precision matrix multiply-accumulate operations of deep learning operators (Figure 2.11). The coprocessor operates on a dedicated data path that includes a 48×256-bit register file. Within the six-issue VLIW core architecture, one issue lane is dedicated to the coprocessor arithmetic instructions, while the branch and control unit (BCU) may also execute data transfer operations between the coprocessor registers and the VLIW core general-purpose registers. Finally, the coprocessor leverages the 256-bit LSU of the VLIW core to transfer data blocks from/to the SMEM, at the rate of 32 bytes per clock cycle. It then uses these 32-byte data blocks as left and right operands