8.4. Hardware approaches
8.5. Modeling and experimenting
8.6. Conclusion
8.7. References
8 PART 3: Interconnect and Interfaces
9 Network-on-Chip (NoC): The Technology that Enabled Multi-processor Systems-on-Chip (MPSoCs)
9.1. History: transition from buses and crossbars to NoCs
9.2. NoC configurability
9.3. System-level services
9.4. Hardware cache coherence
9.5. Future NoC technology developments
9.6. Summary and conclusion
9.7. References
10 Minimum Energy Computing via Supply and Threshold Voltage Scaling
10.1. Introduction
10.2. Standard-cell-based memory for minimum energy computing
10.3. Minimum energy point tracking
10.4. Conclusion
10.5. Acknowledgments
10.6. References
11 Maintaining Communication Consistency During Task Migrations in Heterogeneous Reconfigurable Devices
11.1. Introduction
11.2. Background
11.3. Related works
11.4. Proposed communication methodology in hardware context switching
11.5. Implementation of the communication management on reconfigurable computing architectures
11.6. Experimental results
11.7. Conclusion
11.8. References
11 Index
List of Tables
1 Chapter 1
Table 1.1. Input data rates and model complexities for example machine learning ...
Table 1.2. Supported kernels in the embARC MLI library
Table 1.3. Model parameters of the CIFAR-10 CNN graph
Table 1.4. Performance data for the CIFAR-10 CNN graph
2 Chapter 2
Table 2.1. Cyber-security requirements by application area
Table 2.2. Types of network-on-chip interconnects
Table 2.3. Types of VLIW architectures
3 Chapter 4
Table 4.1. Comparison of state-of-the-art CNN accelerators
Table 4.2. Synthesis results for different configurations of the ASIP
4 Chapter 5
Table 5.1. CRM operations’ latency
5 Chapter 7
Table 7.1. Setup details
6 Chapter 11
Table 11.1. Resource utilization of the communication wrapper in Basic, CS and C...
Table 11.2. Resource utilization of the communication wrapper in Basic, CS and C...
Table 11.3. Size of hardware task context for each benchmark application
Table 11.4. Comparison of total execution time (FPGA cycles) and its variation o...
Table 11.5. Comparison of average extraction and restoration times (FPGA cycles)...
Table 11.6. Comparison of average context switch latency (FPGA cycles) between C...
Table 11.7. Task migration time between A5SOC and ZC706
List of Illustrations
1 Chapter 1
Figure 1.1. Training and inference in machine learning
Figure 1.2. Different types of processing in machine learning inference applicat...
Figure 1.3. 2D convolution applying a weight kernel to input data to calculate a...
Figure 1.4. Example pooling operations: max pooling and average pooling
Figure 1.5. Two types of vector MAC instructions of the ARC EM9D processor
Figure 1.6. ARC EM9D processor with XY memory and address generation units
Figure 1.7. Assembly code generated from MLI C-code for a fully connected layer ...
Figure 1.8. Assembly code generated from MLI C-code for 2D convolution of 16-bit...
Figure 1.9. CNN graph of the CIFAR-10 example application
Figure 1.10. MLI code of the CIFAR-10 inference application
2 Chapter 2
Figure 2.1. Homogeneous multi-core processor (Firesmith 2017)
Figure 2.2. NVIDIA Fermi GPGPU architecture (Huang et al. 2013)
Figure 2.3. Operation of a Volta tensor core (NVIDIA 2020)
Figure 2.4. Numerical formats used in deep learning inference (adapted from Gust...
Figure 2.5. Autoware automated driving system functions (CNX 2019)
Figure 2.6. Application domains and partitions on the MPPA3 processor
Figure 2.7. Overview of the MPPA3 processor
Figure 2.8. Global interconnects of the MPPA3 processor
Figure 2.9. Local interconnects of the MPPA3 processor
Figure 2.10. VLIW core instruction pipeline
Figure 2.11. Tensor coprocessor data path
Figure 2.12. Load-scatter to a quadruple register operand
Figure 2.13. INT8.32 matrix multiply-accumulate operation
Figure 2.14. OpenCL NDRange execution using the SPMD mode
Figure 2.15. KaNN inference code generator workflow
Figure 2.16. Activation splitting across MPPA3 compute clusters
Figure 2.17. KaNN augmented computation graph
Figure 2.18. ROSACE harmonic multi-periodic case study (Graillat et al. 2018)
Figure 2.19. MCG code generation of the MPPA processor
3 Chapter 3
Figure 3.1. Plural many-core architecture. Many cores, hardware accelerators and...
Figure 3.2. Task state graph
Figure 3.3. Many-flow pipelining: (a) Task graph and single execution of an imag...
Figure 3.4. Core management table
Figure 3.5. Task management table
Figure 3.6. Core state graph
Figure 3.7. Allocation (top) and termination (bottom) algorithms
Figure 3.8. Plural run-time software. The kernel enables boot, initialization, t...
Figure 3.9. Event sequence performing stream input
Figure 3.10. Plural software development kit
Figure 3.11. Matrix multiplication code on the Plural architecture
Figure 3.12. Task graph for matrix multiplication
4 Chapter 4
Figure 4.1. Overview of the ASIP pipeline with its vector ALUs and register file...
Figure 4.2. On-chip memory subsystem with banked vector memories and an example ...
Figure 4.3. Cell area of the synthesized cores’ logic for different clock period...
Figure 4.4. 3x3 NoC based on the HERMES framework
Figure 4.5. Runtime over flit-width and port buffer size for two exemplary layer...
Figure 4.6. Runtime for different packet lengths
5 Chapter 5
Figure 5.1. Roofline: a visual performance model for multi-core architectures. A...
Figure 5.2. Proposed tile-based many-core architecture
Figure 5.3. Directory savings using the RBCC concept compared to global coherenc...
Figure 5.4. RBCC-malloc() example
Figure 5.5. Internal block diagram of the coherency region manager
Figure 5.6. Breakdown of the CRM’s resource utilization for increasing