Liliana Andrade

Multi-Processor System-on-Chip 1



by specific functions that execute critical code segments. The execution of such code segments may be accelerated dramatically by adding a few custom instructions. A further benefit of using these custom instructions is that the code size is reduced.

      Both configurability and extensibility need to be used at design time. This must be supported by a tool chain (i.e. compiler, simulator, debugger) that is automatically enhanced to support the selected configuration and the added custom instructions. For example, the compiler must generate optimal code for the selected configuration while supporting programmers in using the custom instructions. Similarly, simulation models must support the selected configuration and include the custom instructions. If done properly, large performance gains can be achieved while optimizing area, power and code size, with a minimal impact on design time.

      As an example of extensibility, we consider Viterbi decoding, which is a prominent function in an NB-IoT protocol stack for performing forward error correction (FEC) in the receiver. When using a straightforward software implementation on an off-the-shelf processor, this kernel becomes one of the most computationally intensive parts of an NB-IoT modem. Viterbi or similar FEC schemes are used in many communication technologies, especially in the IoT field, and are often a bottleneck in modem design.

      In (Petrov-Savchenko and van der Wolf 2018), a processor extension for Viterbi decoding is presented using four custom instructions, which enhance the performance to just a few cycles per decoded bit. The instructions include a reset instruction, two instructions to calculate the path metrics and one instruction for the traceback. The instructions can be conveniently used as intrinsic instructions in the C source code. The resulting implementation reduces the worst-case MHz requirements for the Viterbi decoding function in an NB-IoT protocol stack to less than 1 MHz.
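
      To make the structure of such a decoder concrete, the following Python sketch shows a plain software Viterbi decoder split into the same kinds of steps: a metric reset, a path-metric update (add-compare-select) and a traceback. It is a minimal illustration under stated assumptions, not the implementation from the cited paper: the toy code (rate 1/2, constraint length 3, generator polynomials 7 and 5 in octal), the hard-decision inputs and all function names are hypothetical choices made for brevity, and the code differs from the one used in NB-IoT.

```python
# Toy hard-decision Viterbi decoder. Assumed parameters (not the NB-IoT code):
# rate 1/2, constraint length K = 3, generator polynomials 7 and 5 (octal).
G = (0b111, 0b101)
K = 3
N_STATES = 1 << (K - 1)
INF = float("inf")


def parity(x):
    """Return the XOR of all bits of x."""
    return bin(x).count("1") & 1


def encode(bits):
    """Convolutionally encode a bit list, starting from state 0."""
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state        # newest bit sits in the MSB
        out.extend(parity(reg & g) for g in G)
        state = reg >> 1                    # drop the oldest bit
    return out


def reset_metrics():
    """Start decoding in state 0: every other state is unreachable."""
    return [0] + [INF] * (N_STATES - 1)


def acs_step(metrics, received_pair):
    """Add-compare-select: update all path metrics for one received pair."""
    new_metrics = [INF] * N_STATES
    decisions = [0] * N_STATES              # surviving predecessor per state
    for state in range(N_STATES):
        for b in (0, 1):
            reg = (b << (K - 1)) | state
            nxt = reg >> 1
            expected = [parity(reg & g) for g in G]
            branch = sum(e != r for e, r in zip(expected, received_pair))
            candidate = metrics[state] + branch
            if candidate < new_metrics[nxt]:
                new_metrics[nxt] = candidate
                decisions[nxt] = state
    return new_metrics, decisions


def traceback(history, final_state):
    """Walk the stored decisions backwards to recover the input bits."""
    bits, state = [], final_state
    for decisions in reversed(history):
        bits.append(state >> (K - 2))       # input bit is the MSB of the state
        state = decisions[state]
    return bits[::-1]


def viterbi_decode(received):
    """Decode a flat list of received bits (two per input bit)."""
    metrics, history = reset_metrics(), []
    for i in range(0, len(received), 2):
        metrics, decisions = acs_step(metrics, received[i:i + 2])
        history.append(decisions)
    best = min(range(N_STATES), key=metrics.__getitem__)
    return traceback(history, best)
```

      In this sketch, the nested loop in acs_step runs for every state and every received symbol, so it accounts for most of the work in a plain software decoder. This is consistent with the approach described above, where two of the four custom instructions are dedicated to the path metrics.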

      In this section, we investigate in detail the requirements and processor capabilities for efficient machine learning in low-power IoT edge devices. The common theme in machine learning is the use of algorithms that can learn without being explicitly programmed (Samuel 1959). As shown in Figure 1.1, in machine learning, we distinguish between training and inference.

      Figure 1.1. Training and inference in machine learning

      Training starts with an untrained model, for example a multi-layered neural network with a chosen graph structure. In these neural networks, each layer transforms input data into output data while applying sets of coefficients or weights. Using a machine learning framework such as Caffe or TensorFlow, the model is trained using a large training dataset. The result of the training is a trained model, for example, a neural network with its weights tuned for classifying input data into certain categories. Such categories may, for example, be the different types of human activity in the activity tracker device mentioned above.
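
      The layer-by-layer transformation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a tiny two-layer fully connected network with a ReLU activation between the layers; the weights in LAYERS are hand-picked hypothetical values rather than the result of training with Caffe or TensorFlow, and the function names are our own.

```python
def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def infer(sample, layers):
    """Run a sample through all layers and return the winning category index."""
    data = sample
    for i, (weights, biases) in enumerate(layers):
        data = dense(data, weights, biases)
        if i < len(layers) - 1:             # no activation after the last layer
            data = relu(data)
    return max(range(len(data)), key=data.__getitem__)


# Hypothetical "trained" model: two layers, each a (weights, biases) pair.
LAYERS = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
]

category = infer([3.0, 1.0], LAYERS)        # index of the largest output
```

      Here the weights are fixed and the code only applies them to a sample. Training would be the separate, much heavier step of finding suitable values for LAYERS from a dataset.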

      The processing requirements of machine learning inference can vary widely for different applications. Some key factors impacting the processing requirements are:

       – input data rate: this is the rate at which data samples are captured by the sensor(s). These samples can, for example, be pixels coming from a camera or pulse-code modulation (PCM) samples coming from a microphone. The input data rate can range from tens of samples per second, for example, for human activity recognition with a small number of sensors, to hundreds of millions of samples per second, for advanced computer vision with a high-resolution camera capturing images at a high frame rate;

       – complexity of the trained model: this defines the number of operations to be performed for a set of samples (e.g. an input image) upon inference. For example, in the case of neural networks, the complexity depends on the number of layers in a graph, the sizes of the (multidimensional) input and output maps for each layer, and the number of weights to be applied in the calculation of the output maps. A low-complexity neural network has fewer than 10 layers, while a high-complexity neural network can have tens of layers (Szegedy et al. 2015).
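
      As a rough worked example of how model complexity translates into a processing requirement, the sketch below counts multiply-accumulate (MAC) operations per inference and multiplies them by the inference rate. The layer shapes and the rate of 10 inferences per second are hypothetical values chosen for illustration, and the count ignores biases, activations and pooling.

```python
def conv_macs(out_h, out_w, out_c, k_h, k_w, in_c):
    """MACs for a 2D convolution: one k_h x k_w x in_c dot product per output."""
    return out_h * out_w * out_c * k_h * k_w * in_c


def dense_macs(n_in, n_out):
    """MACs for a fully connected layer: one weight per input/output pair."""
    return n_in * n_out


def macs_per_second(macs_per_inference, inferences_per_second):
    """Sustained MAC rate needed to keep up with the input."""
    return macs_per_inference * inferences_per_second


# Hypothetical two-layer network: a 3x3 convolution producing a 32x32x8
# output map from 3 input channels, followed by a 512-to-10 dense layer.
per_inference = conv_macs(32, 32, 8, 3, 3, 3) + dense_macs(512, 10)
required_rate = macs_per_second(per_inference, 10)

print(per_inference)   # 226304 MACs per inference
print(required_rate)   # 2263040 MACs per second
```

      Both factors multiply: a model that is ten times more complex, or an input that requires ten times more inferences per second, raises the required MAC rate by the same factor.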

      Table 1.1 shows input data rates and model complexities for several example machine learning applications.

      Table 1.1. Input data rates and model complexities for example machine learning applications

Machine learning application | Input data rate | Complexity of trained model
Human activity recognition | 10s Hz (few sensors) | Low to medium
Voice control | 10s kHz (e.g. 16 kHz) | Low to medium
Face detection | 100s kHz (low resolution and frame rate) | Low to medium
Advanced computer vision | 100s MHz (high resolution and frame rate) | High
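
      The orders of magnitude in Table 1.1 can be checked with simple arithmetic. The sketch below computes samples per second for four sensor setups; the specific configurations (a 3-axis accelerometer at 25 Hz, a 16 kHz mono microphone, a 160 x 120 grayscale camera at 10 frames per second and a 1920 x 1080 RGB camera at 60 frames per second) are assumptions chosen to land in the ranges given in the table, not values taken from the source.

```python
def sensor_rate(channels, sample_rate_hz):
    """Samples per second for a sensor with several channels."""
    return channels * sample_rate_hz


def camera_rate(width, height, channels, frames_per_second):
    """Samples per second for a camera, counting each color value as a sample."""
    return width * height * channels * frames_per_second


# Hypothetical configurations, one per row of the table.
rates = {
    "Human activity recognition": sensor_rate(3, 25),
    "Voice control": sensor_rate(1, 16_000),
    "Face detection": camera_rate(160, 120, 1, 10),
    "Advanced computer vision": camera_rate(1920, 1080, 3, 60),
}

for application, rate in rates.items():
    print(f"{application}: {rate} samples/s")
```

      Under these assumptions, the rates come out at 75, 16,000, 192,000 and 373,248,000 samples per second, which spans nearly seven orders of magnitude between the first and the last row.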

      In the next section, we further detail the requirements for the efficient implementation of low/mid-end machine learning inference. Using the DSP-enhanced DesignWare® ARC® EM9D processor as an example, we discuss the features and capabilities of a programmable processor that enable efficient execution of computations often used in machine learning inference. We further present an extensive library of software functions that can be used to quickly implement inference engines for a wide array of machine learning applications. Benchmarks are presented to demonstrate the efficiency of machine learning inference implemented using this software library that executes on the ARC EM9D processor.

      1.3.1. Requirements for low/mid-end machine learning inference

      IoT edge devices that use machine learning inference typically perform different types of processing, as shown in Figure 1.2.