Liliana Andrade

Multi-Processor System-on-Chip 1



      For many IoT edge devices, low cost is a key requirement, so adding machine learning inference to make them smarter must be cost-effective. Silicon area is the main contributor to cost, particularly for high-volume products, so it is important that the processor implementing the machine learning inference minimizes logic area and uses small memories. Small code size is likewise key to limiting the area of the instruction memory.

      Many IoT edge devices are battery-operated and have a tight power budget. This demands a power-efficient processor, with efficiency measured in µW/MHz, as well as excellent cycle efficiency so that the processor can run at a low frequency. Low power consumption is particularly important for IoT edge devices that perform always-on functions such as:

       – smart speakers, smartphones and similar devices with always-on voice command functions that are “always listening”;

       – camera-based devices that are “always watching”, performing, for example, face detection or gesture recognition;

       – health and fitness monitoring devices that are “always sensing”.

      Such devices typically apply smart techniques to reduce power consumption. For example, an “always listening” device may sample the microphone signal and use simple voice detection techniques to check whether anyone is speaking at all. It then applies the more compute-intensive machine learning inference for recognizing voice commands only when voice activity is detected. A processor must limit power consumption in each of these different states, i.e. voice detection and voice command recognition. For this purpose, it must offer various power management features, including effective sleep modes and power-down modes.
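      The two-state scheme described above can be sketched as follows. This is a minimal illustration rather than a production voice activity detector: it assumes a simple frame-energy threshold as the cheap detection step, and the names `frame_energy`, `gated_inference`, `recognize` and `energy_threshold` are hypothetical and do not come from any particular library.

```python
from typing import Callable, Iterable, List, Sequence


def frame_energy(frame: Sequence[int]) -> float:
    """Mean squared amplitude of one frame of PCM samples (cheap check)."""
    return sum(s * s for s in frame) / len(frame)


def gated_inference(
    frames: Iterable[Sequence[int]],
    recognize: Callable[[Sequence[int]], str],
    energy_threshold: float,
) -> List[str]:
    """Run the costly recognizer only on frames that pass the cheap check.

    `recognize` stands in for the machine learning inference step and
    `energy_threshold` is an assumed tuning parameter.
    """
    results = []
    for frame in frames:
        if frame_energy(frame) < energy_threshold:
            # Voice detection state: only the cheap detector has run.
            continue
        # Voice command recognition state: full inference is applied.
        results.append(recognize(frame))
    return results
```

      In such a scheme the processor spends most of its time in the cheap detection state, which is why power consumption in that state matters as much as the efficiency of the inference itself.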

      1.3.2. Processor capabilities for low-power machine learning inference

      Selecting the right processor is key to achieving high efficiency for the implementation of low/mid-end machine learning inference. In this section, we will describe a number of key capabilities of the DSP-enhanced ARC EM9D processor and illustrate how they can be used to implement neural network processing efficiently.

      Figure 1.5. Two types of vector MAC instructions of the ARC EM9D processor

      Both of these vector MAC instructions operate on 2x16-bit vector operands. The DMAC instruction, shown on the left in Figure 1.5, is a dual-MAC that can be used to implement a dot product, with A1 and A2 being two neighboring samples from the input map and B1 and B2 being two neighboring weights from the weight kernel. The ARC EM9D processor supports 32-bit accumulators, for which an additional eight guard bits can be enabled to avoid overflow. The DMAC operation is effective for weight kernels with an even width, halving the number of MAC instructions compared to a scalar implementation. For weight kernels with an odd width, however, this instruction is less effective. In such cases, the VMAC instruction, shown on the right in Figure 1.5, can be used to perform two dot-product operations in parallel, accumulating intermediate results into two accumulators. When the weight kernel “moves” over the input map with a stride of one, A1 and A2 are two neighboring samples from the input map, and B1 and B2 both hold the same weight, which is applied to both A1 and A2.
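      The behavior of the two instructions can be illustrated with a small functional model. This is a sketch under stated assumptions, not the processor's actual instruction semantics: the function names are hypothetical, the accumulator is modeled as 40 bits wide (32 bits plus the eight guard bits), and wrap-around on overflow is an assumption of the model, since the text does not specify overflow behavior.

```python
ACC_BITS = 40  # 32-bit accumulator plus eight guard bits (modeled width)


def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer to a two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def dmac(acc, a, b):
    """Dual MAC: one accumulator, acc += a[0]*b[0] + a[1]*b[1]."""
    return wrap_signed(acc + a[0] * b[0] + a[1] * b[1], ACC_BITS)


def vmac(accs, a, b):
    """Vector MAC: two accumulators, accs[i] += a[i]*b[i] for i in 0, 1."""
    return (
        wrap_signed(accs[0] + a[0] * b[0], ACC_BITS),
        wrap_signed(accs[1] + a[1] * b[1], ACC_BITS),
    )


def conv1d_dmac(x, w, out_index):
    """One output sample for an even-width kernel, two taps per DMAC."""
    assert len(w) % 2 == 0, "DMAC pairing needs an even kernel width"
    acc = 0
    for k in range(0, len(w), 2):
        samples = (x[out_index + k], x[out_index + k + 1])
        acc = dmac(acc, samples, (w[k], w[k + 1]))
    return acc


def conv1d_vmac(x, w, out_index):
    """Two neighboring output samples (stride one), one tap per VMAC."""
    accs = (0, 0)
    for k in range(len(w)):
        samples = (x[out_index + k], x[out_index + k + 1])
        accs = vmac(accs, samples, (w[k], w[k]))  # same weight in both lanes
    return accs
```

      In this model, a kernel of width four needs two DMAC steps per output sample, whereas a kernel of width three needs three VMAC steps to produce two output samples, so both cases process two multiply-accumulates per instruction.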


      Figure 1.6. ARC EM9D processor with XY memory and address generation units

      The AGUs support the following features relevant to machine learning inference:

       – multiple modifiers per address pointer, which allow different schemes for address pointer updates to be prescribed and used. For example, a 2D access pattern can be supported by having one modifier prescribing a small horizontal stride within a row in the input map and another modifier prescribing a large stride to move the pointer to the next row in the input map;

       – data size conversions, which allow, for example, 2x8-bit data to be expanded on the fly for use as a 2x16-bit vector operand. No extra instructions for unpacking and sign extension are required;

       – replications, which allow data values to be replicated on the fly into vectors. For example, a single weight value may be replicated into a 2x16-bit vector for use in the VMAC instruction as discussed above.
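      The three AGU features listed above can be mimicked in software to show what the hardware does implicitly on each access. The following is a minimal sketch with hypothetical function names; the lane order used for the packed 8-bit values (low byte first) is an assumption made for illustration only.

```python
def window_offsets(base, width, height, row_pitch):
    """Offsets visited when a pointer walks a 2D window with two modifiers.

    Modifier 0 applies a small horizontal stride within a row; modifier 1
    applies a large stride that moves the pointer to the start of the
    next row. `row_pitch` is the width of the full input map.
    """
    step_in_row = 1
    step_next_row = row_pitch - (width - 1)
    ptr = base
    offsets = []
    for _ in range(height):
        for col in range(width):
            offsets.append(ptr)
            # The modifier is applied as part of the access itself.
            ptr += step_in_row if col < width - 1 else step_next_row
    return offsets


def sign_extend_8(byte):
    """Interpret an 8-bit pattern as a signed value."""
    return byte - 256 if byte & 0x80 else byte


def expand_2x8(word16):
    """Expand two packed signed 8-bit values into two 16-bit lane values.

    Assumed lane order: low byte is lane 0, high byte is lane 1.
    """
    return (sign_extend_8(word16 & 0xFF), sign_extend_8((word16 >> 8) & 0xFF))


def replicate(value):
    """Replicate a single value into both lanes of a 2-element vector."""
    return (value, value)
```

      For example, `window_offsets(1, 3, 2, 5)` walks a 3x2 window in an input map that is five samples wide. On the processor, none of these steps would cost an instruction, since they are performed as part of the memory access.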

      In summary, the use of XY memory and AGUs enables very efficient code as no instructions are needed to load and store data, perform pointer math, or convert and rearrange data. All of these are performed implicitly while accessing data through the AGUs, with up to three memory accesses per cycle. In the next section, we present code examples that illustrate the use of the processor’s XY memory and AGUs for machine learning inference.

      1.3.3. A software library for machine learning inference

      After selecting the right processor, the next question is how to arrive at an efficient