alt="Schematic illustration of different types of processing in machine learning inference applications."/>
Figure 1.2. Different types of processing in machine learning inference applications
These devices typically perform some pre-processing and feature extraction on the sensor input data before performing the actual neural network processing for the trained model. For example, a smart speaker with voice control capabilities may first pre-process the voice signal by performing acoustic echo cancellation and multi-microphone beamforming. It may then apply FFTs to extract the spectral features that serve as input to the neural network, which has been trained to recognize a vocabulary of voice commands.
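As an illustration of the feature extraction step, the following minimal sketch computes the magnitude spectrum of one frame of samples. It uses a naive discrete Fourier transform in place of an FFT (an FFT produces the same values with fewer operations); the function name `magnitude_spectrum` and the 8-sample test frame are assumptions made for this example and are not taken from the text.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Return the magnitudes of the non-redundant DFT bins of a real frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, sample in enumerate(frame):
            acc += sample * cmath.exp(-2j * math.pi * k * t / n)
        spectrum.append(abs(acc))
    return spectrum


# Hypothetical frame: exactly one cosine cycle over 8 samples.
frame = [math.cos(2 * math.pi * t / 8) for t in range(8)]
features = magnitude_spectrum(frame)
```

For this frame, all of the energy falls into bin 1, so `features` is close to zero everywhere else. In a real device the frame would come from the pre-processed microphone signal.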
1.3.1.1. Neural network processing
For each layer in a neural network, the input data must be transformed into output data. A frequently used transformation is the convolution, which convolves, or, more precisely, correlates, the input data with a set of trained weights. This transformation is used in convolutional neural networks (CNNs), which are often applied in image or video recognition.
Figure 1.3 shows a 2D convolution, which performs a dot-product operation using the weights of a 2D weight kernel and a selected 2D region of the input data with the same width and height as the weight kernel. The dot product yields a value (M23) in the output map. In this example, no padding is applied on the borders of the input data, hence the coordinate (2, 3) for the output value. For computing the full output map, the weight kernel is “moved” over the input map and dot-product operations are performed for the selected 2D regions, producing an output value with each dot product. For example, M24 can be calculated by moving one step to the right and performing a dot product for the region with input samples A24–A26, A34–A36 and A44–A46.
Figure 1.3. 2D convolution applying a weight kernel to input data to calculate a value in the output map
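The 2D convolution described above can be sketched as follows. This is a minimal illustration, assuming a single input channel, a stride of one and no padding; the function name `conv2d_valid` and the input and kernel values are hypothetical. The indices in the code are zero-based, whereas the coordinates in the text start at one.

```python
def conv2d_valid(inp, kernel):
    """Correlate a 2D input map with a 2D kernel, without border padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(inp) - kh + 1
    out_w = len(inp[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            acc = 0
            # Dot product of the kernel with the selected 2D input region.
            for i in range(kh):
                for j in range(kw):
                    acc += inp[r + i][c + j] * kernel[i][j]  # one multiply-accumulate
            out[r][c] = acc
    return out


# Hypothetical 4x5 input map and 3x3 weight kernel.
inp = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
out = conv2d_valid(inp, kernel)
```

Because no padding is applied, the 4x5 input and the 3x3 kernel yield a 2x3 output map. Moving from `out[r][c]` to `out[r][c + 1]` corresponds to moving the kernel one step to the right.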
Input and output maps are often three-dimensional. That is, they have a width, a height and a depth, with different planes in the depth dimension typically referred to as channels. For input maps with a depth > 1, an output value can be calculated using a dot-product operation on input data from multiple input channels. For output maps with a depth > 1, a convolution must be performed for each output channel, using different weight kernels for different output channels. Depthwise convolution is a special convolution layer type for which the number of input and output channels is the same, with each output channel being calculated from the one input channel with the same depth value as the output channel. Yet another layer type is the fully connected layer, which performs a dot-product operation for each output value using the same number of weights as the number of samples in the input map.
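The three layer types differ mainly in which input samples and weights take part in each dot product. The sketch below makes this explicit under simplifying assumptions: no padding, a stride of one, maps stored as lists of 2D channel planes, and no bias terms. All function names and data values are hypothetical.

```python
def correlate2d(plane, kernel):
    """Correlate one 2D plane with one 2D kernel, without padding."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(plane[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(len(plane[0]) - kw + 1)
        ]
        for r in range(len(plane) - kh + 1)
    ]


def conv_layer(inp, weights):
    """inp[ci] is a 2D plane; weights[co][ci] is the kernel for that channel pair."""
    out = []
    for kernels in weights:  # a different kernel set for each output channel
        planes = [correlate2d(p, k) for p, k in zip(inp, kernels)]
        # Sum the contributions of all input channels element by element.
        out.append([[sum(vals) for vals in zip(*rows)] for rows in zip(*planes)])
    return out


def depthwise_layer(inp, kernels):
    """Output channel d is computed from input channel d only."""
    return [correlate2d(p, k) for p, k in zip(inp, kernels)]


def fully_connected(inp_flat, weights):
    """weights[o] holds one weight per sample of the flattened input map."""
    return [sum(x * w for x, w in zip(inp_flat, row)) for row in weights]


# Hypothetical input map with two 2x2 channels and 2x2 kernels of ones.
inp = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
ones = [[1, 1], [1, 1]]

conv_out = conv_layer(inp, [[ones, ones]])     # one output channel
dw_out = depthwise_layer(inp, [ones, ones])    # two output channels
fc_out = fully_connected([1, 2, 3, 4], [[1, 0, 0, 1], [0, 1, 1, 0]])
```

In all three cases the inner computation is a dot product; only the selection of operands changes.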
The key operation in the layer types described above is the dot-product operation on input samples and weights. It is therefore a requirement for a processor to implement such dot-product operations efficiently. This involves efficient computation, for example, using MAC instructions, as well as efficient access to input data, weight kernels and output data.
CNNs are feed-forward neural networks. When a layer processes an input map, it maintains no state that impacts the processing of the next input map. Recurrent neural networks (RNNs) are a different kind of neural network: they maintain state while processing sequences of inputs. As a result, RNNs can also recognize patterns across time, and they are often applied in text and speech recognition applications.
There are many different types of RNN cells from which a network can be built. In its basic form, an RNN cell calculates an output as shown in equation [1.1]:
$$h_t = f(W_x x_t + W_h h_{t-1} + b) \qquad [1.1]$$
where $x_t$ is frame $t$ in the input sequence, $h_t$ is the output for $x_t$, $W_x$ and $W_h$ are weight sets, $b$ is a bias and $f()$ is an output activation function. Thus, the calculation of an output involves a dot product of one set of weights with new input data and another dot product of another set of weights with the previous output data. Therefore, also for RNNs, the dot product is a key operation that must be implemented efficiently. The long short-term memory (LSTM) cell is another well-known RNN cell. The LSTM cell has a more complicated structure than the basic RNN cell described above, but the dot product is again a dominant operation.
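A basic RNN cell of this kind can be sketched as follows. The code assumes that the weighted new input, the weighted previous output and the bias are summed before the activation function is applied, which matches the description of equation [1.1]. The choice of tanh as the activation function, the function name `rnn_cell` and all weight values are assumptions made for this example.

```python
import math


def rnn_cell(x_t, h_prev, w_x, w_h, b, f=math.tanh):
    """Compute one time step: h_t = f(W_x x_t + W_h h_prev + b)."""
    h_t = []
    for row_x, row_h, bias in zip(w_x, w_h, b):
        acc = bias
        acc += sum(w * x for w, x in zip(row_x, x_t))     # dot product with new input
        acc += sum(w * h for w, h in zip(row_h, h_prev))  # dot product with previous output
        h_t.append(f(acc))
    return h_t


# Hypothetical sizes: two samples per input frame, two outputs.
w_x = [[0.5, -0.5], [1.0, 0.0]]
w_h = [[0.0, 1.0], [1.0, 0.0]]
b = [0.0, 0.0]

h = [0.0, 0.0]  # initial state
for x_t in [[1.0, 1.0], [0.0, 0.0]]:
    h = rnn_cell(x_t, h, w_x, w_h, b)
```

The second frame is all zeros, yet the output is not: the state carried over from the first frame still influences the result, which is what allows an RNN to recognize patterns across time.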
Activation functions are used in neural networks to transform data by performing some nonlinear mapping. Examples are rectified linear units (ReLU), sigmoid and hyperbolic tangent (TanH). The activation functions operate on a single data value and produce a single result. Hence, for an activation layer, the size of the output map is equal to the size of the input map.
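The element-wise nature of an activation layer can be illustrated with a short sketch. The function names and the input values are hypothetical; the three activation functions follow their standard definitions.

```python
import math


def relu(x):
    """Rectified linear unit: pass positive values, clamp the rest to zero."""
    return x if x > 0 else 0.0


def sigmoid(x):
    """Logistic sigmoid, mapping any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def activation_layer(inp_map, f):
    """Apply a scalar activation function f to every element of a 2D map."""
    return [[f(v) for v in row] for row in inp_map]


inp_map = [[-1.0, 2.0], [0.5, -3.0]]
relu_out = activation_layer(inp_map, relu)
sigmoid_out = activation_layer(inp_map, sigmoid)
tanh_out = activation_layer(inp_map, math.tanh)
```

Each output map has the same dimensions as `inp_map`, since every input value produces exactly one output value.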
Figure 1.4. Example pooling operations: max pooling and average pooling
Neural networks may also have pooling layers that transform an input map into a smaller output map by calculating single output values for (small) regions of the input data. Figure 1.4 shows two examples: max pooling and average pooling. Effectively, the pooling layers downsample the data in the width and height dimensions. The depth of the output map is the same as the depth of the input map.
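Both pooling variants can be expressed with one routine that differs only in the reduction applied to each region. The sketch below assumes non-overlapping square regions (a stride equal to the region size) and a single channel; for a map with depth > 1, the routine would be applied to each channel separately. The function names and data values are hypothetical and do not reproduce the values of Figure 1.4.

```python
def pool2d(plane, size, reduce_fn):
    """Apply reduce_fn to non-overlapping size x size regions of a 2D plane."""
    return [
        [
            reduce_fn([plane[r + i][c + j]
                       for i in range(size) for j in range(size)])
            for c in range(0, len(plane[0]) - size + 1, size)
        ]
        for r in range(0, len(plane) - size + 1, size)
    ]


def average(values):
    """Arithmetic mean of a list of values."""
    return sum(values) / len(values)


plane = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]
max_pooled = pool2d(plane, 2, max)
avg_pooled = pool2d(plane, 2, average)
```

With 2x2 regions, the 4x4 input plane is downsampled to a 2x2 output plane in both cases.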
1.3.1.2. Implementation requirements
To obtain sufficiently accurate results when implementing machine learning inference, appropriate data types must be used. During the training phase, data and weights are typically represented as 32-bit floating-point data. The common practice for deploying models for inference in embedded systems is to work with integer or fixed-point representations of quantized data and weights (Jacob et al. 2017). The potential impact of quantization errors can be taken into account in the training phase to avoid a notable effect on the model performance. Elements of input maps, output maps and weight kernels can typically be represented using 16-bit or smaller data types. For example, in a voice application, data samples are typically represented by 16-bit data types. In image or video applications, a 12-bit or smaller data type is often sufficient for representing the input samples. Precision requirements can differ per layer of the neural network. Special attention should be paid to the data types of the accumulators that are used to sum up the partial results when performing dot-product operations. These accumulators should be wide enough to avoid overflow of the (intermediate) results of the dot-product operations performed on the (quantized) weights and input samples.
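The accumulator requirement can be made concrete with a small worst-case calculation. The sketch below assumes signed two's complement data and weights, and it uses a simple symmetric quantization with a single scale factor, which is chosen here for brevity and is not necessarily the scheme of Jacob et al. The function names and example values are hypothetical.

```python
def quantize(values, scale, bits=8):
    """Map real values to signed integers of the given width, with saturation."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return [max(lo, min(hi, round(v / scale))) for v in values]


def accumulator_bits(num_products, data_bits, weight_bits):
    """Smallest signed accumulator width that cannot overflow in the worst case."""
    # Largest product magnitude: most negative data value times most negative weight.
    worst_product = (1 << (data_bits - 1)) * (1 << (weight_bits - 1))
    worst_sum = num_products * worst_product
    return worst_sum.bit_length() + 1  # +1 for the sign bit


weights_q = quantize([0.5, -0.25, 2.0], scale=1 / 64)  # 2.0 saturates at 127

bits_3x3 = accumulator_bits(9, 8, 8)        # 3x3 kernel, 8-bit data and weights
bits_voice = accumulator_bits(256, 16, 8)   # 256 products, 16-bit data, 8-bit weights
```

Under these assumptions, a 3x3 kernel with 8-bit operands needs a 19-bit accumulator, while 256 products of 16-bit samples and 8-bit weights need a full 32 bits. The worst case is rarely reached in practice, so this is a safe upper bound rather than a typical requirement.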
Memory requirements for low/mid-end machine learning inference are typically modest, thanks to limited input data rates and the use of neural networks with limited complexity. Input and output maps are often of limited size, i.e. a few tens of kB or less, and the number and size of the weight kernels are also relatively small. Using the smallest possible data types for input maps, output maps and weight kernels helps to reduce memory requirements.
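The effect of the data type on the memory footprint can be estimated with simple arithmetic. The sketch below assumes that each element is stored in a whole number of bytes; the function name and the 32 x 32 x 8 map dimensions are hypothetical.

```python
def map_bytes(width, height, depth, bits_per_element):
    """Bytes needed to store a map when each element occupies whole bytes."""
    bytes_per_element = (bits_per_element + 7) // 8
    return width * height * depth * bytes_per_element


# Hypothetical 32 x 32 map with 8 channels, stored with different data types.
size_float32 = map_bytes(32, 32, 8, 32)
size_int16 = map_bytes(32, 32, 8, 16)
size_int8 = map_bytes(32, 32, 8, 8)
```

For this example, moving from 32-bit floating-point to 8-bit integer elements reduces the map from 32 kB to 8 kB.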
In summary, low/mid-end machine learning inference applications require the following types of processing:
– various types of pre-processing and feature extraction, often with DSP-intensive computations;
– neural network processing, with the dot-product operation as a dominant computation and regular access patterns on multidimensional data. Additional requirements come from the use of scalar activation functions and pooling operations working on 2D data;
– decision-making, which is performed after the neural network processing,