spatial height/width
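Given the shape parameters in Table 2.1, the computation of a CONV layer is defined as:

\[
\begin{aligned}
o[n][m][p][q] &= \left( \sum_{c=0}^{C-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} i[n][c][Up+r][Uq+s] \times f[m][c][r][s] \right) + b[m], \\
& 0 \le n < N,\quad 0 \le m < M,\quad 0 \le p < P,\quad 0 \le q < Q, \\
& P = (H - R + U)/U,\quad Q = (W - S + U)/U.
\end{aligned}
\tag{2.1}
\]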
o, i, f, and b are the tensors of the ofmaps, ifmaps, filters, and biases, respectively. U is a given stride size.
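As a minimal sketch of Equation (2.1), the loop nest below computes a CONV layer directly from its definition. The function name conv_layer, the NumPy array layouts (N, C, H, W) for ifmaps and (M, C, R, S) for filters, and the use of floor division for P and Q are assumptions made for illustration; N, C, and M are taken to be the batch size, channel count, and filter count from Table 2.1. It is written for clarity, not speed.

```python
import numpy as np

def conv_layer(i, f, b, U=1):
    """Naive CONV layer: seven nested loops, one per index in Equation (2.1).

    i: ifmaps,  assumed shape (N, C, H, W)
    f: filters, assumed shape (M, C, R, S)
    b: biases,  assumed shape (M,)
    U: stride
    """
    N, C, H, W = i.shape
    M, C_f, R, S = f.shape
    assert C == C_f, "ifmap and filter channel counts must match"
    P = (H - R + U) // U  # ofmap height
    Q = (W - S + U) // U  # ofmap width
    o = np.zeros((N, M, P, Q), dtype=np.result_type(i, f, b))
    for n in range(N):
        for m in range(M):
            for p in range(P):
                for q in range(Q):
                    acc = 0.0  # partial sum for output point (n, m, p, q)
                    for c in range(C):
                        for r in range(R):
                            for s in range(S):
                                acc += i[n, c, U * p + r, U * q + s] * f[m, c, r, s]
                    o[n, m, p, q] = acc + b[m]
    return o

# One 4x4 single-channel ifmap, one 2x2 filter of ones, stride 2.
ifmap = np.arange(16, dtype=np.float64).reshape(1, 1, 4, 4)
filt = np.ones((1, 1, 2, 2))
bias = np.zeros(1)
out = conv_layer(ifmap, filt, bias, U=2)
print(out.shape)   # (1, 1, 2, 2)
print(out[0, 0])   # each entry is the sum of one non-overlapping 2x2 window
```

With stride 2 and a 2x2 filter the windows do not overlap, so each of the four outputs is simply the sum of one quadrant of the input.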
Figure 2.2b shows a visualization of this computation (ignoring biases). As much as possible, we will adhere to the following coloring scheme in this book.
• Blue: input activations belonging to an input feature map.
• Green: weights belonging to a filter.
• Red: partial sums. Note: since there is no formal term for an array of partial sums, we will sometimes label an array of partial sums as an output feature map and color it red (even though, technically, output feature maps are composed of activations derived from partial sums that have passed through a nonlinear function and therefore should be blue).
Returning to the CONV layer calculation in Equation (2.1), one notes that the operands (i.e., the ofmaps, ifmaps, and filters) have many dimensions. Therefore, these operands can be viewed as tensors (i.e., high-dimensional arrays), and the computation can be treated as a tensor algebra computation that performs binary operations (e.g., multiplications and additions forming dot products) between tensors to produce new tensors. Since the CONV layer can be viewed as a tensor algebra operation, an alternative representation for a CONV layer can be created using the tensor index notation found in [51], which describes a compiler for sparse tensor algebra computations. The tensor index notation provides a compact way to describe a kernel's functionality. For example, in this notation matrix multiply Z = AB can be written as:

\[
Z_{i,j} = \sum_{k} A_{i,k} B_{k,j}
\]
That is, the output point (i, j) is formed by taking a dot product of k values along the i-th row of A and the j-th column of B. Extending this notation to express computation on the index variables (by putting those calculations in parentheses) allows a CONV layer in tensor index notation to be represented quite concisely as:

\[
O_{n,m,p,q} = \left( \sum_{c,r,s} I_{n,c,(Up+r),(Uq+s)} F_{m,c,r,s} \right) + b_{m}
\]
In this calculation, each output at a point (n, m, p, q) is calculated as a dot product taken across the index variables c, r, and s of the specified elements of the input activation and filter weight tensors. Note that this notation attaches no significance to the order of the index variables in the summation. The relevance of this will become apparent in the discussion of dataflows (Chapter 5) and mapping computations onto a DNN accelerator (Chapter 6).
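The tensor index notation maps closely onto NumPy's einsum, which offers one way to experiment with it. The sketch below is an illustration under stated assumptions rather than the notation's reference implementation: the helper names matmul_index and conv_index are hypothetical, the array layouts (N, C, H, W) and (M, C, R, S) are assumed, and the strided index expressions Up + r and Uq + s are realized by taking sliding windows and keeping every U-th window origin. The last lines check that reducing over c, r, and s in two different orders gives the same result, as the notation implies.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def matmul_index(A, B):
    # Z[i, j] = sum over k of A[i, k] * B[k, j]
    return np.einsum("ik,kj->ij", A, B)

def strided_windows(I, R, S, U):
    # windows[n, c, h, w, r, s] == I[n, c, h + r, w + s]
    windows = sliding_window_view(I, (R, S), axis=(2, 3))
    # Keep only window origins at h = U*p and w = U*q.
    return windows[:, :, ::U, ::U]

def conv_index(I, F, b, U=1):
    # O[n, m, p, q] = sum over c, r, s of I[n, c, U*p + r, U*q + s] * F[m, c, r, s] + b[m]
    R, S = F.shape[2], F.shape[3]
    windows = strided_windows(I, R, S, U)
    return np.einsum("ncpqrs,mcrs->nmpq", windows, F) + b[None, :, None, None]

rng = np.random.default_rng(0)
I = rng.integers(-3, 4, size=(2, 3, 5, 5))   # N=2, C=3, H=W=5
F = rng.integers(-3, 4, size=(4, 3, 3, 3))   # M=4, C=3, R=S=3
b = rng.integers(-3, 4, size=4)
out = conv_index(I, F, b, U=2)               # P = Q = 2

# All elementwise products, with axes ordered (n, m, c, p, q, r, s).
prod = strided_windows(I, 3, 3, 2)[:, None] * F[None, :, :, None, None]
sum_s_r_c = prod.sum(axis=6).sum(axis=5).sum(axis=2)   # reduce s, then r, then c
sum_c_r_s = prod.sum(axis=2).sum(axis=4).sum(axis=4)   # reduce c, then r, then s
print(np.array_equal(sum_s_r_c, sum_c_r_s))
```

Integer inputs are used so that the two reduction orders can be compared exactly; with floating-point data the results would agree only up to rounding.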
Finally, to align the terminology of CNNs with the generic DNN,
• filters are composed of weights (i.e., synapses), and
• input and output feature maps (ifmaps, ofmaps) are composed of input and output activations (i.e., input and output neurons), where activations are partial sums after application of a nonlinear function.
Figure 2.3: Fully connected layer from convolution point of view with H = R, W = S, P = Q = 1, and U = 1.
2.3.2 FC LAYER (FULLY CONNECTED)
In an FC layer, every value in the output feature map is a weighted sum of every input value in the input feature map (i.e., it is fully connected). Furthermore, FC layers typically do not exhibit weight sharing, and as a result the computation tends to be memory-bound. FC layers are often processed in the form of a matrix multiplication, which will be explained in Chapter 4. This is the reason why matrix multiplication is often associated with DNN processing.
An FC layer can also be viewed as a special case of a CONV layer, specifically one where the filters are the same size as the input feature maps. As a result, it does not have the local, sparsely connected, weight-sharing property of CONV layers. Equation (2.1) therefore still holds for the computation of FC layers with a few additional constraints on the shape parameters: H = R, W = S, P = Q = 1, and U = 1. Figure 2.3 shows a visualization of this computation, and in the tensor index notation from Section 2.3.1 it is:

\[
O_{n,m} = \left( \sum_{c,h,w} I_{n,c,h,w} F_{m,c,h,w} \right) + b_{m}
\]
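The special-case view can be checked numerically. The sketch below, with hypothetical helper names fc_as_conv and fc_as_matmul and assumed array layouts (N, C, H, W) and (M, C, H, W), evaluates an FC layer once as a CONV layer whose filters cover the whole input feature map and once as a matrix multiplication over flattened inputs and weights, and confirms that the two agree.

```python
import numpy as np

def fc_as_conv(I, F, b):
    # CONV layer with H = R, W = S, U = 1, so P = Q = 1:
    # each filter is applied at a single position and yields one output per (n, m).
    N, C, H, W = I.shape
    M, C_f, R, S = F.shape
    assert (C, H, W) == (C_f, R, S), "filters must match the ifmap size"
    return np.einsum("nchw,mchw->nm", I, F) + b

def fc_as_matmul(I, F, b):
    # Flatten each ifmap and each filter into a vector of length C*H*W,
    # then compute all N x M outputs with a single matrix multiplication.
    N, M = I.shape[0], F.shape[0]
    x = I.reshape(N, -1)        # (N, C*H*W)
    weights = F.reshape(M, -1)  # (M, C*H*W)
    return x @ weights.T + b

rng = np.random.default_rng(0)
I = rng.standard_normal((2, 3, 4, 4))   # N=2, C=3, H=W=4
F = rng.standard_normal((5, 3, 4, 4))   # M=5 filters, R=S=4
b = rng.standard_normal(5)
y_conv = fc_as_conv(I, F, b)
y_mm = fc_as_matmul(I, F, b)
print(y_conv.shape, np.allclose(y_conv, y_mm))
```

Each weight is used exactly once per input here, which illustrates the absence of weight sharing noted above.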
2.3.3 NONLINEARITY
A nonlinear activation function is typically applied after each CONV or FC layer. Various nonlinear functions are used to introduce nonlinearity into the DNN, as shown in Figure 2.4. These include historically conventional nonlinear functions such as sigmoid and hyperbolic tangent, which were popular because they facilitate mathematical analysis and proofs. The rectified linear unit (ReLU) [52] has become popular in recent years due to its simplicity and its ability to enable fast training while achieving comparable accuracy. Variations of ReLU, such as leaky ReLU [53], parametric ReLU [54], exponential linear unit (ELU) [55], and Swish [56], have also been explored for improved accuracy. Finally, a nonlinearity called maxout, which takes the maximum value of two intersecting linear functions, has been shown to be effective in speech recognition tasks [57, 58].
Figure 2.4: Various forms of nonlinear activation functions. (Figure adapted from [62].)
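The activation functions named in Section 2.3.3 can be sketched in a few lines of NumPy. These follow the commonly used definitions; the default parameter values (alpha = 0.01 for leaky ReLU, alpha = 1 for ELU, beta = 1 for Swish) and the scalar slope and intercept arguments of maxout are assumptions chosen for illustration, not values taken from the text. Parametric ReLU has the same form as leaky ReLU, except that alpha is learned during training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):
    # Small fixed slope alpha for negative inputs (learned in parametric ReLU).
    return np.where(x >= 0, x, alpha * x)

def elu(x, alpha=1.0):
    # expm1 is only applied to the non-positive part to avoid overflow.
    return np.where(x >= 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def maxout(x, w1, b1, w2, b2):
    # Maximum of two linear functions of x (hypothetical scalar parameters).
    return np.maximum(w1 * x + b1, w2 * x + b2)

x = np.array([-2.0, 0.0, 2.0])
for fn in (sigmoid, tanh, relu, leaky_relu, elu, swish):
    print(fn.__name__, fn(x))
print("maxout", maxout(x, 1.0, 0.0, -1.0, 0.0))  # these two lines give |x|
```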
2.3.4 POOLING AND UNPOOLING
There are a variety of computations that can be used to change the spatial resolution (i.e., H and W or P and Q) of the feature map depending on the application. For applications such as image classification, the goal is to summarize the entire image into one label; therefore, reducing the spatial resolution may be desirable. Networks that reduce input into a sparse output are often referred to as encoder networks. For applications such as semantic segmentation, the goal is to assign a label to each pixel in the image; as a result, increasing the spatial resolution may be desirable. Networks that expand input into a dense output are often referred to as decoder networks.
Reducing the spatial resolution of a feature map is referred to as pooling or more generically downsampling. Pooling, which is applied to each channel separately, enables the network to be robust and invariant to small shifts and distortions. Pooling combines, or pools,