in its receptive field into a smaller number of values. Pooling can be parameterized by the size of its receptive field (e.g., 2×2) and the pooling operation (e.g., max or average), as shown in Figure 2.5. Typically, pooling occurs on non-overlapping blocks (i.e., the stride is equal to the size of the pooling). Usually a stride greater than one is used so that there is a reduction in the spatial resolution of the representation (i.e., feature map). Pooling is usually performed after the nonlinearity.
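For concreteness, the following NumPy sketch (our own illustration; the function name, shapes, and the 2×2/stride-2 choice are assumptions, not from the text) shows max and average pooling over non-overlapping 2×2 blocks:

```python
# A minimal NumPy sketch of 2x2 max and average pooling with stride 2
# (non-overlapping blocks), applied to a single-channel feature map.
import numpy as np

def pool2x2(fmap, mode="max"):
    """Pool a (H, W) feature map with a 2x2 window and stride 2."""
    H, W = fmap.shape
    blocks = fmap[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fmap = np.arange(16, dtype=np.float32).reshape(4, 4)
print(pool2x2(fmap, "max"))   # 2x2 output: the max of each 2x2 block
print(pool2x2(fmap, "avg"))   # 2x2 output: the mean of each 2x2 block
```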
Figure 2.5: Various forms of pooling.
Figure 2.6: Various forms of unpooling/upsampling. (Figures adapted from [64].)
Increasing the spatial resolution of a feature map is referred to as unpooling or more generically as upsampling. Commonly used forms of upsampling include inserting zeros between the activations, as shown in Figure 2.6a (this type of upsampling is commonly referred to as unpooling), interpolation using nearest neighbors [63, 64], as shown in Figure 2.6b, and interpolation with bilinear or bicubic filtering [65]. Upsampling is usually performed before the CONV or FC layer. Upsampling can introduce structured sparsity in the input feature map that can be exploited for improved energy efficiency and throughput, as described in Section 8.1.1.
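A minimal NumPy sketch of zero-insertion unpooling and nearest-neighbor upsampling follows (the factor of 2 and the function names are illustrative assumptions, not from the text):

```python
# A minimal NumPy sketch of two upsampling schemes described above:
# zero-insertion (unpooling) and nearest-neighbor interpolation.
import numpy as np

def unpool_zero_insert(fmap, factor=2):
    """Insert zeros between activations (structured sparsity in the output)."""
    H, W = fmap.shape
    out = np.zeros((H * factor, W * factor), dtype=fmap.dtype)
    out[::factor, ::factor] = fmap
    return out

def upsample_nearest(fmap, factor=2):
    """Replicate each activation into a factor x factor block."""
    return np.repeat(np.repeat(fmap, factor, axis=0), factor, axis=1)

fmap = np.array([[1.0, 2.0], [3.0, 4.0]])
print(unpool_zero_insert(fmap))  # zeros interleaved with the activations
print(upsample_nearest(fmap))    # each value repeated in a 2x2 block
```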
2.3.5 NORMALIZATION
Controlling the input distribution across layers can help to significantly speed up training and improve accuracy. Accordingly, the distribution of the layer input activations (σ, μ) is normalized such that it has zero mean and unit standard deviation. In batch normalization (BN), the normalized value is further scaled and shifted, as shown in Equation (2.5), where the parameters (γ, β) are learned from training [66]:

y = γ (x − μ) / √(σ² + ∊) + β,    (2.5)
where ∊ is a small constant to avoid numerical problems.
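As an illustration of Equation (2.5), here is a minimal NumPy sketch of the BN computation using batch statistics (variable names are our own; at inference, μ and σ² would instead come from running averages collected during training):

```python
# A minimal NumPy sketch of the batch-normalization computation in
# Equation (2.5): normalize to zero mean / unit standard deviation,
# then scale by gamma and shift by beta.
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, features); gamma, beta: (features,) learned parameters."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit std dev
    return gamma * x_hat + beta             # learned scale and shift

x = np.random.randn(8, 4) * 3.0 + 1.0
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))        # approx. 0 and 1 per feature
```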
Prior to the wide adoption of BN, local response normalization (LRN) [7] was used, which was inspired by lateral inhibition in neurobiology, where excited neurons (i.e., high value activations) subdue their neighbors (i.e., cause low value activations); however, BN is now considered standard practice in the design of CNNs, while LRN is mostly deprecated. Note that while LRN is usually performed after the nonlinear function, BN is usually performed between the CONV or FC layer and the nonlinear function. If BN is performed immediately after the CONV or FC layer, its computation can be folded into the weights of the CONV or FC layer, resulting in no additional computation for inference.
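To illustrate the folding mentioned above, here is a minimal NumPy sketch that folds BN into a preceding FC layer for inference (names and shapes are illustrative assumptions; the same per-output-channel scaling applies to CONV weights):

```python
# A minimal NumPy sketch of folding BN into the preceding FC layer's
# weights for inference. mu, var, gamma, beta are the BN statistics
# and learned parameters.
import numpy as np

def fold_bn_into_fc(W, b, mu, var, gamma, beta, eps=1e-5):
    """W: (out, in), b: (out,); returns (W_folded, b_folded)."""
    scale = gamma / np.sqrt(var + eps)          # per-output-channel scale
    W_folded = W * scale[:, None]               # scale each output row
    b_folded = (b - mu) * scale + beta          # fold mean shift into bias
    return W_folded, b_folded

# Check: the folded layer matches FC followed by BN (inference statistics).
rng = np.random.default_rng(0)
W, b = rng.standard_normal((3, 5)), rng.standard_normal(3)
mu, var = rng.standard_normal(3), rng.random(3) + 0.5
gamma, beta = rng.standard_normal(3), rng.standard_normal(3)
x = rng.standard_normal(5)
y_ref = gamma * ((W @ x + b - mu) / np.sqrt(var + 1e-5)) + beta
Wf, bf = fold_bn_into_fc(W, b, mu, var, gamma, beta)
print(np.allclose(Wf @ x + bf, y_ref))          # True
```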
2.3.6 COMPOUND LAYERS
The above primitive layers can be combined to form compound layers. For instance, attention layers are composed of matrix multiplications and feed-forward, fully connected layers [68]. Attention layers have become popular for processing a wide range of data including language and images and are commonly used in a class of DNNs called Transformers. We will discuss Transformers in more detail in Section 2.5. Another example of a compound layer is the up-convolution layer [60], which performs zero-insertion (unpooling) on the input and then applies a convolutional layer. Up-convolution layers are typically used in DNNs such as Generative Adversarial Networks (GANs) and Auto Encoders (AEs) that process image data. We will discuss GANs and AEs in more detail in Section 2.5.
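As a rough illustration of how an attention layer reduces to matrix multiplications, here is a minimal sketch of single-head scaled dot-product attention in NumPy (the dimensions and random projection matrices are illustrative placeholders for learned weights, not a definition taken from [68]):

```python
# A minimal NumPy sketch of single-head scaled dot-product attention,
# illustrating that this compound layer reduces to matrix multiplications.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """X: (tokens, d_model); Wq/Wk/Wv: (d_model, d_head)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # FC projections (matmuls)
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # token-to-token similarity
    return softmax(scores, axis=-1) @ V            # weighted sum of values

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                    # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)              # (4, 8)
```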
2.4 CONVOLUTIONAL NEURAL NETWORKS (CNNs)
CNNs are a common form of DNNs that are composed of multiple CONV layers, as shown in Figure 2.7. In such networks, each layer generates a successively higher-level abstraction of the input data, called a feature map (fmap), which preserves essential yet unique information. Modern CNNs are able to achieve superior performance by employing a very deep hierarchy of layers. CNNs are widely used in a variety of applications including image understanding [7], speech recognition [70], game play [10], robotics [42], etc. This book will focus on their use in image processing, specifically for the task of image classification [7]. Modern CNN models for image classification typically have 5 [7] to more than 1,000 [24] CONV layers. A small number, e.g., 1 to 3, of FC layers are typically applied after the CONV layers for classification purposes.
Figure 2.7: Convolutional Neural Networks.
2.4.1 POPULAR CNN MODELS
Many CNN models have been developed over the past two decades. Each of these models differs in terms of number of layers, layer types, layer shapes (i.e., filter size, number of channels, and number of filters), and connections between layers. Understanding these variations and trends is important for incorporating the right flexibility in any efficient DNN accelerator, as discussed in Chapter 3.
In this section, we will give an overview of various popular CNNs, such as LeNet [71] as well as those that competed in and/or won the ImageNet Challenge [23], as shown in Figure 1.8. Pre-trained weights for most of these models are publicly available for download; the CNN models are summarized in Table 2.2. Two results for Top-5 error are reported in the table. In the first row, the accuracy is boosted by using multiple crops from the image and an ensemble of multiple trained models (i.e., the CNN needs to be run several times); these results were used to compete in the ImageNet Challenge. The second row reports the accuracy if only a single crop was used (i.e., the CNN is run only once), which is more consistent with what would likely be deployed in real-time and/or energy-constrained applications.
Table 2.2: Summary of popular CNNs [7, 24, 71, 73, 74]. †Accuracy is measured based on Top-5 error on ImageNet [23] using multiple crops. ‡This version of LeNet-5 has 431k weights for the filters and requires 2.3M MACs per image, and uses ReLU rather than sigmoid.
LeNet [20] was one of the first CNN approaches, introduced in 1989. It was designed for the task of digit classification in grayscale images of size 28×28. The most well known version, LeNet-5, contains two CONV layers followed by two FC layers [71]. Each CONV layer uses filters of size 5×5 (1 channel per filter) with 6 filters in the first layer and 16 filters in the second layer. Average pooling of 2×2 is used after each convolution and a sigmoid is used for the nonlinearity. In total, LeNet requires 60k weights and 341k multiply-and-accumulates (MACs) per image. LeNet led to CNNs’ first commercial success, as it was deployed in ATMs to recognize digits for check deposits.
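A rough PyTorch-style sketch of this LeNet-5 structure is shown below. This is our own illustration: the FC widths of 120 and 10, the padding on the first CONV layer, and the dense connectivity of the second CONV layer are assumptions borrowed from the original LeNet-5 description, not stated in the text above (the original used sparse inter-layer connections, which the text's "1 channel per filter" refers to and which are not modeled here):

```python
# A minimal PyTorch sketch of the LeNet-5 structure described above.
# FC widths, padding, and the 28x28 input size are assumptions.
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),   # 28x28 -> 28x28, 6 filters
            nn.Sigmoid(),
            nn.AvgPool2d(2),                             # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),             # 14x14 -> 10x10, 16 filters
            nn.Sigmoid(),
            nn.AvgPool2d(2),                             # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120),                  # first FC layer
            nn.Sigmoid(),
            nn.Linear(120, num_classes),                 # second FC layer
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))
```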
AlexNet [7] was the first CNN to win the ImageNet Challenge in 2012. It consists of five CONV layers followed by three FC layers. Within each CONV layer, there are 96 to 384 filters and the filter size ranges from 3×3 to 11×11, with 3 to 256 channels each. In the first layer, the three channels of the filter correspond to the red, green, and blue components of the input image. A ReLU nonlinearity is used in each layer. Max