layers 1, 2, and 5. To reduce computation, a stride of 4 is used at the first layer of the network. AlexNet introduced the use of LRN in layers 1 and 2 before the max pooling, though LRN is no longer popular in later CNN models. One important factor that differentiates AlexNet from LeNet is that the number of weights is much larger and the filter shapes vary from layer to layer. To reduce the number of weights and the amount of computation in the second CONV layer, the 96 output channels of the first layer are split into two groups of 48 input channels for the second layer, such that the filters in the second layer only have 48 channels. This approach, referred to as “grouped convolution,” is illustrated in Figure 2.8. Similarly, the weights in the fourth and fifth layers are also split into two groups. In total, AlexNet requires 61M weights and 724M MACs to process one 227×227 input image.
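As a concrete illustration of grouped convolution, the following minimal PyTorch sketch (an illustration, not the original AlexNet implementation) builds the second CONV layer with and without grouping; with two groups, each 5×5 filter spans only 48 of the 96 input channels, halving the weights and multiplications.

```python
import torch
import torch.nn as nn

# AlexNet's second CONV layer: 96 input channels, 256 filters of 5x5.
# With groups=2, each filter sees only 48 of the 96 input channels,
# which halves the number of weights and MACs for this layer.
conv_full    = nn.Conv2d(96, 256, kernel_size=5, padding=2)            # filters span all 96 channels
conv_grouped = nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2)  # filters span 48 channels each

x = torch.randn(1, 96, 27, 27)        # 27x27 activations after the first pooling layer
print(conv_full.weight.shape)         # torch.Size([256, 96, 5, 5])
print(conv_grouped.weight.shape)      # torch.Size([256, 48, 5, 5]) -> 2x fewer weights
print(conv_grouped(x).shape)          # torch.Size([1, 256, 27, 27])
```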
Overfeat [72] has a very similar architecture to AlexNet with five CONV layers followed by three FC layers. The main differences are that the number of filters is increased for layers 3 (384 to 512), 4 (384 to 1024), and 5 (256 to 1024), layer 2 is not split into two groups, the first FC layer only has 3072 channels rather than 4096, and the input size is 231×231 rather than 227×227. As a result, the number of weights grows to 146M and the number of MACs grows to 2.8G per image. Overfeat has two different models: fast (described here) and accurate. The accurate model used in the ImageNet Challenge gives a 0.65% lower Top-5 error rate than the fast model at the cost of 1.9× more MACs.
VGG-16 [73] goes deeper to 16 layers consisting of 13 CONV layers followed by 3 FC layers. In order to balance out the cost of going deeper, larger filters (e.g., 5×5) are built from multiple smaller filters (e.g., 3×3), which have fewer weights, to achieve the same effective receptive fields, as shown in Figure 2.9a. As a result, all CONV layers have the same filter size of 3×3. In total, VGG-16 requires 138M weights and 15.5G MACs to process one 224×224 input image. VGG has two different models: VGG-16 (described here) and VGG-19. VGG-19 gives a 0.1% lower Top-5 error rate than VGG-16 at the cost of 1.27× more MACs.
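The decomposition of Figure 2.9a can be sketched in a few lines of PyTorch (the channel count C below is an arbitrary illustrative value, not taken from VGG-16): two stacked 3×3 CONV layers cover the same 5×5 receptive field as a single 5×5 layer while using roughly 28% fewer weights (2 × 9 versus 25 weights per filter plane).

```python
import torch
import torch.nn as nn

C = 64  # channel count held constant for the comparison (illustrative value)

# One 5x5 CONV layer vs. two stacked 3x3 CONV layers with the same 5x5 receptive field.
single_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2)
stacked_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
)

num_params = lambda m: sum(p.numel() for p in m.parameters())
print(num_params(single_5x5))   # 102,464 weights (C*C*25 plus biases)
print(num_params(stacked_3x3))  # 73,856 weights (2*(C*C*9) plus biases) -> ~28% fewer

x = torch.randn(1, C, 56, 56)
print(single_5x5(x).shape, stacked_3x3(x).shape)  # both preserve the 56x56 spatial size
```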
GoogLeNet [74] goes even deeper with 22 layers. It introduced an inception module, shown in Figure 2.10, whose input is distributed through multiple feed-forward connections to several parallel layers. These parallel layers contain different sized filters (i.e., 1×1, 3×3, 5×5), along with 3×3 max pooling, and their outputs are concatenated to form the module output. Using multiple filter sizes has the effect of processing the input at multiple scales. For improved training speed, GoogLeNet is designed such that the weights and the activations, which are stored for backpropagation during training, can all fit into the GPU memory. In order to reduce the number of weights, 1×1 filters are applied as a “bottleneck” to reduce the number of channels for each filter [75], as shown in Figure 2.11. The 22 layers consist of three CONV layers, followed by nine inception modules (each of which is two CONV layers deep), and one FC layer. The number of FC layers was reduced from three to one using a global average pooling layer, which summarizes the large feature map from the CONV layers into one value per channel; global pooling will be discussed in more detail in Section 9.1.2. Since its introduction in 2014, GoogLeNet (also referred to as Inception) has had multiple versions: v1 (described here), v3, and v4. Inception-v3 decomposes the convolutions by using smaller 1-D filters, as shown in Figure 2.9b, to reduce the number of MACs and weights in order to go deeper to 42 layers. In conjunction with batch normalization [66], v3 achieves over 3% lower Top-5 error than v1 with 2.5× more MACs [76]. Inception-v4 uses residual connections [77], described in the next section, for a 0.4% reduction in error.
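To make the structure of Figures 2.10 and 2.11 concrete, below is a minimal PyTorch sketch of an inception module (not GoogLeNet's reference implementation; the channel counts in the example are illustrative). The 1×1 CONV layers in front of the 3×3 and 5×5 branches are the bottlenecks that reduce the number of channels before the larger filters are applied.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Sketch of a GoogLeNet-style inception module (channel counts are illustrative).

    Four parallel branches process the same input and their outputs are
    concatenated along the channel dimension. The 1x1 CONV layers in the 3x3
    and 5x5 branches act as bottlenecks that reduce the number of channels
    (and hence the weights and MACs of the larger filters).
    """
    def __init__(self, c_in, c1, c3_reduce, c3, c5_reduce, c5, c_pool):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(c_in, c1, 1), nn.ReLU())
        self.branch3 = nn.Sequential(nn.Conv2d(c_in, c3_reduce, 1), nn.ReLU(),
                                     nn.Conv2d(c3_reduce, c3, 3, padding=1), nn.ReLU())
        self.branch5 = nn.Sequential(nn.Conv2d(c_in, c5_reduce, 1), nn.ReLU(),
                                     nn.Conv2d(c5_reduce, c5, 5, padding=2), nn.ReLU())
        self.branch_pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                         nn.Conv2d(c_in, c_pool, 1), nn.ReLU())

    def forward(self, x):
        # Concatenate the four branch outputs along the channel dimension.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

# Example: 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels.
module = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(module(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```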
Figure 2.8: An example of dividing a feature map into two grouped convolutions. Each filter requires 2× fewer weights and multiplications.
Figure 2.9: Decomposing larger filters into smaller filters.
Figure 2.10: Inception module from GoogLeNet [74] with example channel lengths. Note that each CONV layer is followed by a ReLU (not drawn).
ResNet [24], also known as Residual Net, uses feed-forward connections that connect to layers beyond the immediate next layer (often referred to as residual, skip, or identity connections); these connections enable a DNN with many layers (e.g., 34 or more) to be trainable. It was the first CNN entry in the ImageNet Challenge to exceed human-level accuracy with a Top-5 error rate below 5%. One of the challenges with deep networks is the vanishing gradient during training [78]: as the error backpropagates through the network, the gradient shrinks, which limits the ability to update the weights in the earlier layers of very deep networks. ResNet introduces a “shortcut” module which contains an identity connection such that the weight layers (i.e., CONV layers) can be skipped, as shown in Figure 2.12. Rather than learning the function for the weight layers, F(x), the shortcut module learns the residual mapping (F(x) = H(x) − x). Initially, F(x) is zero and the identity connection is taken; then, gradually during training, the actual forward connection through the weight layers is used. ResNet also uses the “bottleneck” approach of applying 1×1 filters to reduce the number of weights. As a result, the two layers in the shortcut module are replaced by three layers (1×1, 3×3, 1×1), where the first 1×1 layer reduces the number of activations and thus the weights in the 3×3 layer, and the last 1×1 layer restores the number of activations at the output of the module. ResNet-50 consists of one CONV layer, followed by 16 shortcut modules (each of which is three CONV layers deep), and one FC layer; it requires 25.5M weights and 3.9G MACs per image. There are various versions of ResNet with different depths (e.g., without bottleneck: 18, 34; with bottleneck: 50, 101, 152). The ResNet with 152 layers was the winner of the ImageNet Challenge, requiring 11.3G MACs and 60M weights. Compared to ResNet-50, it reduces the Top-5 error by around 1% at the cost of 2.9× more MACs and 2.5× more weights.
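The bottleneck shortcut module of Figure 2.12 can be sketched as follows in PyTorch (a minimal illustration, not the reference implementation; batch normalization, used in the actual ResNet, is omitted for brevity). Note that the final ReLU is applied after the addition, matching the figure.

```python
import torch
import torch.nn as nn

class BottleneckShortcut(nn.Module):
    """Sketch of a ResNet-style bottleneck shortcut module.

    The 1x1-3x3-1x1 stack computes the residual F(x); the identity connection
    adds x back, so the module outputs H(x) = F(x) + x. The first 1x1 layer
    reduces the number of channels seen by the 3x3 layer, and the last 1x1
    layer restores the channel count at the module output.
    """
    def __init__(self, channels, bottleneck):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1), nn.ReLU(),              # reduce channels
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(bottleneck, channels, kernel_size=1),                         # restore channels
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.residual(x) + x)  # identity (skip) connection, ReLU after addition

# Example: a 256-channel feature map with a 64-channel bottleneck.
block = BottleneckShortcut(256, 64)
print(block(torch.randn(1, 256, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```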
Figure 2.11: Applying a 1×1×C filter (usually referred to as 1×1) captures cross-channel correlation, but no spatial correlation. This bottleneck approach reduces the number of channels in the next layer, assuming the number of filters applied (M) is less than the original number of channels (C).
Figure 2.12: Shortcut module from ResNet [24]. Note that the ReLU following the last CONV layer in the shortcut module is applied after the addition.
Several trends can be observed in the popular CNNs shown in Table 2.2. Increasing the depth of the network tends to provide higher accuracy. Controlling for the number of weights, a deeper network can support a wider range of nonlinear functions that are more discriminative, and it also provides more levels of hierarchy in the learned representation [24, 73, 74, 79]. The filter shapes continue to vary across layers; thus, flexibility in supporting different shapes is still important. Furthermore, most of the computation has been placed in the CONV layers rather than the FC layers. In addition, the number of weights in the FC layers has been reduced, and in most recent networks (since GoogLeNet) the CONV layers also dominate in terms of weights. Thus, the focus of hardware implementations targeted at CNNs should be on addressing the efficiency of the CONV layers, which in many domains are increasingly important.
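To see where these trends come from, the short sketch below counts weights and MACs (ignoring biases) for one CONV layer and one FC layer, using illustrative, roughly VGG-16-like shapes rather than the exact entries of Table 2.2: a CONV layer reuses each weight at every output position, so it incurs far more MACs per weight, while a single FC layer can hold an enormous number of weights that are each used only once per input.

```python
def conv_layer_cost(c_in, c_out, k, out_h, out_w):
    """Weights and MACs for a CONV layer with c_out filters of size k x k x c_in."""
    weights = c_out * c_in * k * k
    macs = weights * out_h * out_w   # each filter is applied at every output position
    return weights, macs

def fc_layer_cost(n_in, n_out):
    """Weights and MACs for a fully connected layer (one MAC per weight)."""
    weights = n_in * n_out
    return weights, weights

# Roughly VGG-16-like examples (illustrative shapes):
print(conv_layer_cost(256, 256, 3, 56, 56))   # ~0.6M weights, ~1.8G MACs
print(fc_layer_cost(7 * 7 * 512, 4096))       # ~103M weights, ~103M MACs
```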
Since ResNet, there have been several other notable networks that have been proposed to increase accuracy. DenseNet [84] extends the concept of skip connections by adding skip