multiple previous layers to strengthen feature map propagation and feature reuse. This concept, commonly referred to as feature aggregation, continues to be widely explored. WideNet [85] proposes increasing the width (i.e., the number of filters) rather than the depth of the network, which has the added benefit that increasing width is more parallel-friendly than increasing depth. ResNeXt [86] proposes increasing the number of convolution groups (referred to as cardinality) instead of the depth and width of the network, and was used as part of the winning entry for ImageNet in 2017. Finally, EfficientNet [87] proposes uniformly scaling all dimensions, including depth, width, and resolution, rather than focusing on a single dimension, since there is an interplay between the different dimensions (e.g., to support a higher input image resolution, the DNN needs more depth to increase the receptive field and more width to capture more fine-grained patterns). WideNet, ResNeXt, and EfficientNet demonstrate that there exist methods beyond increasing depth for increasing accuracy, and thus highlight that much remains to be explored and understood about the relationship between layer shape, number of layers, and accuracy.
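      To make the idea of jointly scaling the different dimensions concrete, the sketch below computes scaled depth, width, and resolution from a single coefficient. The multipliers alpha, beta, and gamma are the values reported in the EfficientNet paper [87]; the baseline depth, width, and resolution are illustrative placeholders, not an actual EfficientNet configuration.

```python
# Minimal sketch of compound scaling in the spirit of EfficientNet [87].
import math

ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution multipliers
# Chosen in [87] so that ALPHA * BETA**2 * GAMMA**2 ~= 2, i.e., each
# increment of phi roughly doubles the FLOPs of the network.

def compound_scale(base_depth, base_width, base_resolution, phi):
    """Scale all three dimensions together with a single coefficient phi."""
    depth = math.ceil(base_depth * ALPHA ** phi)             # number of layers
    width = math.ceil(base_width * BETA ** phi)               # number of filters
    resolution = math.ceil(base_resolution * GAMMA ** phi)    # input image size
    return depth, width, resolution

# Example: scaling an illustrative baseline by phi = 0, 1, 2.
for phi in range(3):
    print(compound_scale(base_depth=18, base_width=32, base_resolution=224, phi=phi))
```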


      Figure 2.13: Auto Encoder network for semantic segmentation. Feature maps along with pooling and upsampling layers are shown. (Figure adapted from [92].)

      There are other types of DNNs beyond CNNs including Recurrent Neural Networks (RNNs) [88, 89], Transformers [68], Auto Encoders (AEs) [90], and Generative Adversarial Networks (GANs) [91]. The diverse types of DNNs allow them to handle a wide range of inputs for a wide range of tasks. For instance, RNNs and Transformers are often used to handle sequential data that can have variable length (e.g., audio for speech recognition, or text for natural language processing). AEs and GANs can be used to generate dense output predictions by combining encoder and decoder networks. Example applications that use AEs include predicting pixel-wise depth values for depth estimation [64] and assigning pixel-wise class labels for semantic segmentation [92], as shown in Figure 2.13. Example applications that use GANs to generate images with the same statistics as the training set include image synthesis [93] and style transfer [94].
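      The following PyTorch sketch illustrates the encoder-decoder structure behind such dense predictions: an encoder shrinks the spatial resolution of the feature maps, and a decoder restores it while mapping features to one score per class per pixel, in the spirit of Figure 2.13. The layer counts and channel widths are illustrative choices, not those of the network in [92].

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        # Encoder: convolutions + pooling shrink the spatial resolution.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # H/2 x W/2
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # H/4 x W/4
        )
        # Decoder: upsampling + convolutions restore the resolution and
        # produce a dense, pixel-wise class prediction.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(32, num_classes, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

logits = TinySegNet()(torch.randn(1, 3, 224, 224))  # -> (1, 21, 224, 224)
```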


      Figure 2.14: Dependencies in RNN are in both the time and depth dimension. The same weights (Wi) are used across time, while different weights are used across depth. (Figure adapted from [4].)

      While their applications may differ from the CNNs described in Section 2.4, many of the building blocks and primitive layers are similar. For instance, RNNs and Transformers heavily rely on matrix multiplications, which means that they have similar challenges as FC layers (e.g., they are memory bound due to lack of data reuse); thus, many of the techniques used to accelerate FC layers can also be used to accelerate RNNs and Transformers (e.g., tiling discussed in Chapter 4, network pruning discussed in Chapter 8, etc.). Similarly, the decoder networks of GANs and AEs for image processing use up-convolution layers, which involve upsampling the input feature map using zero insertion (unpooling) before applying a convolution; thus, many of the techniques used to accelerate CONV layers can also be used to accelerate the decoder networks of GANs and AEs for image processing (e.g., exploiting the input activation sparsity discussed in Chapter 8).
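      The sketch below shows an up-convolution of this form: the input feature map is upsampled by zero insertion and a normal convolution is then applied. The shapes and the 2x upsampling factor are illustrative; the point is that most of the convolution's input activations are zero, which is exactly the sparsity referred to above.

```python
import torch
import torch.nn as nn

def zero_insert(x, stride=2):
    """Insert zeros between the elements of a (N, C, H, W) feature map."""
    n, c, h, w = x.shape
    out = torch.zeros(n, c, h * stride, w * stride, dtype=x.dtype)
    out[:, :, ::stride, ::stride] = x
    return out

x = torch.randn(1, 8, 14, 14)           # input feature map
up = zero_insert(x, stride=2)            # (1, 8, 28, 28); 75% of values are zero
y = nn.Conv2d(8, 8, kernel_size=3, padding=1)(up)

# Three out of every four input activations to the convolution are zero,
# which is the structured sparsity that the techniques in Chapter 8 exploit.
print(up.eq(0).float().mean())           # ~0.75
```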

      While the dominant compute aspect of these DNNs is similar to that of CNNs, they do often require some other forms of compute. For instance, RNNs, particularly Long Short-Term Memory networks (LSTMs) [95], require support for element-wise multiplications as well as a variety of nonlinear functions (e.g., sigmoid, tanh), unlike CNNs, which typically only use ReLU. However, these operations do not tend to dominate run time or energy consumption; they can be computed in software [96], or the nonlinear functions can be approximated by piecewise linear lookup tables [97]. For GANs and AEs, additional support is required for upsampling.
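      A minimal sketch of such a piecewise linear lookup-table approximation is shown below for the sigmoid function. The number of segments and the input range are illustrative choices; in hardware, the slope and intercept of each segment would be stored in a small table and selected by the high-order bits of the input.

```python
import numpy as np

def build_pwl_table(fn, lo=-8.0, hi=8.0, segments=16):
    """Precompute slope/intercept for equal-width linear segments of fn."""
    xs = np.linspace(lo, hi, segments + 1)
    ys = fn(xs)
    slopes = (ys[1:] - ys[:-1]) / (xs[1:] - xs[:-1])
    intercepts = ys[:-1] - slopes * xs[:-1]
    return xs, slopes, intercepts

def pwl_eval(x, table):
    """Evaluate the piecewise linear approximation at x (clamped to the table range)."""
    xs, slopes, intercepts = table
    x = np.clip(x, xs[0], xs[-1])
    idx = np.clip(np.searchsorted(xs, x) - 1, 0, len(slopes) - 1)
    return slopes[idx] * x + intercepts[idx]

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
table = build_pwl_table(sigmoid)
x = np.linspace(-10, 10, 5)
print(pwl_eval(x, table))   # close to sigmoid(x); error shrinks with more segments
print(sigmoid(x))
```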

      Finally, RNNs have additional dependencies since the output of a layer is fed back to its input, as shown in Figure 2.14. For instance, the input to layer i at time t depends on the output of layer i − 1 at time t and the output of layer i at time t − 1. This is similar to the dependency across layers, in that the output of layer i is the input to layer i + 1. These dependencies limit what inputs can be processed in parallel (e.g., within the same batch). For DNNs with feed-forward layers, any inputs can be processed at the same time (i.e., batch size greater than one); however, multiple layers of the same input cannot be processed at the same time (e.g., layers i and i + 1). In contrast, RNNs can only process multiple inputs at the same time if the inputs are not sequentially dependent; in other words, RNNs can process two separate sequences at the same time, but not multiple elements within the same sequence (e.g., inputs t and t + 1 of the same sequence) and not multiple layers of the same input (which is similar to feed-forward networks).
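      This dependency pattern is illustrated in the sketch below using a single recurrent layer: the independent sequences in a batch can be processed in parallel, but the time steps of any one sequence must be processed serially. The sequence length and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

batch, seq_len, in_dim, hid_dim = 4, 10, 16, 32   # 4 independent sequences
cell = nn.RNNCell(in_dim, hid_dim)
inputs = torch.randn(seq_len, batch, in_dim)
h = torch.zeros(batch, hid_dim)

for t in range(seq_len):
    # The 4 sequences in the batch are processed in parallel, but the hidden
    # state at time t cannot be computed before the hidden state at time t-1
    # for the same sequence.
    h = cell(inputs[t], h)

print(h.shape)   # (4, 32): one hidden state per sequence
```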

      One of the key factors that has enabled the rapid development of DNNs is the set of development resources that have been made available by the research community and industry. These resources are also key to the development of DNN accelerators by providing characterizations of the workloads and facilitating the exploration of trade-offs in model complexity and accuracy. This section will describe these resources such that those who are interested in this field can quickly get started.

      For ease of DNN development and to enable the sharing of trained networks, several deep learning frameworks have been developed from various sources. These open-source frameworks contain software libraries for DNNs. Caffe was made available in 2014 from UC Berkeley [59]. It supports C, C++, Python, and MATLAB. TensorFlow [98] was released by Google in 2015, and supports C++ and Python; it also supports multiple CPUs and GPUs and has more flexibility than Caffe, with the computation expressed as dataflow graphs to manage the “tensors” (multidimensional arrays). Another popular framework is Torch, which was developed by Facebook and NYU and supports C, C++, and Lua; PyTorch [99] is its successor and is built in Python. There are several other frameworks, such as Theano, MXNet, and CNTK, which are described in [100]. There are also higher-level libraries that can run on top of the aforementioned frameworks to provide a more universal experience and faster development. One example of such libraries is Keras, which is written in Python and supports TensorFlow, CNTK, and Theano.
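      As a small example of how a DNN is expressed in one of these frameworks, the sketch below builds a toy network in PyTorch; the other frameworks offer analogous APIs. The layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # CONV layer
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling layer
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # FC layer
)

y = model(torch.randn(1, 3, 32, 32))   # "tensors" flow through the graph
print(y.shape)                          # (1, 10)
```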

      The existence of such frameworks is not only a convenient aid for DNN researchers and application designers, but is also invaluable for engineering high-performance or more efficient DNN computation engines. In particular, because the frameworks make heavy use of a set of primitive operations, such as the processing of a CONV layer, they can incorporate the use of optimized software or hardware accelerators. This acceleration is transparent to the user of the framework. Thus, for example, most frameworks can use Nvidia’s cuDNN library for rapid execution on Nvidia GPUs. Similarly, transparent incorporation of dedicated hardware accelerators can be achieved, as was done with the Eyeriss chip using Caffe [101].
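      The sketch below illustrates this transparency using PyTorch as one concrete example: the same model definition runs on an optimized backend (e.g., cuDNN) simply by moving the model and data to the GPU, with no change to the high-level code.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)

if torch.cuda.is_available():
    # cuDNN kernels are selected automatically for the same high-level code.
    print("cuDNN available:", torch.backends.cudnn.is_available())
    model, x = model.to("cuda"), x.to("cuda")

y = model(x)   # identical call on CPU or GPU; the backend choice is hidden
```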

      Finally, these frameworks are a valuable source of workloads for hardware researchers. They can be used to drive experimental designs for different workloads, to profile different workloads, and to explore hardware-algorithm trade-offs.
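      For instance, the built-in profilers of these frameworks can be used to characterize a DNN workload, as in the brief PyTorch sketch below; other frameworks provide similar tooling.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Conv2d(3, 64, kernel_size=3, padding=1)
x = torch.randn(1, 3, 224, 224)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x)

# Per-operator breakdown of where the time goes in this workload.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```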


      Figure 2.15: MNIST (10 classes, 60k training, 10k testing) [103] versus ImageNet (1000 classes, 1.3M training, 100k testing) [23] dataset.

      Pretrained DNN models can be downloaded from various websites [80–83] for the various frameworks.
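      As a brief example, the sketch below downloads an ImageNet-pretrained model through torchvision's model zoo; the exact API for selecting weights depends on the torchvision version installed.

```python
import torch
from torchvision import models

# Downloads ImageNet-pretrained weights on first use and caches them locally.
model = models.resnet18(weights="IMAGENET1K_V1")
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)   # (1, 1000): one score per ImageNet class
```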