Vivienne Sze

Efficient Processing of Deep Neural Networks



that even for the same DNN (e.g., AlexNet) the accuracy of these models can vary by around 1 to 2% depending on how the model was trained and tested, and thus the results do not always exactly match the original publication.

      These pre-trained models are often tied to a given framework. To make it easier to exchange models between different frameworks, the Open Neural Network Exchange (ONNX) has been established as an open ecosystem for interchangeable DNN models [102]; the current participants include Amazon, Facebook, and Microsoft.

      It is important to factor in the difficulty of the task when comparing different DNN models. For instance, classifying handwritten digits from the MNIST dataset [103] is much simpler than classifying an object into one of 1000 classes, as is required for the ImageNet dataset [23] (Figure 2.15). DNNs for the more difficult task are expected to be larger (i.e., to have more weights) and to require more MACs than DNNs for the simpler task, and thus to consume more energy and have lower throughput. For instance, LeNet-5 [71] is designed for digit classification, while AlexNet [7], VGG-16 [73], GoogLeNet [74], and ResNet [24] are designed for 1000-class image classification.
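      The link between task difficulty, number of weights, and number of MACs can be made concrete with a small calculation. The following is a minimal sketch, assuming a standard convolutional layer with square filters and no bias terms; the function names and the two layer shapes are illustrative assumptions (one small layer on a single-channel input, one large layer on a three-channel input) and are not taken from the text.

```python
def conv_layer_cost(in_ch, out_ch, kernel, out_h, out_w):
    """Return (weights, macs) for a convolutional layer, ignoring biases.

    Each output value needs in_ch * kernel * kernel multiply-accumulates,
    and there are out_ch * out_h * out_w output values.
    """
    weights = out_ch * in_ch * kernel * kernel
    macs = weights * out_h * out_w
    return weights, macs


def fc_layer_cost(in_features, out_features):
    """Return (weights, macs) for a fully connected layer, ignoring biases.

    Every weight is used exactly once per input, so both counts are equal.
    """
    weights = in_features * out_features
    return weights, weights


# Hypothetical layer shapes, chosen only to contrast a small and a large layer.
small = conv_layer_cost(in_ch=1, out_ch=6, kernel=5, out_h=28, out_w=28)
large = conv_layer_cost(in_ch=3, out_ch=96, kernel=11, out_h=55, out_w=55)

print("small layer: weights=%d, MACs=%d" % small)
print("large layer: weights=%d, MACs=%d" % large)
```

      Under these assumed shapes, the larger layer needs several hundred times more MACs than the smaller one, which is the kind of gap that translates into higher energy and lower throughput.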

      Many AI tasks come with publicly available datasets for evaluating the accuracy of a given DNN. Public datasets are important for comparing the accuracy of different approaches. The simplest and most common task in computer vision is image classification: given an entire image, select the one of N classes to which the image most likely belongs. There is no localization or detection.
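      Selecting one of N classes amounts to picking the class with the highest score, and accuracy is the fraction of images for which that pick matches the label. The sketch below assumes the DNN emits one score per class as a plain list; the helper names and the toy scores are hypothetical. The parameter k generalizes the check to the k highest-scoring classes.

```python
def top_k_classes(scores, k=1):
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def top_k_accuracy(all_scores, labels, k=1):
    """Fraction of samples whose true label is among the top-k predictions."""
    hits = sum(1 for scores, label in zip(all_scores, labels)
               if label in top_k_classes(scores, k))
    return hits / len(labels)


# Toy example with N = 3 classes and three images (made-up scores).
scores = [
    [0.1, 0.7, 0.2],  # predicted class 1
    [0.5, 0.3, 0.2],  # predicted class 0
    [0.2, 0.3, 0.5],  # predicted class 2
]
labels = [1, 1, 2]

print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```

      In this toy example the second image is misclassified when only the best class counts, but its true label is the second-best score, so it is counted as correct for k = 2.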

      MNIST is a widely used dataset for digit classification that was introduced in 1998 [103]. It consists of 28×28 pixel grayscale images of handwritten digits. There are 10 classes (one per digit), 60,000 training images, and 10,000 test images. LeNet-5 achieved an accuracy of 99.05% when MNIST was first introduced. Since then, the accuracy has increased to 99.79% using regularization of neural networks with DropConnect [104]. Thus, MNIST is now considered a fairly easy dataset.

      CIFAR is a dataset of 32×32 pixel color images of various objects, which was released in 2009 [105]. CIFAR is a subset of the 80 Million Tiny Images dataset [106]. CIFAR-10 is composed of 10 mutually exclusive classes. There are 50,000 training images (5000 per class) and 10,000 test images (1000 per class). A two-layer convolutional deep belief network achieved 64.84% accuracy on CIFAR-10 when it was first introduced [107]. Since then, the accuracy has increased to 96.53% using fractional max pooling [108].

      ImageNet is a large-scale image dataset that was first introduced in 2010; the dataset stabilized in 2012 [23]. It contains color images of 256×256 pixels in 1000 classes. The classes are defined using WordNet as a backbone to handle ambiguous word meanings and to combine synonyms into the same object category. In other words, there is a hierarchy for the ImageNet categories. The 1000 classes were selected such that there is no overlap in the ImageNet hierarchy. The ImageNet dataset contains many fine-grained categories, including 120 different breeds of dogs. There are 1.3M training images (732 to 1300 per class), 100,000 testing images (100 per class), and 50,000 validation images (50 per class).

      To summarize the various image classification datasets, MNIST is a fairly easy dataset, while ImageNet is a more challenging one with a wider coverage of classes. Thus, when evaluating the accuracy of a given DNN, it is important to consider the dataset on which the accuracy is measured.
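      The difference in scale between these datasets can be tabulated directly from the figures quoted above. The sketch below is illustrative only: it assumes one value per pixel for grayscale images and three for color images, and it uses the accuracy of uniform random guessing (1/N for N classes) as a crude baseline for difficulty; the dictionary layout and function names are hypothetical.

```python
# Image size, class count, and training-set size as quoted in the text.
# Channel counts are an assumption: 1 for grayscale, 3 for color.
DATASETS = {
    "MNIST":    {"side": 28,  "channels": 1, "classes": 10,   "train": 60_000},
    "CIFAR-10": {"side": 32,  "channels": 3, "classes": 10,   "train": 50_000},
    "ImageNet": {"side": 256, "channels": 3, "classes": 1000, "train": 1_300_000},
}


def values_per_image(name):
    """Number of input values (pixels times channels) in one image."""
    d = DATASETS[name]
    return d["side"] * d["side"] * d["channels"]


def chance_accuracy(name):
    """Accuracy of guessing uniformly at random among the classes."""
    return 1.0 / DATASETS[name]["classes"]


for name in DATASETS:
    print(f"{name:9s} {values_per_image(name):7d} values/image, "
          f"chance accuracy {chance_accuracy(name):.1%}")
```

      Under these assumptions, an ImageNet image carries roughly 250 times as many input values as an MNIST image, and random guessing is 100 times less likely to be correct, which is consistent with ImageNet requiring much larger models.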

      Since state-of-the-art DNNs now exceed human-level accuracy on image classification tasks, the ImageNet Challenge has started to focus on more difficult tasks such as single-object localization and object detection. For single-object localization, the target object must be localized and classified (out of 1000 classes). The DNN outputs the top five categories and top five bounding box locations. There is no penalty for identifying an object that is in the image but not included in the ground truth. For object detection, all objects in the image must be localized and classified (out of 200 classes). The bounding boxes for all objects in these categories must be labeled. Detections of objects that are not labeled are penalized, as are duplicate detections.
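      The text does not state how a predicted bounding box is judged to be correctly localized. A commonly used criterion, assumed here, is that the intersection over union (IoU) between the predicted and ground-truth boxes must reach a threshold such as 0.5. The sketch below applies that assumed criterion to a list of up to five (label, box) guesses; the box format (x_min, y_min, x_max, y_max), the function names, and the example values are all hypothetical.

```python
def box_area(box):
    """Area of a box given as (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def localization_correct(predictions, true_label, true_box, threshold=0.5):
    """True if any of the first five (label, box) guesses has the right
    label and overlaps the ground-truth box by at least the threshold."""
    return any(label == true_label and iou(box, true_box) >= threshold
               for label, box in predictions[:5])


# Made-up example: the second guess has the right label and a good box.
guesses = [("cat", (0, 0, 2, 2)), ("dog", (0, 0, 10, 10))]
print(localization_correct(guesses, "dog", (0, 0, 10, 8)))
```

      Because only the labeled target object is checked, extra guesses for other objects in the image do not reduce the score, which matches the no-penalty rule described above for single-object localization.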

      Beyond ImageNet, there are other popular image datasets for computer vision tasks. For object detection, there is the PASCAL VOC (2005-2012) dataset, which contains 11k images representing 20 classes (27k object instances, 7k of which have detailed segmentation) [109]. For object detection, segmentation, and recognition in context, there is the MS COCO dataset, with 2.5M labeled instances in 328k images (91 object categories) [110]. Compared to ImageNet, COCO has fewer categories but more instances per category, which is useful for precise 2-D localization. COCO also has more labeled instances per image, which can potentially help with contextual information.

      Undoubtedly, both larger datasets and datasets for new domains will serve as important resources for profiling and exploring the efficiency of future DNN engines.

      The development resources presented in this section enable us to evaluate hardware using the appropriate DNN model and dataset. In particular, it is important to realize that difficult tasks typically require larger models; for instance, LeNet would not apply to the ImageNet Challenge. In addition, different datasets are required for different tasks; for instance, self-driving cars require high-definition video, and thus a network trained on the low-resolution ImageNet dataset may not be sufficient. To address these requirements, the number of datasets continues to grow at a rapid pace.