Vivienne Sze

Efficient Processing of Deep Neural Networks


dynamically adapt at runtime to changes in the DNN model or input data, while still maximally exploiting the flexibility of the hardware to improve efficiency.

      In summary, to assess the flexibility of DNN processors, their efficiency (e.g., inferences per second, inferences per joule) should be evaluated on a wide range of DNN models. The MLPerf benchmarking workloads are a good start; however, additional workloads may be needed to represent efficiency techniques such as efficient network architectures, reduced precision, and sparsity. The workloads should match the desired application. Ideally, since there can be many possible combinations, it would also be beneficial to define the range and limits of DNN models that can be efficiently supported on a given platform (e.g., maximum number of weights per filter or per DNN model, minimum amount of sparsity, required structure of the sparsity, levels of precision such as 8-bit, 4-bit, 2-bit, or 1-bit, types of layers and activation functions, etc.).
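      To make such a specification concrete, one could imagine capturing the platform's limits in a small data structure and checking a candidate DNN model against it. The Python sketch below is purely illustrative; the field names (max_weights_per_filter, supported_precisions, and so on) are hypothetical and do not correspond to any particular benchmark or accelerator.

from dataclasses import dataclass

@dataclass
class AcceleratorLimits:
    # Hypothetical envelope of DNN models a platform supports efficiently.
    max_weights_per_filter: int = 4608          # e.g., a 3x3x512 filter
    max_total_weights: int = 50_000_000
    min_weight_sparsity: float = 0.0            # minimum fraction of zero weights
    supported_precisions: tuple = (8, 4)        # bit widths handled natively
    supported_layer_types: tuple = ("conv", "depthwise", "fc")

def fits(model: dict, limits: AcceleratorLimits) -> bool:
    """Check whether a model summary stays within the platform's efficient range."""
    return (model["max_weights_per_filter"] <= limits.max_weights_per_filter
            and model["total_weights"] <= limits.max_total_weights
            and model["weight_sparsity"] >= limits.min_weight_sparsity
            and model["precision_bits"] in limits.supported_precisions
            and all(t in limits.supported_layer_types for t in model["layer_types"]))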

      Scalability has become increasingly important due to the wide range of use cases for DNNs and the emerging technologies used for scaling up not just the size of the chip, but also for building systems with multiple chips (often referred to as chiplets) [123] or even wafer-scale chips [124]. Scalability refers to how well a design can be scaled up to achieve higher throughput and energy efficiency when increasing the amount of resources (e.g., the number of PEs and on-chip storage). This evaluation is done under the assumption that the system does not have to be significantly redesigned (e.g., the design only needs to be replicated) since major design changes can be expensive in terms of time and cost. Ideally, a scalable design can be used for low-cost embedded devices and high-performance devices in the cloud simply by scaling up the resources.

      Ideally, the throughput would scale linearly and proportionally with the number of PEs. Similarly, the energy efficiency would also improve with more on-chip storage; however, this would likely be nonlinear (e.g., increasing the on-chip storage such that the entire DNN model fits on chip would result in an abrupt improvement in energy efficiency). In practice, this is often challenging due to factors such as the reduced utilization of PEs and the increased cost of data movement due to long distance interconnects.
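      As a rough illustration of why the scaling is rarely ideal, the following sketch compares linear throughput scaling against a version where PE utilization degrades as the array grows; the utilization curve and all the numbers are made up for illustration, not measured from any design.

def ideal_throughput(num_pes, macs_per_pe_per_sec, macs_per_inference):
    # Ideal case: every PE stays fully utilized, so throughput scales linearly.
    return num_pes * macs_per_pe_per_sec / macs_per_inference

def realistic_throughput(num_pes, macs_per_pe_per_sec, macs_per_inference, utilization):
    # In practice, mapping constraints and long interconnects reduce utilization
    # as the array grows, so throughput scales sub-linearly.
    return utilization(num_pes) * ideal_throughput(num_pes, macs_per_pe_per_sec,
                                                   macs_per_inference)

# Hypothetical utilization curve that decays with array size.
utilization = lambda n: 1.0 / (1.0 + n / 8192)

for n in (256, 1024, 4096, 16384):
    print(n, ideal_throughput(n, 1e9, 2e9),
          realistic_throughput(n, 1e9, 2e9, utilization))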

      Scalability can be connected with cost efficiency by considering how inferences per second per cost (e.g., $) and inferences per joule per cost change with scale. For instance, if throughput increases linearly with the number of PEs, then the inferences per second per cost would be constant. It is also possible for the inferences per second per cost to improve super-linearly with increasing number of PEs, due to increased sharing of data across PEs.
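      As an illustrative example with made-up numbers: if a design with 1,000 PEs delivers 1,000 inferences per second at a cost of $10, and a scaled-up version with 4,000 PEs delivers 4,000 inferences per second at a cost of $40, both provide 100 inferences per second per dollar. If the larger array instead reaches 4,800 inferences per second because each weight fetched from memory is reused across more PEs, the throughput per cost improves super-linearly to 120 inferences per second per dollar.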

      In summary, to understand the scalability of a DNN accelerator design, it is important to report its performance and efficiency metrics as the number of PEs and storage capacity increases. This may include how well the design might handle technologies used for scaling up, such as inter-chip interconnect.

      In summary, the evaluation process for whether a DNN system is a viable solution for a given application might go as follows (a simple sketch of this flow appears after the list):

      1. the accuracy determines if it can perform the given task;

      2. the latency and throughput determine if it can run fast enough and in real time;

      3. the energy and power consumption will primarily dictate the form factor of the device where the processing can operate;

      4. the cost, which is primarily dictated by the chip area and external memory bandwidth requirements, determines how much one would pay for this solution;

      5. flexibility determines the range of tasks it can support; and

      6. the scalability determines whether the same design effort can be amortized for deployment in multiple domains (e.g., in the cloud and at the edge), and if the system can efficiently be scaled with DNN model size.
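      One way to picture this flow is as a simple requirements check. The Python sketch below is hypothetical; the metric and threshold names are placeholders that would be defined by the target application.

def is_viable(metrics: dict, reqs: dict) -> bool:
    # Each check mirrors one step of the evaluation flow above;
    # the field names are illustrative, not standardized.
    checks = [
        metrics["accuracy"] >= reqs["min_accuracy"],                    # 1. performs the task
        metrics["latency_s"] <= reqs["max_latency_s"],                  # 2. fast enough / real time
        metrics["inferences_per_s"] >= reqs["min_inferences_per_s"],
        metrics["power_w"] <= reqs["power_budget_w"],                   # 3. fits the form factor
        metrics["cost_usd"] <= reqs["cost_budget_usd"],                 # 4. affordable
        reqs["required_workloads"] <= metrics["supported_workloads"],   # 5. flexibility (set inclusion)
        metrics["scales_to_targets"],                                   # 6. scalability
    ]
    return all(checks)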

      A DNN-centric version of Little’s Law would have throughput measured in inferences per second, latency measured in seconds, and inferences-in-flight, as the tasks-in-flight equivalent, measured in the number of images in a batch being processed simultaneously. This helps to explain why increasing the number of inferences in flight to increase throughput may be counterproductive because some techniques that increase the number of inferences in flight (e.g., batching) also increase latency.
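      In equation form, Little's Law here reads: throughput (inferences per second) = inferences in flight / latency (seconds). A small illustrative calculation, using made-up numbers, shows how batching can raise throughput while also raising latency:

def throughput_ips(inferences_in_flight, latency_s):
    # Little's Law applied to DNN inference.
    return inferences_in_flight / latency_s

print(throughput_ips(1, 0.010))    # no batching: 100 inferences/s at 10 ms latency
print(throughput_ips(16, 0.080))   # batch of 16: 200 inferences/s, but 80 ms latency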
