the total number of cycles on a single RISC-V core is 1.5 × 8.2 = 12.3 Mcycles.
Table 1.4. Performance data for the CIFAR-10 CNN graph
# | Layer type | ARC EM9D [Mcycles] | Processor A [Mcycles] | Processor B (RISC-V ISA) [Mcycles] |
0 | Permute | 0.01 | – | – |
1 | Convolution | 1.63 | 6.78 | – |
2 | Max Pooling | 0.14 | 0.34 | – |
3 | Convolution | 3.46 | 9.25 | – |
4 | Avg Pooling | 0.09 | 0.09 | – |
5 | Convolution | 1.76 | 4.88 | – |
6 | Avg Pooling | 0.07 | 0.04 | – |
7 | Fully-connected | 0.03 | 0.02 | – |
8 | Fully-connected | 0.001 | – | – |
Total | | 7.2 | 21.4 | 12.3 |
From Table 1.4, we conclude that the ARC EM9D processor spends about 3x fewer cycles than processor A (7.2 vs. 21.4 Mcycles) and about 1.7x fewer cycles than the RISC-V core, processor B (7.2 vs. 12.3 Mcycles), for the same machine learning inference task, without using any dedicated accelerators. This cycle efficiency allows the ARC EM9D processor to be clocked at a lower frequency, which helps to save power in a smart IoT edge device.
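The totals and ratios quoted above can be reproduced from the per-layer figures in Table 1.4. The following Python sketch is illustrative only. The helper names (`total_mcycles`, `required_clock_mhz`) and the rate of 10 inferences per second are assumptions introduced here, not values from the benchmark. The clock estimate also assumes that the core runs nothing but inference and that the cycle counts already include all memory stalls.

```python
# Per-layer cycle counts in Mcycles, copied from Table 1.4.
# None marks a layer for which the table gives no separate figure.
ARC_EM9D = [0.01, 1.63, 0.14, 3.46, 0.09, 1.76, 0.07, 0.03, 0.001]
PROCESSOR_A = [None, 6.78, 0.34, 9.25, 0.09, 4.88, 0.04, 0.02, None]

# Processor B (RISC-V) is reported only as a total: 1.5 x 8.2 Mcycles.
PROCESSOR_B_TOTAL = 1.5 * 8.2


def total_mcycles(layers):
    """Sum the per-layer cycle counts, skipping layers without a figure."""
    return sum(cycles for cycles in layers if cycles is not None)


def required_clock_mhz(mcycles_per_inference, inferences_per_second):
    """Minimum clock in MHz: Mcycles per inference times inferences per second.

    Hypothetical helper; assumes the core is fully dedicated to inference.
    """
    return mcycles_per_inference * inferences_per_second


arc_total = total_mcycles(ARC_EM9D)
a_total = total_mcycles(PROCESSOR_A)

ratio_vs_a = a_total / arc_total
ratio_vs_b = PROCESSOR_B_TOTAL / arc_total

print(f"ARC EM9D total:    {arc_total:.1f} Mcycles")
print(f"Processor A total: {a_total:.1f} Mcycles ({ratio_vs_a:.1f}x ARC EM9D)")
print(f"Processor B total: {PROCESSOR_B_TOTAL:.1f} Mcycles ({ratio_vs_b:.1f}x ARC EM9D)")

# Assumed workload of 10 inferences per second (not from the source).
for name, total in [("ARC EM9D", arc_total),
                    ("Processor A", a_total),
                    ("Processor B", PROCESSOR_B_TOTAL)]:
    print(f"{name}: at least {required_clock_mhz(total, 10):.0f} MHz")
```

Because the required clock frequency scales linearly with the cycle count, the cycle ratios carry over directly to the minimum clock frequency for any fixed inference rate.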
1.4. Conclusion
Smart IoT edge devices that interact intelligently with their users are appearing in many application areas. These devices have diverse compute requirements, including a mixture of control processing, DSP and machine learning. Versatile processors are required to efficiently execute these different types of workloads. Furthermore, these processors must allow for easy customization to improve their efficiency for a specific application. Configurability and extensibility are two key mechanisms that provide such customization.
Increasingly, IoT edge devices apply machine learning technology for processing captured sensor data, so that smart actions can be taken based on recognized patterns. We presented key processor features and a software library for the efficient implementation of low/mid-end machine learning inference. More specifically, we highlighted several processor capabilities, such as vector MAC instructions and XY memory with advanced AGUs, that are key to the efficient implementation of machine learning inference.
The ARC EM9D processor is a universal processor for low-power IoT applications which is both configurable and extensible. The complete and highly optimized embARC MLI library makes effective use of the ARC EM9D processor to efficiently support a wide range of low/mid-end machine learning applications. We demonstrated this efficiency with excellent results for the CIFAR-10 benchmark.
1.5. References
Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J., Chen, J., Chen, Z., Chrzanowski, M., Coates, A., Diamos, G., Ding, K., Du, N., Elsen, E., Engel, J., Fang, W., Fan, L., Fougner, C., Gao, L., Gong, C., Hannun, A., Han, T., Johannes, L.V., Jiang, B., Ju, C., Jun, B., LeGresley, P., Lin, L., Liu, J., Liu, Y., Li, W., Li, X., Ma, D., Narang, S., Ng, A., Ozair, S., Peng, Y., Prenger, R., Qian, S., Quan, Z., Raiman, J., Rao, V., Satheesh, S., Seetapun, D., Sengupta, S., Srinet, K., Sriram, A., Tang, H., Tang, L., Wang, C., Wang, J., Wang, K., Wang, Y., Wang, Z., Wang, Z., Wu, S., Wei, L., Xiao, B., Xie, W., Xie, Y., Yogatama, D., Yuan, B., Zhan, J., Zhu, Z. (2016). Deep speech 2: End-to-end speech recognition in English and Mandarin. Proceedings of the 33rd International Conference on Machine Learning – Volume 48, ICML-16, 173–182.
Croome, M. (2018). Using RISC-V in high computing, ultra-low power, programmable circuits for inference on battery operated edge devices [Online]. Available at: https://content.riscv.org/wp-content/uploads/2018/07/Shanghai-1325_GreenWaves_Shanghai-2018-MC-V2.pdf.
Dutt, N. and Choi, K. (2003). Configurable processors for embedded computing. IEEE Computer, 36(1), 120–123.
embARC Open Software Platform (2019). Available at: https://embarc.org/.
Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A.G., Adam, H., and Kalenichenko, D. (2017). Quantization and training of neural networks for efficient integer-arithmetic-only inference. Computing Research Repository. Available at: http://arxiv.org/abs/1712.05877.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R.B., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. Computing Research Repository. Available at: https://arxiv.org/abs/1408.5093.
Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical Report, University of Toronto.
Lai, L., Suda, N., and Chandra, V. (2018). CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs. Computing Research Repository. Available at: http://arxiv.org/abs/1801.06601.
Petrov-Savchenko, A. and van der Wolf, P. (2018). Get smart with NB-IoT: Efficient low-cost implementation of NB-IoT for smart applications. Technical paper, Synopsys [Online]. Available at: https://www.synopsys.com/dw/doc.php/wp/NB_IoT_for_Smart_Applications.pdf.