Architecture Design of Target Detection Hardware Accelerator Based on Heterogeneous FPGA
Abstract: In recent years, with continuous breakthroughs in the algorithm field, target detection algorithms have become increasingly computation-intensive. In the forward-inference stage, many practical applications face low-latency requirements and strict power budgets, so building a low-power, low-cost, high-performance target detection platform has attracted growing attention. As a high-performance, reconfigurable, and low-cost embedded platform, the Field Programmable Gate Array (FPGA) is becoming a key technology for deploying such algorithms. To meet these requirements, this paper proposes a low-power target detection accelerator architecture on an FPGA+SoC (System on Chip) heterogeneous platform, combining several hardware acceleration methods: coarse- and fine-grained optimization, fixed-point parameter quantization, and data reordering. Addressing the design limitations of existing work on Zynq 7000 series FPGAs, the paper presents a new multi-dimensional hardware acceleration of the YOLOv2 (You Only Look Once) algorithm, and analyzes and models the accelerator's performance and resource consumption in depth to verify the soundness of the architecture. To make full use of on-chip hardware resources, each module receives a dedicated optimized design; in particular, the often-overlooked low-level data access path is improved, which effectively reduces system transfer latency and raises the effective utilization of bus bandwidth. Converting floating-point parameters to fixed point further lowers the FPGA's processing load and speeds up inference.
Experiments show that the architecture achieves 26.98 GOPs on the PYNQ-Z2 platform, about 38.71% higher than existing FPGA-based target detection platforms, while consuming only 2.96 W, which is of practical significance for deploying target detection algorithms.
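The fixed-point parameter processing mentioned above can be illustrated with a minimal software sketch. The exact number format used by the accelerator is not specified in this excerpt beyond "Fixed-16", so a Q8.8 layout (8 integer bits, 8 fractional bits) is assumed here for illustration:

```python
def to_fixed16(x, frac_bits=8):
    """Quantize a float to a signed 16-bit fixed-point code (round to nearest).

    Assumes a Q8.8 format; the paper's actual scaling may differ.
    """
    scaled = int(round(x * (1 << frac_bits)))
    # Saturate to the int16 range so the value fits a 16-bit register.
    return max(-32768, min(32767, scaled))

def to_float(q, frac_bits=8):
    """Recover the approximate real value from the fixed-point code."""
    return q / (1 << frac_bits)

w = 0.7391                  # example weight value
q = to_fixed16(w)           # -> 189
approx = to_float(q)        # -> 0.73828125, quantization error < 2^-9
```

The quantization error is bounded by half of the least significant fractional step (2^-9 in Q8.8), which is typically small relative to CNN weight magnitudes.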
Table 1. Comparison of resource consumption with different data precisions

Operation (data precision)    DSP    LUT
Adder (Float-32)                2    214
Multiplier (Float-32)           3    135
Adder (Fixed-16)                −     47
Multiplier (Fixed-16)           1    101
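The resource gap in Table 1 reflects that a Fixed-16 multiply-accumulate reduces to plain integer arithmetic, whereas Float-32 operators need DSP slices and extra logic for alignment and normalization. A minimal software model of the integer MAC (assuming the same hypothetical Q8.8 format as above):

```python
def mac_fixed16(acc, a_q, b_q, frac_bits=8):
    """Multiply two Q8.8 operands and accumulate into a wider accumulator.

    The product of two Q8.8 values is Q16.16; shifting right by frac_bits
    returns it to Q8.8 before accumulation, mirroring a hardware MAC unit
    with a wide accumulator register.
    """
    prod = (a_q * b_q) >> frac_bits
    return acc + prod

# Accumulate 0.5 * 2.0 twice: 128 and 512 are 0.5 and 2.0 in Q8.8.
acc = 0
acc = mac_fixed16(acc, 128, 512)   # += 1.0 (256 in Q8.8)
acc = mac_fixed16(acc, 128, 512)   # acc is now 512, i.e. 2.0 in Q8.8
```

In hardware, this integer multiply-shift-add maps directly onto one DSP48E slice plus a small amount of LUT logic, consistent with the Fixed-16 rows of Table 1.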
Table 2. Accelerator resource consumption

Resource    Used     Available   Utilization/%
LUT         35 977   53 200      67.62
FF          32 049   106 400     30.12
BRAM_18K    178      280         63.57
DSP48E      152      220         69.09
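The utilization figures in Table 2 follow directly from the used and available counts; a quick check against two of the rows:

```python
def utilization(used, available):
    """Resource utilization in percent, rounded to two decimals as in Table 2."""
    return round(100 * used / available, 2)

# DSP48E row: 152 of 220 slices; BRAM_18K row: 178 of 280 blocks.
assert utilization(152, 220) == 69.09
assert utilization(178, 280) == 63.57
```

DSP and BRAM utilization near 70% indicates the design is bounded by arithmetic and on-chip buffering rather than by flip-flops, which sit at only about 30%.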
Table 3. Comparison with other FPGA accelerator designs