29 7.8 0.12
A5 259 3.9 0.12
A6 246 4.1 0.13
A7 492 2.0 0.13
A8 140 7.1 0.

Future Internet 2021, 13

[Figure 9: plot with axes Number of Cores and Frames per Second (FPS). Plotted configurations: A1 (13,8), A2 (13,4), A3 (13,2), A4 (8,8), A5 (8,4), A6 (4,8), A7 (4,4), A8 (13,4).]

Figure 9. The number of cores versus frames per second for each configuration of the architecture. The labels in the graph give each configuration as (number of lines of cores, number of columns of cores).

Table 9 presents the Tiny-YOLOv3 network execution times on multiple platforms: Intel i7-8700 @ 3.2 GHz, GPU RTX 2080ti, and the embedded GPUs Jetson TX2 and Jetson Nano. The CPU and GPU results were obtained using the original Tiny-YOLOv3 network [42] with floating-point representation. The CPU result corresponds to the execution of Tiny-YOLOv3 implemented in C. The GPU result was obtained from the execution of Tiny-YOLOv3 in the PyTorch environment using CUDA libraries.

Table 9. Tiny-YOLOv3 execution times on multiple platforms.

| Software Version | Platform | CNN (ms) | FPS |
|---|---|---|---|
| Floating-point | CPU (Intel i7-8700 @ 3.2 GHz) | 819.2 | 1.2 |
| Floating-point | GPU (RTX 2080ti) | 7.5 | 65.0 |
| Floating-point | eGPU (Jetson TX2) [43] | - | 17 |
| Floating-point | eGPU (Jetson Nano) [43] | - | 1.2 |
| Fixed-point-16 | ZYNQ7020 | 140 | 7.1 |
| Fixed-point-8 | ZYNQ7020 | 68 | 14.7 |

Tiny-YOLOv3 on a desktop CPU is too slow. The inference time on an RTX 2080ti GPU showed a 109× speedup versus the desktop CPU. Using the proposed accelerator, the inference times were 140 and 68 ms on the ZYNQ7020. The low-cost FPGA was 6× (16-bit) and 12× (8-bit) faster than the CPU, with a small drop in accuracy of 1.4 and 2.1 points, respectively. Compared with the embedded GPU Jetson TX2, the proposed architecture was 15% slower. The advantage of using the FPGA is the energy consumption.
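The speedup figures quoted in this paragraph can be re-derived from Table 9. The following is a minimal sketch, not code from the paper: the dictionary layout, key names, and helper functions are illustrative assumptions, and the 8-bit frame rate of 14.7 FPS is the value reported for the same ZYNQ7020 configuration in Table 10.

```python
# Values copied from Table 9 (key names are illustrative, not from the paper).
CNN_MS = {"cpu": 819.2, "gpu": 7.5, "zynq_16bit": 140.0, "zynq_8bit": 68.0}
FPS = {
    "cpu": 1.2,
    "gpu": 65.0,
    "jetson_tx2": 17.0,
    "jetson_nano": 1.2,
    "zynq_16bit": 7.1,
    "zynq_8bit": 14.7,  # taken from Table 10 for the same configuration
}


def time_speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Speedup from execution times: values above 1 mean the candidate is faster."""
    return baseline_ms / candidate_ms


def fps_ratio(candidate_fps: float, baseline_fps: float) -> float:
    """Speedup from frame rates: values above 1 mean the candidate is faster."""
    return candidate_fps / baseline_fps


if __name__ == "__main__":
    print(f"GPU vs CPU (CNN time):   {time_speedup(CNN_MS['cpu'], CNN_MS['gpu']):.1f}x")
    print(f"FPGA 16-bit vs CPU:      {fps_ratio(FPS['zynq_16bit'], FPS['cpu']):.1f}x")
    print(f"FPGA 8-bit vs CPU:       {fps_ratio(FPS['zynq_8bit'], FPS['cpu']):.1f}x")
    print(f"FPGA 8-bit vs Jetson TX2:  {fps_ratio(FPS['zynq_8bit'], FPS['jetson_tx2']):.2f}x")
    print(f"FPGA 8-bit vs Jetson Nano: {fps_ratio(FPS['zynq_8bit'], FPS['jetson_nano']):.1f}x")
```

The GPU speedup is computed from the CNN execution times, while the FPGA comparisons use frame rates, since Table 9 lists no CNN time for the embedded GPUs. Ratios computed from the rounded table entries may differ slightly from the figures quoted in the text.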
The Jetson TX2 has a power consumption close to 15 W, while the proposed accelerator consumes about 0.5 W. The Nvidia Jetson Nano consumes a maximum of 10 W but is around 12× slower than the proposed architecture.

5.3. Comparison with Other FPGA Implementations

The proposed implementation was compared with previous accelerators of Tiny-YOLOv3. We report the quantization, the operating frequency, the occupation of FPGA resources (DSPs, LUTs, and BRAMs), and two performance metrics (execution time and frames per second). Additionally, we considered three metrics to quantify how efficiently the hardware resources were being used. Since different solutions usually have a different number of resources, it is fair to normalize the results before comparison. FPS/kLUT, FPS/DSP, and FPS/BRAM give the frames per second achieved per thousand LUTs, per DSP, and per BRAM, respectively. The higher these values, the higher the utilization efficiency of those resources (see Table 10).

Table 10. Performance comparison with other FPGA implementations.

| | [38] | [39] | [40] | [41] | Ours | Ours |
|---|---|---|---|---|---|---|
| Device | ZYNQ ZU9EG | ZYNQ7020 | Virtex VX485T | US XCKU040 | ZYNQ7020 | ZYNQ7020 |
| Dataset | Pedestrian signs | COCO dataset | COCO dataset | COCO dataset | COCO dataset | COCO dataset |
| Quant. | 8 | 16 | 18 | 16 | 16 | 8 |
| Freq. (MHz) | - | 100 | 200 | 143 | 100 | 100 |
| DSPs | - | 120 | 2304 | 832 | 208 | 208 |
| LUTs | - | 26 K | 49 K | 139 K | 27.5 K | 33.4 K |
| BRAMs | - | 93 | 70 | 384 | 120 | 120 |
| Exec. (ms) | 9.6 | 532.0 | - | 24.4 | 140 | 68 |
| FPS | 104 | 1.9 | - | 32 | 7.1 | 14.7 |
| FPS/kLUT | - | 0.07 | - | 0.23 | 0.26 | 0.44 |
| FPS/DSP | - | 0.016 | - | 0.038 | 0.034 | 0.068 |
| FPS/BRAM | - | 0.020 | - | 0. | 0. | 0. |

The implementation in [39] is the only previous implementation on a Zynq 7020 SoC FPGA. This device has significantly fewer resources than the devices used in the other works. Our architecture implemented in the same device was 3.7× and 7.4× faster for the 16-bit and 8-bit versions, respectively.
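The three normalized metrics are simple ratios of frame rate to resource count. The sketch below recomputes them from the FPS, LUT, DSP, and BRAM figures reported in Table 10. The `Design` class, the `efficiency` function, and the design labels are hypothetical names introduced only for illustration, and the assignment of values to designs follows the order in which they appear in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """One FPGA implementation, with figures as reported in Table 10."""
    name: str
    fps: float      # frames per second
    kluts: float    # LUTs, in thousands
    dsps: int
    brams: int


def efficiency(d: Design) -> dict[str, float]:
    """Frames per second obtained per unit of each FPGA resource."""
    return {
        "FPS/kLUT": d.fps / d.kluts,
        "FPS/DSP": d.fps / d.dsps,
        "FPS/BRAM": d.fps / d.brams,
    }


# Labels are illustrative; numbers are copied from Table 10.
DESIGNS = [
    Design("ZYNQ7020 [39], 16-bit", 1.9, 26.0, 120, 93),
    Design("US XCKU040, 16-bit", 32.0, 139.0, 832, 384),
    Design("Ours ZYNQ7020, 16-bit", 7.1, 27.5, 208, 120),
    Design("Ours ZYNQ7020, 8-bit", 14.7, 33.4, 208, 120),
]

if __name__ == "__main__":
    for d in DESIGNS:
        metrics = "  ".join(f"{k}={v:.3f}" for k, v in efficiency(d).items())
        print(f"{d.name:24s} {metrics}")
```

Because the inputs are the rounded values printed in the table, some recomputed ratios can differ in the last digit from the published ones, and the published FPS/BRAM figures may rest on a different BRAM counting convention.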