29 7.8 0.12
A5 259 3.9 0.12
A6 246 4.1 0.13
A7 492 2.0 0.13
A8 140 7.1 0.

Future Internet 2021, 13, p. 16

[Figure 9: plot of Number of Cores (vertical axis) versus Frames per Second (FPS) (horizontal axis). Labeled configurations: A1 (13,8), A2 (13,4), A3 (13,2), A4 (8,8), A5 (8,4), A6 (4,8), A7 (4,4), A8 (13,4).]

Figure 9. The number of cores versus frames per second of each configuration of the architecture. The graphs indicate the configuration as (number of lines of cores, number of columns of cores).

Table 9 presents the Tiny-YOLOv3 network execution times on multiple platforms: Intel i7-8700 @ 3.2 GHz, GPU RTX 2080ti, and the embedded GPUs Jetson TX2 and Jetson Nano. The CPU and GPU results were obtained using the original Tiny-YOLOv3 network [42] with floating-point representation. The CPU result corresponds to the execution of Tiny-YOLOv3 implemented in C. The GPU result was obtained from the execution of Tiny-YOLOv3 in the PyTorch environment using CUDA libraries.

Table 9. Tiny-YOLOv3 execution times on multiple platforms.

Software Version   Platform                         CNN (ms)   FPS
Floating-point     CPU (Intel i7-8700 @ 3.2 GHz)    819.2      1.2
Floating-point     GPU (RTX 2080ti)                 7.5        65.0
Floating-point     eGPU (Jetson TX2) [43]           -          17
Floating-point     eGPU (Jetson Nano) [43]          -          1.2
Fixed-point-16     ZYNQ7020                         140        7.1
Fixed-point-8      ZYNQ7020                         68         14.7

Tiny-YOLOv3 on desktop CPUs is too slow. The inference time on an RTX 2080ti GPU showed a 109× speedup versus the desktop CPU. Using the proposed accelerator, the inference times were 140 ms (16-bit) and 68 ms (8-bit) on the ZYNQ7020. The low-cost FPGA was 6X (16-bit) and 12X (8-bit) faster than the CPU, with a small drop in accuracy of 1.4 and 2.1 points, respectively. Compared to the embedded GPU (Jetson TX2), the proposed architecture was about 15% slower. The advantage of using the FPGA is the energy consumption. The Jetson TX2 has a power close to 15 W, while the proposed accelerator has a power of around 0.5 W.
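The speedup figures quoted above can be checked directly from the Table 9 latencies. The following is a minimal sketch, not part of the original work: the helper names `speedup` and `energy_per_frame_j` are hypothetical, and the energy-per-frame estimate assumes that the quoted power figures (about 15 W and about 0.5 W) are average power during inference, which the text does not state.

```python
# Latencies transcribed from Table 9 (CNN execution time, in ms).
CPU_MS = 819.2      # Intel i7-8700, floating-point
GPU_MS = 7.5        # RTX 2080ti, floating-point
FPGA16_MS = 140.0   # ZYNQ7020, 16-bit fixed-point
FPGA8_MS = 68.0     # ZYNQ7020, 8-bit fixed-point


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Ratio of baseline latency to candidate latency (>1 means faster)."""
    return baseline_ms / candidate_ms


def energy_per_frame_j(power_w: float, fps: float) -> float:
    """Rough energy per frame in joules.

    Assumption (not from the source): power_w is the average power
    drawn while inference runs at the given frame rate.
    """
    return power_w / fps


if __name__ == "__main__":
    print(f"GPU vs CPU:         {speedup(CPU_MS, GPU_MS):.1f}x")
    print(f"FPGA 16-bit vs CPU: {speedup(CPU_MS, FPGA16_MS):.1f}x")
    print(f"FPGA 8-bit vs CPU:  {speedup(CPU_MS, FPGA8_MS):.1f}x")

    # FPS for the FPGA derived from its CNN latency; TX2 FPS taken from Table 9.
    fpga8_fps = 1000.0 / FPGA8_MS
    print(f"Jetson TX2:  {energy_per_frame_j(15.0, 17.0):.3f} J/frame (estimate)")
    print(f"FPGA 8-bit:  {energy_per_frame_j(0.5, fpga8_fps):.3f} J/frame (estimate)")
```

The latency ratios round to the 109×, 6X, and 12X values reported in the text. The joules-per-frame numbers are only an illustration of why the power figures matter and should not be read as measured results.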
The Nvidia Jetson Nano consumes a maximum of 10 W but is about 12× slower than the proposed architecture.

5.3. Comparison with Other FPGA Implementations

The proposed implementation was compared with previous accelerators of Tiny-YOLOv3. We report the quantization, the operating frequency, the occupation of FPGA resources (DSPs, LUTs, and BRAMs), and two performance metrics (execution time and frames per second). Additionally, we considered three metrics to quantify how efficiently the hardware resources were being used. Since different solutions typically have a different number of resources, it is fair to normalize the results before comparison. FPS/kLUT, FPS/DSP, and FPS/BRAM give the frames per second delivered per unit of each resource. The higher these values, the higher the utilization efficiency of those resources (see Table 10).

Table 10. Performance comparison with other FPGA implementations.

               [38]              [39]      [41]           [40]        Ours      Ours
Device         ZYNQ ZU9EG        ZYNQ7020  Virtex VX485T  US XCKU040  ZYNQ7020  ZYNQ7020
Dataset        Pedestrian signs  COCO dataset (remaining columns)
Quant.         8                 16        18             16          16        8
Freq. (MHz)    -                 100       200            143         100       100
DSPs           -                 120       2304           832         208       208
LUTs           -                 26 K      49 K           139 K       27.5 K    33.4 K
BRAMs          -                 93        70             384         120       120
Exec. (ms)     9.6               532.0     -              24.4        140       68
FPS            104               1.9       -              32          7.1       14.7
FPS/kLUT       -                 0.07      -              0.23        0.26      0.44
FPS/DSP        -                 0.016     -              0.038       0.034     0.068
FPS/BRAM       -                 0.020     -              0.          0.        0.

The implementation in [39] is the only previous implementation with a Zynq 7020 SoC FPGA. This device has significantly fewer resources than the devices used in the other works. Our architecture implemented in the same device was 3.7X and 7.4X faster, depend.