Overcoming the Challenges of Porting OpenCV to TI's Embedded ARM + DSP Platforms

2012 ◽  
Vol 49 (3) ◽  
pp. 260-274 ◽  
Author(s):  
Joseph Coombs ◽  
Rahul Prabhu ◽  
Greg Peake

The growing performance and decreasing price of embedded processors are opening many doors, for both developers in the industry and in academia. However, the complexities of these systems can create serious developmental bottlenecks. Sophisticated software packages such as OpenCV can assist in both the functional development and educational aspects of these otherwise complex applications; such tools lend themselves very well to use by the academic community, in particular in providing examples of algorithm implementation. However the task of migrating this software to embedded platforms poses its own challenges. This paper will review how to mitigate some of these issues, including C++ implementation, memory constraints, floating-point support, and opportunities to maximise performance using vendor-optimised libraries and integrated accelerators or co-processors. Finally, we will introduce a new effort by Texas Instruments to optimise vision systems by running OpenCV on the C6000™ digital signal processor architecture. Benchmarks will show the advantage of using the DSP by comparing the performance of a DSP+ARM® system-on-chip (SoC) processor against an ARM-only device.

2021 ◽  
Vol 27 (3) ◽  
pp. 57-70
Author(s):  
Damjan M. Rakanovic ◽  
Vuk Vranjkovic ◽  
Rastislav J. R. Struharik

Paper proposes a two-step Convolutional Neural Network (CNN) pruning algorithm and resource-efficient Field-programmable gate array (FPGA) CNN accelerator named “Argus”. The proposed CNN pruning algorithm first combines similar kernels into clusters, which are then pruned using the same regular pruning pattern. The pruning algorithm is carefully tailored for FPGAs, considering their resource characteristics. Regular sparsity results in high Multiply-accumulate (MAC) efficiency, reducing the amount of logic required to balance workloads among different MAC units. As a result, the Argus accelerator requires about 170 Look-up tables (LUTs) per Digital Signal Processor (DSP) block. This number is close to the average LUT/DPS ratio for various FPGA families, enabling balanced resource utilization when implementing Argus. Benchmarks conducted using Xilinx Zynq Ultrascale + Multi-Processor System-on-Chip (MPSoC) indicate that Argus is achieving up to 25 times higher frames per second than NullHop, 2 and 2.5 times higher than NEURAghe and Snowflake, respectively, and 2 times higher than NVDLA. Argus shows comparable performance to MIT’s Eyeriss v2 and Caffeine, requiring up to 3 times less memory bandwidth and utilizing 4 times fewer DSP blocks, respectively. Besides the absolute performance, Argus has at least 1.3 and 2 times better GOP/s/DSP and GOP/s/Block-RAM (BRAM) ratios, while being competitive in terms of GOP/s/LUT, compared to some of the state-of-the-art solutions.


2014 ◽  
Vol 668-669 ◽  
pp. 857-861
Author(s):  
Peng Fei Hu ◽  
Yu Xiang Yuan ◽  
Zhi Juan Qu ◽  
Xue Ping Jiang

To improve the reliability and integration of relay protection devices in power, the system on chip design for multi-principle of relay protection on FPGA is proposed. The data acquisition, digital signal processing, hardware protection algorithm, FPGA and MCU process scheduling, MCU and peripheral devices communication are designed, the hardware compilation model is set up by QuartusII on FPGA, and the simulation and experimental verification are performed. The results show that the proposed system can improve the speed of hardware protection and reduce the volume of the device, and has reconstruction on architecture.


IEEE Micro ◽  
1988 ◽  
Vol 8 (6) ◽  
pp. 30-48 ◽  
Author(s):  
M.L. Fuccio ◽  
R.N. Gadenz ◽  
C.J. Garen ◽  
J.M. Huser ◽  
B. Ng ◽  
...  

Sign in / Sign up

Export Citation Format

Share Document