Efficient Memory Organization for DNN Hardware Accelerator Implementation on PSoC

Electronics ◽  
2021 ◽  
Vol 10 (1) ◽  
pp. 94
Author(s):  
Antonio Rios-Navarro ◽  
Daniel Gutierrez-Galan ◽  
Juan Pedro Dominguez-Morales ◽  
Enrique Piñero-Fuentes ◽  
Lourdes Duran-Lopez ◽  
...  

The use of deep learning solutions in different disciplines is increasing, and their algorithms are computationally expensive in most cases. For this reason, numerous hardware accelerators have appeared to compute their operations efficiently in parallel, achieving higher performance and lower latency. These algorithms need large amounts of data to feed each of their computing layers, which makes it necessary to efficiently handle the data transfers that feed and collect the information to and from the accelerators. Hybrid devices are widely used to implement these accelerators; they combine an embedded computer, which can run an operating system, with a field-programmable gate array (FPGA), where the accelerator can be deployed. In this work, we present a software API that organizes the memory efficiently and avoids moving data from one memory area to another, which improves on the native Linux driver with an 85% speed-up and reduces the frame computing time by 28% in a real application.
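
The idea of avoiding copies between memory areas can be illustrated with a minimal sketch. It is not the authors' API: the class `BufferPool` and its methods are hypothetical names, and an ordinary `bytearray` stands in for the physically contiguous, DMA-capable memory a real PSoC driver would manage. The sketch only shows that slices of one preallocated region can be handed to each layer as views, so writing a layer's data never reallocates or copies it.

```python
class BufferPool:
    """Hypothetical arena: one contiguous allocation, sliced into zero-copy views."""

    def __init__(self, size):
        # A bytearray stands in for DMA-capable memory (assumption for illustration).
        self._mem = memoryview(bytearray(size))
        self._offset = 0

    def alloc(self, nbytes):
        """Return a view onto the next nbytes of the pool; no data is copied."""
        if self._offset + nbytes > len(self._mem):
            raise MemoryError("pool exhausted")
        view = self._mem[self._offset:self._offset + nbytes]
        self._offset += nbytes
        return view

    def whole(self):
        """View of the entire region, as a transfer engine would see it."""
        return self._mem


pool = BufferPool(16)
layer_in = pool.alloc(8)      # e.g. input feature map of one layer
layer_out = pool.alloc(8)     # e.g. output feature map of the same layer
layer_in[0] = 7               # written in place, directly into the pool
layer_out[:] = bytes(range(8))
```

Because both views share the pool's storage, a transfer that reads `pool.whole()` sees the data exactly where the application wrote it.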

2020 ◽  
Vol 38 (5A) ◽  
pp. 690-697
Author(s):  
Ahmed Y. Yaseen ◽  
Afaneen A. Abbood

The ability to rapidly execute power flow (PF) calculations gives engineers greater confidence in the reliability, security, and economical operation of their system in the event of planned or unplanned equipment failures. The purpose of this work is to investigate the use of FPGA characteristics to speed up power flow computing time for the on-line monitoring system of a power system. The work comprises the development of a power flow program using the fast-decoupled method, based on an FPGA (field-programmable gate array) and LabVIEW (a graphical programming environment). The program delivered very satisfactory results when solving a 30-bus test system. These findings suggest that, in general, the differences between the proposed work and the conventional fast-decoupled method are acceptable. As for the execution time, the proposed method is faster because the FPGA computes solutions in parallel. In addition, the combination of the FPGA and the LabVIEW program provided an effective monitoring system for observing the power system.


2003 ◽  
Vol 15 (03) ◽  
pp. 109-114
Author(s):  
YANG-YAO NIU ◽  
SHOU-CHENG TCHENG

In this study, parallel computing technology is applied to the simulation of aortic blood flow problems. A third-order upwind flux extrapolation with a dual-time integration method based on an artificial compressibility solver is used to solve the Navier-Stokes equations. The original FORTRAN code is converted to MPI code and tested on a 64-CPU IBM SP2 parallel computer and a 32-node PC cluster. The test results show a significant reduction in the computing time needed to run the model, and a super-linear speed-up is achieved for up to 32 CPUs on the PC cluster. The speed-up is as high as 49 when using 64 processors of the IBM SP2. The tests show the promising potential of parallel processing to provide prompt simulation of the current aortic flow problems.
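
The speed-up figures quoted above can be related to parallel efficiency with a few lines of arithmetic. The sketch below uses the standard definitions (speed-up is serial time divided by parallel time, efficiency is speed-up divided by processor count, and a speed-up is super-linear when it exceeds the processor count). The function names are illustrative, and the timings in the second example are hypothetical values, not measurements from the study.

```python
def speedup(t_serial, t_parallel):
    """Speed-up S = T_1 / T_p."""
    return t_serial / t_parallel


def efficiency(s, processors):
    """Parallel efficiency E = S / p."""
    return s / processors


def is_superlinear(s, processors):
    """A speed-up is super-linear when it exceeds the processor count."""
    return s > processors


# Figure from the abstract: speed-up of 49 on 64 IBM SP2 processors.
sp2_efficiency = efficiency(49, 64)

# Hypothetical timings (not from the study): 100 s serial, 2.5 s on 32 CPUs.
cluster_speedup = speedup(100.0, 2.5)
```

Under these definitions, a speed-up of 49 on 64 processors corresponds to an efficiency of roughly 77%, which is below the super-linear regime reported for the PC cluster.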


2015 ◽  
Vol 2015 ◽  
pp. 1-15 ◽  
Author(s):  
Luis Andres Cardona ◽  
Carles Ferrer

The Internal Configuration Access Port (ICAP) is the core component of any dynamic partial reconfigurable system implemented in Xilinx SRAM-based Field Programmable Gate Arrays (FPGAs). We developed a new high-speed ICAP controller, named AC_ICAP, completely implemented in hardware. Like similar solutions, AC_ICAP accelerates the management of partial bitstreams and frames; in addition, it supports run-time reconfiguration of LUTs without requiring precomputed partial bitstreams. This last characteristic was made possible by reverse engineering the bitstream. We also adapted this hardware-based solution to provide IP cores accessible from the MicroBlaze processor. To this end, the controller was extended and three versions were implemented to evaluate its performance when connected to the Peripheral Local Bus (PLB), Fast Simplex Link (FSL), and AXI interfaces of the processor. As a consequence, the controller can exploit the flexibility that the processor offers while taking advantage of the hardware speed-up. It was implemented in both Virtex-5 and Kintex-7 FPGAs. Reconfiguration time results showed that run-time reconfiguration of single LUTs in Virtex-5 devices was performed in less than 5 μs, which implies a speed-up of more than 380x compared to the Xilinx XPS_HWICAP controller.


Electronics ◽  
2019 ◽  
Vol 8 (6) ◽  
pp. 641 ◽  
Author(s):  
Miguel Rivera-Acosta ◽  
Susana Ortega-Cisneros ◽  
Jorge Rivera

This paper presents a platform that automatically generates custom hardware accelerators for convolutional neural networks (CNNs) implemented in field-programmable gate array (FPGA) devices. It includes a user interface for configuring and managing these accelerators. The platform can perform all the processes necessary to design and test CNN accelerators from the CNN architecture description at both layer and internal parameter levels, training the desired architecture with any dataset and generating the configuration files required by the platform. With these files, it can synthesize the register-transfer level (RTL) and program the customized CNN accelerator into the FPGA device for testing, making it possible to generate custom CNN accelerators quickly and easily. All processes except the CNN architecture description are fully automated and carried out by the platform, which manages third-party software to train the CNN and to synthesize and program the generated RTL. The platform has been tested by implementing some state-of-the-art CNN architectures for freely available datasets such as MNIST, CIFAR-10, and STL-10.


Electronics ◽  
2020 ◽  
Vol 9 (3) ◽  
pp. 449
Author(s):  
Mohammad Amir Mansoori ◽  
Mario R. Casu

Principal Component Analysis (PCA) is a technique for dimensionality reduction that is useful in removing redundant information in data for various applications such as Microwave Imaging (MI) and Hyperspectral Imaging (HI). The computational complexity of PCA has made the hardware acceleration of PCA an active research topic in recent years. Although the hardware design flow can be optimized using High Level Synthesis (HLS) tools, efficient high-performance solutions for complex embedded systems still require careful design. In this paper we propose a flexible PCA hardware accelerator in Field-Programmable Gate Arrays (FPGA) that we designed entirely in HLS. In order to make the internal PCA computations more efficient, a new block-streaming method is also introduced. Several HLS optimization strategies are adopted to create efficient hardware. The flexibility of our design allows us to use it for different FPGA targets, with flexible input data dimensions, and it also lets us easily switch from a more accurate floating-point implementation to a higher-speed fixed-point solution. The results show the efficiency of our design compared to state-of-the-art implementations on GPUs, many-core CPUs, and other FPGA approaches in terms of resource usage, execution time and power consumption.
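
The abstract does not describe the block-streaming method in detail, so the following is only a sketch of the general idea of processing PCA input in blocks, under the assumption that the covariance matrix (the first step of PCA) is accumulated from running sums. The function name `covariance_streamed` and the parameter `block_size` are hypothetical. The point of the sketch is that the full data set never has to be resident at once: each block updates the running sums and can then be discarded.

```python
def covariance_streamed(rows, block_size):
    """Sample covariance accumulated block by block from running sums."""
    d = len(rows[0])
    n = 0
    col_sum = [0.0] * d                       # running sum of each column
    outer_sum = [[0.0] * d for _ in range(d)]  # running sum of row outer products

    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]  # only this block is needed now
        for row in block:
            n += 1
            for i in range(d):
                col_sum[i] += row[i]
                for j in range(d):
                    outer_sum[i][j] += row[i] * row[j]

    # cov[i][j] = (sum(x_i * x_j) - sum(x_i) * sum(x_j) / n) / (n - 1)
    return [[(outer_sum[i][j] - col_sum[i] * col_sum[j] / n) / (n - 1)
             for j in range(d)] for i in range(d)]


data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
cov = covariance_streamed(data, block_size=2)
```

The result does not depend on `block_size`, which is what lets a hardware design choose the block size to match its on-chip memory. The principal components would then be obtained from an eigendecomposition of `cov`, which this sketch leaves out.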


VLSI Design ◽  
2012 ◽  
Vol 2012 ◽  
pp. 1-11
Author(s):  
M. Walton ◽  
O. Ahmed ◽  
G. Grewal ◽  
S. Areibi

Scatter Search is an effective and established population-based metaheuristic that has been used to solve a variety of hard optimization problems. However, the time required to find high-quality solutions can become prohibitive as problem sizes grow. In this paper, we present a hardware implementation of Scatter Search on a field-programmable gate array (FPGA). Our objective is to improve the run time of Scatter Search by exploiting the potentially massive performance benefits that are available through the native parallelism in hardware. When implementing Scatter Search we employ two different high-level languages (HLLs): Handel-C and Impulse-C. Our empirical results show that by effectively exploiting source-code optimizations, data parallelism, and pipelining, a 28x speed up over software can be achieved.


Electronics ◽  
2019 ◽  
Vol 8 (4) ◽  
pp. 413 ◽  
Author(s):  
Tuy Nguyen Tan ◽  
Hanho Lee

This paper presents a novel architecture for ring learning with errors (ring-LWE) cryptoprocessors using an efficient approach in encryption and decryption operations. By scheduling multipliers to work in parallel, the encryption and decryption times are significantly reduced. In addition, polynomial multiplications are conducted using radix-2 and radix-8 multiple delay feedback (MDF) architecture-based number theoretic transform (NTT) multipliers to speed up the multiplication operation. To reduce the hardware complexity of an NTT multiplier, three bit-reverse operations during the NTT and inverse NTT (INTT) processes are removed. Polynomial additions in the ring-LWE encryption phase are also arranged to work simultaneously to reduce the latency. As a result, the proposed efficient-scheduling parallel multiplier-based ring-LWE cryptoprocessors can achieve higher throughput and efficiency compared with existing architectures. The proposed ring-LWE cryptoprocessors are synthesized and verified using Xilinx VIVADO on a Virtex-7 field programmable gate array (FPGA) board. With security parameters n = 512 and q = 12,289, the proposed cryptoprocessors using radix-2 single-path delay feedback (SDF), radix-2 MDF, and radix-8 MDF multipliers perform encryption in 4.58 μs, 1.97 μs, and 0.89 μs, and decryption in 4.35 μs, 1.82 μs, and 0.71 μs, respectively. A comparison of the obtained throughput and efficiency with those of previous studies shows that the proposed cryptoprocessors achieve better performance.
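
To make the role of the NTT concrete, here is a software sketch of NTT-based polynomial multiplication modulo x^n + 1 with the parameters quoted above (n = 512, q = 12,289). It is not the paper's MDF hardware architecture: it uses a plain recursive radix-2 transform, and the function names are illustrative. The generator search relies on the fact that q - 1 = 12288 = 2^12 · 3, so only the prime factors 2 and 3 are checked. It assumes Python 3.8 or later for modular inverses via `pow(x, -1, q)`.

```python
Q = 12289  # modulus q from the abstract
N = 512    # polynomial length n from the abstract


def find_generator(q=Q):
    """Find a generator of Z_q^*; valid only because q - 1 = 2^12 * 3."""
    for g in range(2, q):
        if pow(g, (q - 1) // 2, q) != 1 and pow(g, (q - 1) // 3, q) != 1:
            return g
    raise ValueError("no generator found")


def ntt(a, omega, q=Q):
    """Recursive radix-2 NTT; len(a) is a power of two, omega a primitive len(a)-th root."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], omega * omega % q, q)
    odd = ntt(a[1::2], omega * omega % q, q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % q
        out[k] = (even[k] + t) % q
        out[k + n // 2] = (even[k] - t) % q
        w = w * omega % q
    return out


def negacyclic_mul(a, b, q=Q):
    """Multiply a and b modulo (x^n + 1, q) using the NTT."""
    n = len(a)
    psi = pow(find_generator(q), (q - 1) // (2 * n), q)  # primitive 2n-th root of unity
    omega = psi * psi % q                                # primitive n-th root of unity
    psi_pow = [pow(psi, i, q) for i in range(n)]
    a_hat = ntt([x * p % q for x, p in zip(a, psi_pow)], omega, q)
    b_hat = ntt([x * p % q for x, p in zip(b, psi_pow)], omega, q)
    c_hat = [x * y % q for x, y in zip(a_hat, b_hat)]    # pointwise product
    c = ntt(c_hat, pow(omega, -1, q), q)                 # inverse transform, scaled by n
    n_inv, psi_inv = pow(n, -1, q), pow(psi, -1, q)
    return [x * n_inv % q * pow(psi_inv, i, q) % q for i, x in enumerate(c)]


def schoolbook_negacyclic(a, b, q=Q):
    """Reference O(n^2) multiplication modulo (x^n + 1, q)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c
```

The transform replaces the quadratic schoolbook product with pointwise multiplications between two forward transforms and one inverse transform, which is the operation the multipliers in the abstract accelerate in hardware.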


Author(s):  
M. Y. Zhang ◽  
H. Zhang ◽  
L. L. Zheng

The smoothed particle hydrodynamics (SPH) method, one of the meshfree methods, is developed to simulate the interaction between multiple droplets and a substrate with solidification. However, the SPH method needs a large amount of computing time for this complicated problem, since it has to use a large number of SPH particles to represent the droplets and the substrate. All-pair search and linked-list algorithms are popular for the neighbor search, which is the most time-consuming part of the SPH calculation. Both algorithms are tested in this paper. For the solidification process, since the volume of the melt decreases continuously, a new method is proposed to speed up the SPH calculation. The new treatment is used to handle the particles near the free surface and near the solidification interface. Multiple droplets impinging on a smooth substrate in 2D and 3D are simulated to demonstrate the capability of the current numerical method to simulate the spreading and solidification of multiple droplets.
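
The two neighbor-search strategies mentioned above can be compared in a short 2D sketch. This is a generic illustration, not the authors' code: the function names are hypothetical, and a dictionary of per-cell lists stands in for the linked lists of the classic cell-based scheme. The all-pair search tests every pair of particles, while the cell-based search bins particles into cells of side h and only tests particles in the same or adjacent cells.

```python
from collections import defaultdict
from itertools import combinations


def all_pair_neighbors(points, h):
    """O(n^2) search: test every pair against the support radius h."""
    h2 = h * h
    return {(i, j) for i, j in combinations(range(len(points)), 2)
            if (points[i][0] - points[j][0]) ** 2
            + (points[i][1] - points[j][1]) ** 2 <= h2}


def cell_list_neighbors(points, h):
    """Cell-based search: bin particles into h-sized cells, test adjacent cells only."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // h), int(y // h))].append(idx)

    h2 = h * h
    pairs = set()
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get() avoids creating empty cells while iterating
                for j in cells.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j:
                            d2 = ((points[i][0] - points[j][0]) ** 2
                                  + (points[i][1] - points[j][1]) ** 2)
                            if d2 <= h2:
                                pairs.add((i, j))
    return pairs
```

Both functions return the same set of index pairs; the cell-based version does far fewer distance tests when particles are spread over many cells.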


Author(s):  
Sarmad Ismael ◽  
Omar Tareq ◽  
Yahya Taher Qassim

Line plotting is one of the basic operations in scan conversion. Bresenham's line drawing algorithm is an efficient and highly popular algorithm used for this purpose. The algorithm proceeds from one end-point of the line to the other, calculating one point at each step. As a result, the calculation time for all the points depends on the length of the line and therefore on the total number of points. In this paper, we developed an approach to speed up the Bresenham algorithm by partitioning each line into a number of segments, finding the points that belong to those segments, and drawing them simultaneously to form the main line. As a result, the more segments are generated, the faster the points are calculated. By employing 32 cores in the Field Programmable Gate Array, a line of 992 points is formed in only 0.31 μs. The complete system is implemented using a Zybo board that contains the Xilinx Zynq-7000 chip (Z-7010).
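
The segment-partitioning idea can be sketched in software. The abstract does not say how each segment obtains its starting state, so this sketch makes an assumption: the starting y offset and decision variable of a segment are computed in closed form from its start index, which lets every segment run the ordinary Bresenham loop independently. The function names are illustrative, only the first octant (0 <= dy <= dx) is handled, and the segments are evaluated one after another here, whereas the FPGA design evaluates them on separate cores.

```python
def bresenham(x0, y0, x1, y1):
    """Classic integer Bresenham for the first octant (0 <= dy <= dx, dx > 0)."""
    dx, dy = x1 - x0, y1 - y0
    assert dx > 0 and 0 <= dy <= dx
    err, y, pts = 2 * dy - dx, y0, []
    for x in range(x0, x1 + 1):
        pts.append((x, y))
        if err > 0:
            y += 1
            err -= 2 * dx
        err += 2 * dy
    return pts


def segment_points(x0, y0, dx, dy, start, count):
    """Points start .. start+count-1 of the line, without computing earlier ones."""
    k = (2 * dy * start + dx - 1) // (2 * dx)     # y steps taken before `start`
    err = 2 * dy * (start + 1) - dx - 2 * dx * k  # decision variable at `start`
    y, pts = y0 + k, []
    for x in range(x0 + start, x0 + start + count):
        pts.append((x, y))
        if err > 0:
            y += 1
            err -= 2 * dx
        err += 2 * dy
    return pts


def segmented_bresenham(x0, y0, x1, y1, segments):
    """Split the line into independent segments and concatenate their points."""
    dx, dy = x1 - x0, y1 - y0
    assert dx > 0 and 0 <= dy <= dx
    total = dx + 1
    size = -(-total // segments)  # ceiling division: points per segment
    pts = []
    for start in range(0, total, size):  # each iteration is independent of the others
        pts.extend(segment_points(x0, y0, dx, dy, start, min(size, total - start)))
    return pts


# The configuration from the abstract: 992 points split across 32 segments,
# which gives 31 points per segment. The end-point (991, 300) is a made-up example.
line = segmented_bresenham(0, 0, 991, 300, 32)
```

Since no segment depends on the output of another, the loop in `segmented_bresenham` is the part that maps onto parallel cores.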

