Efficient Memory Organization for DNN Hardware Accelerator Implementation on PSoC

Antonio Rios-Navarro; Daniel Gutierrez-Galan; Juan Pedro Dominguez-Morales; Enrique Piñero-Fuentes; Lourdes Duran-Lopez; Ricardo Tapiador-Morales; Manuel Jesús Dominguez-Morales

doi:10.3390/electronics10010094

Electronics ◽

10.3390/electronics10010094 ◽

2021 ◽

Vol 10 (1) ◽

pp. 94

Author(s):

Antonio Rios-Navarro ◽

Daniel Gutierrez-Galan ◽

Juan Pedro Dominguez-Morales ◽

Enrique Piñero-Fuentes ◽

Lourdes Duran-Lopez ◽

...

Keyword(s):

Computing Time ◽

Hardware Accelerators ◽

Hardware Accelerator ◽

Embedded Computer ◽

Field Programmable ◽

Speed Up ◽

Memory Area ◽

Data Transfers ◽

Hybrid Devices ◽

Efficient Memory

The use of deep learning solutions in different disciplines is increasing, and their algorithms are computationally expensive in most cases. For this reason, numerous hardware accelerators have appeared to compute their operations efficiently in parallel, achieving higher performance and lower latency. These algorithms need large amounts of data to feed each of their computing layers, which makes it necessary to efficiently handle the data transfers that feed and collect the information to and from the accelerators. Hybrid devices are widely used to implement these accelerators; they combine an embedded computer, where an operating system can be run, with a field-programmable gate array (FPGA), where the accelerator can be deployed. In this work, we present a software API that organizes the memory efficiently, avoiding the reallocation of data from one memory area to another. It improves on the native Linux driver with an 85% speed-up and reduces the frame computing time by 28% in a real application.

Study of Power System Load Flow Using FPGA and LabVIEW

Engineering and Technology Journal ◽

10.30684/etj.v38i5a.346 ◽

2020 ◽

Vol 38 (5A) ◽

pp. 690-697

Author(s):

Ahmed Y. Yaseen ◽

Afaneen A. Abbood

Keyword(s):

Power System ◽

Monitoring System ◽

Power Flow ◽

Computing Time ◽

Test System ◽

Load Flow ◽

Field Programmable ◽

Speed Up ◽

On Line ◽

Decoupled Method

The capability to rapidly execute power flow (PF) calculations gives engineers greater confidence in the dependability, protection, and economical operation of their system in the case of planned or unplanned equipment failures. The purpose of this work is to investigate the use of FPGA characteristics to speed up the power flow computing time for the on-line monitoring system of a power system. The work comprises the development of a power flow program using the fast-decoupled method, based on an FPGA (field-programmable gate array) and LabVIEW (a graphical programming environment). The program delivered very satisfactory results when solving a 30-bus test system. These findings suggest that, in general, the results of the proposed work agree satisfactorily with those of the conventional fast-decoupled method. As for the execution time, the proposed method is faster because the FPGA computes the solution in parallel. In addition, the combination of the FPGA and the LabVIEW program provided an effective monitoring system for observing the power system.

COMPUTATIONS OF PULSATILE AORTIC BLOOD FLOW PROBLEMS ON PARALLEL COMPUTERS

Biomedical Engineering Applications Basis and Communications ◽

10.4015/s1016237203000171 ◽

2003 ◽

Vol 15 (03) ◽

pp. 109-114

Author(s):

YANG-YAO NIU ◽

SHOU-CHENG TCHENG

Keyword(s):

Blood Flow ◽

Stokes Equations ◽

Computing Time ◽

Time Integration ◽

Parallel Computer ◽

Aortic Blood Flow ◽

Pc Cluster ◽

Flow Problems ◽

Aortic Blood ◽

Speed Up

In this study, parallel computing technology is applied to the simulation of aortic blood flow problems. A third-order upwind flux extrapolation with a dual-time integration method based on an artificial compressibility solver is used to solve the Navier-Stokes equations. The original FORTRAN code is converted to MPI code and tested on a 64-CPU IBM SP2 parallel computer and a 32-node PC cluster. The test results show a significant reduction in the computing time needed to run the model, and a super-linear speed-up rate is achieved with up to 32 CPUs on the PC cluster. The speed-up rate is as high as 49 when using 64 processors on the IBM SP2. The tests show the very promising potential of parallel processing to provide prompt simulation of the current aortic flow problems.
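
To make the speed-up figures above concrete, the following minimal sketch applies the usual definitions: speed-up is serial time divided by parallel time, parallel efficiency is speed-up divided by processor count, and a speed-up is super-linear when it exceeds the processor count. The function names are illustrative and do not come from the paper; only the values 49 and 64 are taken from the abstract.

```python
def speedup(t_serial, t_parallel):
    """Speed-up: how many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def efficiency(speedup_value, processors):
    """Parallel efficiency: fraction of the ideal (linear) speed-up achieved."""
    return speedup_value / processors


def is_superlinear(speedup_value, processors):
    """A speed-up is super-linear when it exceeds the number of processors."""
    return speedup_value > processors


# Figures quoted in the abstract: speed-up of 49 on 64 IBM SP2 processors.
sp2_efficiency = efficiency(49, 64)
print(f"IBM SP2 efficiency: {sp2_efficiency:.3f}")          # 0.766
print(f"Super-linear on SP2: {is_superlinear(49, 64)}")     # False
```

By this measure, the reported IBM SP2 result corresponds to an efficiency of roughly 77%, while the super-linear result on the PC cluster implies an efficiency above 100%.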

AC_ICAP: A Flexible High Speed ICAP Controller

International Journal of Reconfigurable Computing ◽

10.1155/2015/314358 ◽

2015 ◽

Vol 2015 ◽

pp. 1-15 ◽

Cited By ~ 6

Author(s):

Luis Andres Cardona ◽

Carles Ferrer

Keyword(s):

High Speed ◽

Access Port ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Speed Up ◽

Run Time ◽

Ip Cores ◽

Automatic Tool for Fast Generation of Custom Convolutional Neural Networks Accelerators for FPGA

Electronics ◽

10.3390/electronics8060641 ◽

2019 ◽

Vol 8 (6) ◽

pp. 641 ◽

Cited By ~ 7

Author(s):

Miguel Rivera-Acosta ◽

Susana Ortega-Cisneros ◽

Jorge Rivera

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

State Of The Art ◽

Third Party ◽

Hardware Accelerators ◽

Field Programmable ◽

Custom Hardware ◽

Architecture Description ◽

Automatic Tool

This paper presents a platform that automatically generates custom hardware accelerators for convolutional neural networks (CNNs) implemented in field-programmable gate array (FPGA) devices. It includes a user interface for configuring and managing these accelerators. The platform can perform all the processes necessary to design and test CNN accelerators from the CNN architecture description at both layer and internal parameter levels, training the desired architecture with any dataset and generating the configuration files required by the platform. With these files, it can synthesize the register-transfer level (RTL) and program the customized CNN accelerator into the FPGA device for testing, making it possible to generate custom CNN accelerators quickly and easily. All processes except the CNN architecture description are fully automated and carried out by the platform, which manages third-party software to train the CNN and to synthesize and program the generated RTL. The platform has been tested with the implementation of some of the CNN architectures found in the state of the art for freely available datasets such as MNIST, CIFAR-10, and STL-10.

High Level Design of a Flexible PCA Hardware Accelerator Using a New Block-Streaming Method

Electronics ◽

10.3390/electronics9030449 ◽

2020 ◽

Vol 9 (3) ◽

pp. 449

Author(s):

Mohammad Amir Mansoori ◽

Mario R. Casu

Keyword(s):

High Performance ◽

Principal Component ◽

Hardware Acceleration ◽

Design Flow ◽

Hardware Accelerator ◽

Field Programmable ◽

Point Solution ◽

Active Research ◽

High Level ◽

Many Core

Principal Component Analysis (PCA) is a technique for dimensionality reduction that is useful in removing redundant information in data for various applications such as Microwave Imaging (MI) and Hyperspectral Imaging (HI). The computational complexity of PCA has made the hardware acceleration of PCA an active research topic in recent years. Although the hardware design flow can be optimized using High Level Synthesis (HLS) tools, efficient high-performance solutions for complex embedded systems still require careful design. In this paper, we propose a flexible PCA hardware accelerator in Field-Programmable Gate Arrays (FPGA) that we designed entirely in HLS. In order to make the internal PCA computations more efficient, a new block-streaming method is also introduced. Several HLS optimization strategies are adopted to create efficient hardware. The flexibility of our design allows us to use it for different FPGA targets, with flexible input data dimensions, and it also lets us easily switch from a more accurate floating-point implementation to a higher-speed fixed-point solution. The results show the efficiency of our design compared to state-of-the-art implementations on GPUs, many-core CPUs, and other FPGA approaches in terms of resource usage, execution time, and power consumption.

An Empirical Investigation on System and Statement Level Parallelism Strategies for Accelerating Scatter Search Using Handel-C and Impulse-C

VLSI Design ◽

10.1155/2012/793196 ◽

2012 ◽

Vol 2012 ◽

pp. 1-11

Author(s):

M. Walton ◽

O. Ahmed ◽

G. Grewal ◽

S. Areibi

Keyword(s):

Optimization Problems ◽

Scatter Search ◽

Population Based ◽

Field Programmable ◽

Speed Up ◽

Time Required ◽

High Level ◽

Level Parallelism ◽

Code Optimizations ◽

Established Population

Scatter Search is an effective and established population-based metaheuristic that has been used to solve a variety of hard optimization problems. However, the time required to find high-quality solutions can become prohibitive as problem sizes grow. In this paper, we present a hardware implementation of Scatter Search on a field-programmable gate array (FPGA). Our objective is to improve the run time of Scatter Search by exploiting the potentially massive performance benefits that are available through the native parallelism in hardware. When implementing Scatter Search we employ two different high-level languages (HLLs): Handel-C and Impulse-C. Our empirical results show that by effectively exploiting source-code optimizations, data parallelism, and pipelining, a 28x speed up over software can be achieved.

Efficient-Scheduling Parallel Multiplier-Based Ring-LWE Cryptoprocessors

Electronics ◽

10.3390/electronics8040413 ◽

2019 ◽

Vol 8 (4) ◽

pp. 413 ◽

Cited By ~ 2

Author(s):

Tuy Nguyen Tan ◽

Hanho Lee

Keyword(s):

Field Programmable Gate Array ◽

Path Delay ◽

Hardware Complexity ◽

Parallel Multiplier ◽

Multiplication Operation ◽

Encryption And Decryption ◽

Field Programmable ◽

Speed Up ◽

Learning With Errors ◽

Single Path

This paper presents a novel architecture for ring learning with errors (LWE) cryptoprocessors using an efficient approach in encryption and decryption operations. By scheduling multipliers to work in parallel, the encryption and decryption times are significantly reduced. In addition, polynomial multiplications are conducted using radix-2 and radix-8 multiple delay feedback (MDF) architecture-based number theoretic transform (NTT) multipliers to speed up the multiplication operation. To reduce the hardware complexity of an NTT multiplier, three bit-reverse operations during the NTT and inverse NTT (INTT) processes are removed. Polynomial additions in the ring-LWE encryption phase are also arranged to work simultaneously to reduce the latency. As a result, the proposed efficient-scheduling parallel multiplier-based ring-LWE cryptoprocessors can achieve higher throughput and efficiency compared with existing architectures. The proposed ring-LWE cryptoprocessors are synthesized and verified using Xilinx VIVADO on a Virtex-7 field programmable gate array (FPGA) board. With security parameters n = 512 and q = 12,289, the proposed cryptoprocessors using radix-2 single-path delay feedback (SDF), radix-2 MDF, and radix-8 MDF multipliers perform encryption in 4.58 μs, 1.97 μs, and 0.89 μs, and decryption in 4.35 μs, 1.82 μs, and 0.71 μs, respectively. A comparison of the obtained throughput and efficiency with those of previous studies shows that the proposed cryptoprocessors achieve better performance.
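
As background for the NTT multipliers mentioned above, the sketch below shows a plain software reference for NTT-based polynomial multiplication in the ring Z_q[x]/(x^n + 1), using the parameters n = 512 and q = 12,289 from the abstract. It is an assumption-laden illustration, not the paper's hardware architecture: it uses a simple recursive radix-2 transform, does not model the SDF/MDF pipelines or the removed bit-reverse steps, and all function names are hypothetical.

```python
Q = 12289  # modulus q from the abstract
N = 512    # polynomial length n from the abstract


def find_psi(n, q):
    """Return an element of order 2n modulo q (n must be a power of two)."""
    for c in range(2, q):
        x = pow(c, (q - 1) // (2 * n), q)
        if pow(x, n, q) == q - 1:  # x^n = -1, so the order of x is exactly 2n
            return x
    raise ValueError("no root of order 2n exists; 2n must divide q - 1")


def ntt(a, omega, q):
    """Recursive radix-2 NTT; omega is a primitive len(a)-th root of unity."""
    n = len(a)
    if n == 1:
        return a[:]
    omega_sq = omega * omega % q
    even = ntt(a[0::2], omega_sq, q)
    odd = ntt(a[1::2], omega_sq, q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % q
        out[k] = (even[k] + t) % q           # butterfly, upper half
        out[k + n // 2] = (even[k] - t) % q  # butterfly, lower half
        w = w * omega % q
    return out


def negacyclic_multiply(a, b, q=Q):
    """Multiply two length-n polynomials modulo (x^n + 1, q) via the NTT."""
    n = len(a)
    psi = find_psi(n, q)
    omega = psi * psi % q
    psi_pows = [pow(psi, i, q) for i in range(n)]
    # Pre-scaling by powers of psi turns the cyclic product into a negacyclic one.
    a_hat = ntt([x * p % q for x, p in zip(a, psi_pows)], omega, q)
    b_hat = ntt([x * p % q for x, p in zip(b, psi_pows)], omega, q)
    c_hat = [x * y % q for x, y in zip(a_hat, b_hat)]
    # Inverse transform: NTT with omega^-1, then scale by n^-1 and psi^-i.
    c = ntt(c_hat, pow(omega, -1, q), q)
    n_inv = pow(n, -1, q)
    psi_inv = pow(psi, -1, q)
    return [x * n_inv % q * pow(psi_inv, i, q) % q for i, x in enumerate(c)]
```

The transform replaces the quadratic-cost schoolbook product with pointwise multiplications between forward and inverse transforms, which is the operation the paper's parallel multipliers accelerate. The modular inverse via `pow(x, -1, q)` requires Python 3.8 or later.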

Smoothed Particles Hydrodynamics Method for Interaction Between Multi-Droplets and Substrate

Heat Transfer, Volume 3 ◽

10.1115/imece2006-14073 ◽

2006 ◽

Author(s):

M. Y. Zhang ◽

H. Zhang ◽

L. L. Zheng

Keyword(s):

Computing Time ◽

Solidification Process ◽

Sph Method ◽

Neighbor Search ◽

Solidification Interface ◽

Particle Hydrodynamics ◽

Smoothed Particle ◽

Speed Up ◽

New Treatment ◽

2D And 3D

The smoothed particle hydrodynamics (SPH) method, one of the meshfree methods, is developed to simulate the interaction between multi-droplets and a substrate with solidification. However, the SPH method needs a large amount of computing time for this complicated problem, since it has to use a large number of SPH particles to represent the multi-droplets and the substrate. All-pair search and linked-list algorithms are popular for the neighbor search, which is the most time-consuming part of the SPH calculation. Both algorithms are tested in this paper. For the solidification process, since the volume of the melt decreases continuously, a new method is proposed to speed up the SPH calculation. The new treatment is used to handle the particles near the free surface and near the solidification interface. Multi-droplets impinging on a smooth substrate in 2D and 3D are simulated to demonstrate the capability of the current numerical method to simulate the spreading and solidification of multi-droplets.
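
The two neighbor-search strategies named above can be contrasted with a small 2D sketch. It assumes a fixed search radius h and a uniform grid with cell size h (the usual linked-list, or cell-list, arrangement); the function names and data layout are illustrative and are not taken from the paper.

```python
import math
from collections import defaultdict
from itertools import product


def all_pair_neighbors(points, h):
    """All-pair search: test every pair of particles, O(N^2) distance checks."""
    pairs = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= h:
                pairs.add((i, j))
    return pairs


def cell_list_neighbors(points, h):
    """Linked-list (cell-list) search: bin particles into cells of size h,
    then compare each particle only with particles in the 3x3 block of
    cells around its own cell."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(math.floor(x / h), math.floor(y / h))].append(idx)

    pairs = set()
    for (cx, cy), members in cells.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            # .get avoids creating empty cells while iterating over the dict
            for j in cells.get((cx + dx, cy + dy), ()):
                for i in members:
                    if i < j and math.dist(points[i], points[j]) <= h:
                        pairs.add((i, j))
    return pairs
```

Both functions return the same set of index pairs; the cell-list version simply avoids distance checks between particles that are more than one cell apart, which is where the time saving comes from when particles are spread over many cells.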

Hardware accelerator to speed up packet processing in NDN router

Computer Communications ◽

10.1016/j.comcom.2016.06.004 ◽

2016 ◽

Vol 91-92 ◽

pp. 109-119 ◽

Cited By ~ 6

Author(s):

Weiwen Yu ◽

Derek Pao

Keyword(s):

Hardware Accelerator ◽

Packet Processing ◽

Speed Up

Hardware/software co-design for a parallel three-dimensional Bresenham’s algorithm

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v9i1.pp148-156 ◽

2019 ◽

Vol 9 (1) ◽

pp. 148

Author(s):

Sarmad Ismael ◽

Omar Tareq ◽

Yahya Taher Qassim

Keyword(s):

Field Programmable Gate Array ◽

Complete System ◽

Line Drawing ◽

Three Dimensional ◽

End Point ◽

Field Programmable ◽

Speed Up ◽

The One ◽

Scan Conversion ◽

Number Of Segments

Line plotting is one of the basic operations in scan conversion. Bresenham’s line drawing algorithm is an efficient and highly popular algorithm used for this purpose. The algorithm proceeds from one end-point of the line to the other, calculating one point at each step. As a result, the calculation time for all the points depends on the length of the line and therefore on the total number of points. In this paper, we develop an approach to speed up the Bresenham algorithm by partitioning each line into a number of segments, finding the points that belong to those segments, and drawing them simultaneously to form the main line. As a result, the more segments are generated, the faster the points are calculated. By employing 32 cores in the field-programmable gate array, a line of 992 points is formed in only 0.31 μs. The complete system is implemented using a Zybo board that contains the Xilinx Zynq-7000 chip (Z-7010).
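
The segment-partitioning idea can be sketched in software as follows. This is only an illustration under stated assumptions: threads stand in for the FPGA cores, the line is walked along its longest axis, and each segment recomputes its points from a closed-form integer expression so that it does not depend on the previous segment. That closed-form restart is an assumption made for this sketch, not necessarily the scheme used in the paper, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def _setup(p0, p1):
    """Per-axis step direction, per-axis magnitude, driving axis, step count."""
    deltas = [b - a for a, b in zip(p0, p1)]
    steps = [1 if d >= 0 else -1 for d in deltas]
    mags = [abs(d) for d in deltas]
    n = max(mags)
    return steps, mags, mags.index(n), n


def bresenham3d_serial(p0, p1):
    """Serial integer line walk: one point per step, integer error accumulators."""
    steps, mags, drive, n = _setup(p0, p1)
    pos = list(p0)
    err = [n, n, n]  # start each accumulator at half a step (n out of 2n)
    points = [tuple(pos)]
    for _ in range(n):
        for axis in range(3):
            if axis == drive:
                pos[axis] += steps[axis]
                continue
            err[axis] += 2 * mags[axis]
            if err[axis] >= 2 * n:
                pos[axis] += steps[axis]
                err[axis] -= 2 * n
        points.append(tuple(pos))
    return points


def segment_points(p0, p1, k_start, k_end):
    """Points with index k_start..k_end-1, computed without earlier state."""
    steps, mags, drive, n = _setup(p0, p1)
    out = []
    for k in range(k_start, k_end):
        point = []
        for axis in range(3):
            # Same rounding rule as the serial accumulators, in closed form.
            offset = k if axis == drive else (2 * k * mags[axis] + n) // (2 * n)
            point.append(p0[axis] + steps[axis] * offset)
        out.append(tuple(point))
    return out


def bresenham3d_segmented(p0, p1, num_segments=32):
    """Split the line into segments and compute them concurrently."""
    n = max(abs(b - a) for a, b in zip(p0, p1))
    if n == 0:
        return [tuple(p0)]
    total = n + 1
    size = -(-total // num_segments)  # ceiling division
    bounds = [(s, min(s + size, total)) for s in range(0, total, size)]
    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        chunks = list(pool.map(lambda b: segment_points(p0, p1, *b), bounds))
    return [pt for chunk in chunks for pt in chunk]
```

For a 992-point line and 32 segments, each segment covers 31 points, which matches the configuration quoted in the abstract. In CPython the threads mainly illustrate the independence of the segments rather than deliver a real speed-up.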
