Design of Power-Efficient Training Accelerator for Convolution Neural Networks

JiUn Hong; Saad Arslan; TaeGeon Lee; HyungWon Kim

doi:10.3390/electronics10070787

Electronics ◽

10.3390/electronics10070787 ◽

2021 ◽

Vol 10 (7) ◽

pp. 787

Author(s):

JiUn Hong ◽

Saad Arslan ◽

TaeGeon Lee ◽

HyungWon Kim

Keyword(s):

Neural Network ◽

Neural Networks ◽

Energy Consumption ◽

Low Power ◽

Resource Sharing ◽

Graphics Processing Unit ◽

Processing Unit ◽

Real Time Processing ◽

Power Efficient ◽

Low Area

Among deep neural networks (DNNs), the convolutional neural network (CNN) is one of the most widely used models for image recognition applications. However, there is growing demand for lightweight, low-power neural network accelerators, not only for inference but also for training. In this paper, we propose a training accelerator that provides low power and compact chip size, targeted at mobile and edge computing applications. It achieves real-time processing of both inference and training using concurrent floating-point data paths. The proposed accelerator can be externally controlled and employs resource sharing and an integrated convolution-pooling block to achieve low area and low energy consumption. We implemented the proposed training accelerator in a field-programmable gate array (FPGA) and evaluated its training performance on an MNIST CNN example against a PC with a graphics processing unit (GPU). While both methods achieved a similar training accuracy of 95.1%, the proposed accelerator, when implemented in a silicon chip, reduced energy consumption by 480 times compared to the PC with GPU. Additionally, when implemented on an FPGA, it achieved an energy reduction of over 4.5 times compared to the existing FPGA training accelerator for the MNIST dataset. Therefore, the proposed accelerator is more suitable for deployment in mobile/edge nodes than existing software and hardware accelerators.

Optimized Compression for Implementing Convolutional Neural Networks on FPGA

Electronics ◽

10.3390/electronics8030295 ◽

2019 ◽

Vol 8 (3) ◽

pp. 295 ◽

Cited By ~ 15

Author(s):

Min Zhang ◽

Linpeng Li ◽

Hai Wang ◽

Yan Liu ◽

Hongbo Qin ◽

...

Keyword(s):

Neural Network ◽

Neural Networks ◽

Short Term Memory ◽

Graphics Processing Unit ◽

Processing Unit ◽

Central Processing ◽

Pruning Strategy ◽

Field Programmable ◽

Storage Format ◽

Evaluation Board

The field-programmable gate array (FPGA) is widely considered a promising platform for convolutional neural network (CNN) acceleration. However, the large number of parameters in CNNs imposes heavy computing and memory burdens on FPGA-based CNN implementations. To solve this problem, this paper proposes an optimized compression strategy and realizes an FPGA-based accelerator for CNNs. Firstly, a reversed-pruning strategy is proposed that reduces the number of parameters of AlexNet by a factor of 13× without accuracy loss on the ImageNet dataset. Peak-pruning is further introduced to achieve better compressibility. Moreover, quantization gives another 4× reduction with negligible loss of accuracy. Secondly, efficient storage techniques are presented for the convolutional layer and the fully connected layer, each aiming to reduce the overall cache overhead. Finally, the effectiveness of the proposed strategy is verified by an accelerator implemented on a Xilinx ZCU104 evaluation board. By improving existing pruning techniques and the storage format of sparse data, we significantly reduce the size of AlexNet by 28×, from 243 MB to 8.7 MB. In addition, our accelerator achieves an overall performance of 9.73 fps for the compressed AlexNet. Compared with the central processing unit (CPU) and graphics processing unit (GPU) platforms, our implementation achieves 182.3× and 1.1× improvements in latency and throughput, respectively, on the convolutional (CONV) layers of AlexNet, along with 822.0× and 15.8× improvements in energy efficiency, respectively. This novel compression strategy provides a reference for other neural network applications, including CNNs, long short-term memory (LSTM), and recurrent neural networks (RNNs).
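As a quick sanity check on the size figures quoted in this abstract, the overall compression factor can be recomputed from the two model sizes. The sketch below is illustrative only; the function name `compression_ratio` is hypothetical and not taken from the paper.

```python
def compression_ratio(original_mb: float, compressed_mb: float) -> float:
    """Return how many times smaller the compressed model is."""
    if compressed_mb <= 0:
        raise ValueError("compressed size must be positive")
    return original_mb / compressed_mb


# Sizes as reported in the abstract: 243 MB before, 8.7 MB after.
ratio = compression_ratio(243.0, 8.7)
print(f"AlexNet compression: about {ratio:.1f}x")
```

The result rounds to the 28× quoted in the abstract.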

Real-time Detection of Aortic Valve in Echocardiography using Convolutional Neural Networks

Current Medical Imaging (formerly Current Medical Imaging Reviews) ◽

10.2174/1573405615666190114151255 ◽

2020 ◽

Vol 16 (5) ◽

pp. 584-591 ◽

Cited By ~ 1

Author(s):

Muhammad Hanif Ahmad Nizar ◽

Chow Khuen Chan ◽

Azira Khalil ◽

Ahmad Khairuddin Mohamed Yusof ◽

Khin Wee Lai

Keyword(s):

Neural Network ◽

Neural Networks ◽

Heart Disease ◽

Aortic Valve ◽

Real Time ◽

Convolutional Neural Networks ◽

Valvular Heart Disease ◽

Detection System ◽

Processing Unit ◽

Real Time Detection

Background: Valvular heart disease is a serious disease leading to mortality and increasing medical care cost. The aortic valve is the valve most commonly affected by this disease. Doctors rely on echocardiograms for diagnosing and evaluating valvular heart disease. However, echocardiogram images are of poor quality in comparison to Computerized Tomography and Magnetic Resonance Imaging scans. This study proposes the development of Convolutional Neural Networks (CNN) that can function optimally during a live echocardiographic examination for detection of the aortic valve. An automated detection system in an echocardiogram will improve the accuracy of medical diagnosis and can provide further medical analysis from the resulting detection. Methods: Two detection architectures, Single Shot Multibox Detector (SSD) and Faster Regional based Convolutional Neural Network (R-CNN) with various feature extractors, were trained on echocardiography images from 33 patients. Thereafter, the models were tested on 10 echocardiography videos. Results: Faster R-CNN Inception v2 showed the highest accuracy (98.6%), followed closely by SSD Mobilenet v2. In terms of speed, SSD Mobilenet v2 resulted in a loss of 46.81% in frames-per-second (fps) during real-time detection but managed to perform better than the other neural network models. Additionally, SSD Mobilenet v2 used the least amount of Graphics Processing Unit (GPU) resources, but the Central Processing Unit (CPU) usage was relatively similar across all models. Conclusion: Our findings provide a foundation for implementing a convolutional detection system in echocardiography for medical purposes.

Neural Networks: Low‐Power Self‐Rectifying Memristive Artificial Neural Network for Near Internet‐of‐Things Sensor Computing (Adv. Electron. Mater. 6/2021)

Advanced Electronic Materials ◽

10.1002/aelm.202170017 ◽

2021 ◽

Vol 7 (6) ◽

pp. 2170017

Author(s):

Seok Choi ◽

Yong Kim ◽

Tien Van Nguyen ◽

Won Hee Jeong ◽

Kyeong‐Sik Min ◽

...

Keyword(s):

Neural Network ◽

Neural Networks ◽

Artificial Neural Network ◽

Internet Of Things ◽

Low Power ◽

Artificial Neural

Real-time, High-resolution Depth Upsampling on Embedded Accelerators

ACM Transactions on Embedded Computing Systems ◽

10.1145/3436878 ◽

2021 ◽

Vol 20 (3) ◽

pp. 1-22

Author(s):

David Langerman ◽

Alan George

Keyword(s):

High Resolution ◽

Low Power ◽

Real Time ◽

Mixed Reality ◽

Graphics Processing Unit ◽

Processing Unit ◽

Reconfigurable Logic ◽

Depth Sensors ◽

Time Requirements ◽

Graphics Processing

High-resolution, low-latency apps in computer vision are ubiquitous in today’s world of mixed-reality devices. These innovations provide a platform that can leverage the improving technology of depth sensors and embedded accelerators to enable higher-resolution, lower-latency processing for 3D scenes using depth-upsampling algorithms. This research demonstrates that filter-based upsampling algorithms are feasible for mixed-reality apps using low-power hardware accelerators. The authors parallelized and evaluated a depth-upsampling algorithm on two different devices: a reconfigurable-logic FPGA embedded within a low-power SoC; and a fixed-logic embedded graphics processing unit. We demonstrate that both accelerators can meet the real-time requirements of 11 ms latency for mixed-reality apps.

Deep learning for denoising

Geophysics ◽

10.1190/geo2018-0668.1 ◽

2019 ◽

Vol 84 (6) ◽

pp. V333-V350 ◽

Cited By ~ 15

Author(s):

Siwei Yu ◽

Jianwei Ma ◽

Wenlong Wang

Keyword(s):

Neural Network ◽

Deep Learning ◽

Graphics Processing Unit ◽

Random Noise ◽

Stochastic Gradient Descent ◽

Processing Unit ◽

Noise Attenuation ◽

Optimal Parameters ◽

Training Set ◽

Unknown Variance

In contrast to traditional seismic noise attenuation algorithms that depend on signal models and their corresponding prior assumptions, a deep neural network for noise removal is trained on a large training set in which the inputs are the raw data sets and the corresponding outputs are the desired clean data. After the completion of training, the deep-learning (DL) method achieves adaptive denoising with no requirement for (1) accurate modeling of the signal and noise or (2) optimal parameter tuning. We call this intelligent denoising. We have used a convolutional neural network (CNN) as the basic tool for DL. In random and linear noise attenuation, the training set is generated with artificially added noise. In the multiple attenuation step, the training set is generated with the acoustic wave equation. Stochastic gradient descent is used to solve for the optimal parameters of the CNN. The runtime of DL on a graphics processing unit for denoising is of the same order as that of the [Formula: see text]-[Formula: see text] deconvolution method. Synthetic and field results indicate the potential applications of DL in automatic attenuation of random noise (with unknown variance), linear noise, and multiples.

Design of Power Efficient 32-Bit Processing Unit

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.16.11415 ◽

2018 ◽

Vol 7 (2.16) ◽

pp. 52

Author(s):

Dharmavaram Asha Devi ◽

Chintala Sandeep ◽

Sai Sugun L

Keyword(s):

Low Power ◽

System Design ◽

Power Analysis ◽

Low Power Design ◽

Design Flow ◽

Processing Unit ◽

Verilog Hdl ◽

Power Efficient ◽

Front End ◽

Verification Process

This paper discusses the design, verification and analysis of a 32-bit Processing Unit. The complete front-end design flow is carried out using the Xilinx Vivado System Design Suite software tools, and target verification is done using an Artix-7 FPGA. The Virtual I/O concept is used for the verification process. The unit performs 32 different operations, including parity generation and code conversions: binary to Gray and Gray to binary. It is a low-power design implemented in Verilog HDL, and power analysis is performed with clock frequencies ranging from 10 MHz to 100 GHz. At all these frequencies, power analysis is verified for the different I/O standards LVCMOS12, LVCMOS25 and LVCMOS33.

Segment convolutional neural networks (Seg-CNNs) for classifying relations in clinical notes

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocx090 ◽

2017 ◽

Vol 25 (1) ◽

pp. 93-98 ◽

Cited By ~ 31

Author(s):

Yuan Luo ◽

Yu Cheng ◽

Özlem Uzuner ◽

Peter Szolovits ◽

Justin Starren

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

State Of The Art ◽

Graphics Processing Unit ◽

Medical Problem ◽

Feature Engineering ◽

Processing Unit ◽

Clinical Notes ◽

Overall Evaluation ◽

Relation Classification

We propose Segment Convolutional Neural Networks (Seg-CNNs) for classifying relations from clinical notes. Seg-CNNs use only word-embedding features without manual feature engineering. Unlike typical CNN models, relations between 2 concepts are identified by simultaneously learning separate representations for text segments in a sentence: preceding, concept1, middle, concept2, and succeeding. We evaluate Seg-CNN on the i2b2/VA relation classification challenge dataset. We show that Seg-CNN achieves a state-of-the-art micro-average F-measure of 0.742 for overall evaluation, 0.686 for classifying medical problem–treatment relations, 0.820 for medical problem–test relations, and 0.702 for medical problem–medical problem relations. We demonstrate the benefits of learning segment-level representations. We show that medical domain word embeddings help improve relation classification. Seg-CNNs can be trained quickly for the i2b2/VA dataset on a graphics processing unit (GPU) platform. These results support the use of CNNs computed over segments of text for classifying medical relations, as they show state-of-the-art performance while requiring no manual feature engineering.

Artificial Neural Networks on FPGAs for Real-Time Energy Reconstruction of the ATLAS LAr Calorimeters

Computing and Software for Big Science ◽

10.1007/s41781-021-00066-y ◽

2021 ◽

Vol 5 (1) ◽

Author(s):

Georges Aad ◽

Anne-Sophie Berthold ◽

Thomas Calvet ◽

Nemer Chiedde ◽

Etienne Marie Fortin ◽

...

Keyword(s):

Neural Network ◽

Neural Networks ◽

Real Time ◽

Hadron Collider ◽

Liquid Argon ◽

Learning Approaches ◽

High Luminosity ◽

Proton Proton ◽

Real Time Processing ◽

Energy Reconstruction

The ATLAS experiment at the Large Hadron Collider (LHC) is operated at CERN and measures proton–proton collisions at multi-TeV energies with a repetition frequency of 40 MHz. Within the phase-II upgrade of the LHC, the readout electronics of the liquid-argon (LAr) calorimeters of ATLAS are being prepared for high luminosity operation expecting a pileup of up to 200 simultaneous proton–proton interactions. Moreover, the calorimeter signals of up to 25 subsequent collisions are overlapping, which increases the difficulty of energy reconstruction by the calorimeter detector. Real-time processing of digitized pulses sampled at 40 MHz is performed using field-programmable gate arrays (FPGAs). To cope with the signal pileup, new machine learning approaches are explored: convolutional and recurrent neural networks outperform the optimal signal filter currently used, both in assignment of the reconstructed energy to the correct proton bunch crossing and in energy resolution. The improvements concern in particular energies derived from overlapping pulses. Since the implementation of the neural networks targets an FPGA, the number of parameters and the mathematical operations need to be well controlled. The trained neural network structures are converted into FPGA firmware using automated implementations in hardware description language and high-level synthesis tools. Very good agreement between neural network implementations in FPGA and software based calculations is observed. The prototype implementations on an Intel Stratix-10 FPGA reach maximum operation frequencies of 344–640 MHz. Applying time-division multiplexing allows the processing of 390–576 calorimeter channels by one FPGA for the most resource-efficient networks. Moreover, the latency achieved is about 200 ns. These performance parameters show that a neural-network based energy reconstruction can be considered for the processing of the ATLAS LAr calorimeter signals during the high-luminosity phase of the LHC.
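To illustrate why time-division multiplexing is attractive at these clock rates, the sketch below estimates how many 40 MHz channels a single network instance could serve if it processed one sample per clock cycle. That one-sample-per-cycle model is an idealized assumption, not something stated in the abstract, and the name `tdm_factor` is hypothetical.

```python
SAMPLE_RATE_MHZ = 40.0  # pulse sampling rate stated in the abstract


def tdm_factor(clock_mhz: float, sample_mhz: float = SAMPLE_RATE_MHZ) -> int:
    """Channels one network instance could serve per sample period.

    Assumes (idealized) one processed sample per clock cycle.
    """
    return int(clock_mhz // sample_mhz)


# Maximum operating frequencies reported for the Stratix-10 prototypes.
for clock in (344.0, 640.0):
    print(f"{clock:.0f} MHz clock: up to {tdm_factor(clock)} channels per instance")
```

Under this assumption a single instance would serve roughly 8 to 16 channels, so the 390–576 channels per FPGA reported in the abstract would also depend on how many network instances fit on the device.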

Exploring Graphics Processing Unit (GPU) Resource Sharing Efficiency for High Performance Computing

Computers ◽

10.3390/computers2040176 ◽

2013 ◽

Vol 2 (4) ◽

pp. 176-214 ◽

Cited By ~ 5

Author(s):

Teng Li ◽

Vikram Narayana ◽

Tarek El-Ghazawi

Keyword(s):

High Performance Computing ◽

Resource Sharing ◽

High Performance ◽

Graphics Processing Unit ◽

Processing Unit ◽

Graphics Processing ◽

Performance Computing

Solving the motion planning problem by using neural networks

Robotica ◽

10.1017/s0263574700017343 ◽

1994 ◽

Vol 12 (4) ◽

pp. 323-333 ◽

Cited By ~ 4

Author(s):

R.H.T. Chan ◽

P.K.S. Tam ◽

D.N.K. Leung

Keyword(s):

Neural Network ◽

Neural Networks ◽

Motion Planning ◽

Configuration Space ◽

Logic Gates ◽

Moving Object ◽

Planning Problem ◽

Processing Unit ◽

Configuration Point ◽

Motion Planning Problem

This paper presents a new neural-network-based method to solve the motion planning problem, i.e. to construct a collision-free path for a moving object among fixed obstacles. Our ‘navigator’ basically consists of two neural networks: The first one is a modified feed-forward neural network, which is used to determine the configuration space; the moving object is modelled as a configuration point in the configuration space. The second neural network is a modified bidirectional associative memory, which is used to find a path for the configuration point through the configuration space while avoiding the configuration obstacles. The basic processing unit of the neural networks may be constructed using logic gates, including AND gates, OR gates, NOT gates and flip-flops. Examples of efficient solutions to difficult motion planning problems using our proposed techniques are presented.
