ASimOV: A Framework for Simulation and Optimization of an Embedded AI Accelerator

Micromachines ◽  
2021 ◽  
Vol 12 (7) ◽  
pp. 838
Author(s):  
Dong Hyun Hwang ◽  
Chang Yeop Han ◽  
Hyun Woo Oh ◽  
Seung Eun Lee

Artificial intelligence algorithms typically require an external computing device such as a graphics processing unit (GPU) due to their computational complexity. To run artificial intelligence algorithms on embedded devices, many studies have proposed lightweight algorithms and dedicated accelerators. In this paper, we propose the ASimOV framework, which optimizes artificial intelligence algorithms and generates Verilog hardware description language (HDL) code for executing them on a field-programmable gate array (FPGA). To verify ASimOV, we explore the performance space of k-NN algorithms and generate Verilog HDL code to demonstrate a k-NN accelerator on an FPGA. Our contribution is an end-to-end pipeline that optimizes an artificial intelligence algorithm for a specific dataset through simulation and, in the end, generates the corresponding accelerator.
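The k-NN kernel that such an accelerator parallelizes is simple enough to sketch in software. The following minimal NumPy illustration shows the distance-sort-vote pipeline; the function name and toy data are ours, not ASimOV's:

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Classify one query vector by majority vote among its k nearest
    training points (Euclidean distance), the kernel that a k-NN
    accelerator parallelizes in hardware."""
    dists = np.linalg.norm(train_x - query, axis=1)   # distance to every stored sample
    nearest = np.argsort(dists)[:k]                   # indices of the k closest samples
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority label wins

# Toy usage: two well-separated 2-D clusters
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
train_y = np.array([0, 0, 1, 1])
print(knn_predict(train_x, train_y, np.array([0.2, 0.1])))  # -> 0
print(knn_predict(train_x, train_y, np.array([5.1, 5.1])))  # -> 1
```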

Electronics ◽  
2021 ◽  
Vol 10 (8) ◽  
pp. 884
Author(s):  
Stefano Rossi ◽  
Enrico Boni

Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources, which allow massive exploitation of parallel computing, are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like the ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the integration of the embedded NVIDIA Jetson Xavier AGX module on board the ULA-OP 256. The system architecture was revised to introduce a new Peripheral Component Interconnect Express (PCIe) communication channel while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system self-contained, freeing the user from the need for an external controlling PC.


Electronics ◽  
2020 ◽  
Vol 9 (10) ◽  
pp. 1666
Author(s):  
Zhe Han ◽  
Jingfei Jiang ◽  
Linbo Qiao ◽  
Yong Dou ◽  
Jinwei Xu ◽  
...  

Recently, Deep Neural Networks (DNNs) have been widely used in natural language processing. However, DNNs are often computation-intensive and memory-expensive, which makes deploying them in the real world difficult. To address this problem, we proposed a network model based on the dilated gated convolutional neural network, which is very hardware-friendly. We further expanded the word representations and the depth of the network to improve its performance. We replaced the Sigmoid function with a hardware-friendly alternative at no loss in accuracy, and we quantized the network weights and activations to compress the network size. We then proposed the first FPGA (Field Programmable Gate Array)-based event detection accelerator built on the proposed model. The accelerator significantly reduces latency with its fully pipelined architecture. We implemented the accelerator on the Xilinx XCKU115 FPGA. The experimental results show that our model obtains the highest F1-score, 84.6%, on the ACE 2005 corpus. Meanwhile, the accelerator achieves 95.2 giga operations per second (GOP/s) in performance and 13.4 GOPS/W in energy efficiency, which are 17× and 158× higher, respectively, than a Graphics Processing Unit (GPU) implementation.
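Two of the hardware-friendly transformations mentioned above, replacing the sigmoid and quantizing the weights, can be sketched in a few lines. The particular hard-sigmoid form and the symmetric int8 scheme below are illustrative assumptions, not necessarily the paper's exact choices:

```python
import numpy as np

def hard_sigmoid(x):
    """Piecewise-linear stand-in for the sigmoid: one multiply, one add,
    one clamp, so it maps cheaply onto FPGA logic. (Illustrative choice,
    not necessarily the function the authors used.)"""
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

def quantize_int8(w):
    """Symmetric 8-bit quantization of a weight tensor: store int8 codes
    plus one float scale, shrinking storage roughly 4x vs. float32."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

w = np.array([-1.5, -0.3, 0.0, 0.7, 1.5], dtype=np.float32)
q, s = quantize_int8(w)
print(q)                           # int8 codes
print(np.max(np.abs(q * s - w)))   # worst-case rounding error, at most scale/2
```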


2021 ◽  
Vol 9 (2) ◽  
pp. 570-580
Author(s):  
Mert Kayış ◽  

For many years, various studies have attempted to classify the makams of Classical Turkish Music. Significant differences of opinion have emerged, from past to present, in how makams are classified in music education and literacy. This situation creates problems both in learning makams during music education and in recognizing makams by ear. In addition, individual mistakes made while notating scores leave the makam genre of a song uncertain. This is a problem not only for those studying Turkish Classical Music but also for anyone interested in this type of music. The objective of this research is therefore to contribute to makam classification in Classical Turkish Music education by developing a Music Information Retrieval (MIR) system that determines the makam of a song. In principle, we can extract the properties of sound signals with time wavelet scattering feature extraction, classify them with a Support Vector Machine (SVM), and thereby distinguish between types of makams. In this study, an MIR system covering eight different makams was created using the artificial intelligence (AI) method of Support Vector Machines (SVM) together with time wavelet scattering feature extraction, with a Graphics Processing Unit (GPU) accelerator used for feature extraction. The classification process was modeled in MATLAB. The study achieved a success rate of 98.21%, higher than other studies in the literature. After the classification procedure was completed, makams were identified by submitting samples from different sound files to the system, which was built on a database of the eight makams; classification and detection succeeded with nearly one hundred percent accuracy. With the application of artificial intelligence, the difficulties in classifying the makams of Classical Turkish Music described above, faced by those who have received this training or are interested in the subject, have been overcome.
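The classification stage of such a pipeline can be illustrated with a minimal linear SVM. The sketch below trains one with the Pegasos sub-gradient method on toy 2-D points standing in for two makam classes; the actual study uses wavelet-scattering features and MATLAB's SVM implementation, so the data and the method variant here are simplified stand-ins:

```python
import numpy as np

def pegasos_train(X, y, lam=0.01, epochs=200, seed=0):
    """Train a linear SVM with the Pegasos sub-gradient method.
    Labels y must be in {-1, +1}. A stand-in for the SVM the study
    trains on wavelet-scattering features in MATLAB."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)              # decreasing step size
            if y[i] * (w @ X[i]) < 1:          # margin violated: hinge-loss step
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                              # only regularization shrinkage
                w = (1 - eta * lam) * w
    return w

# Toy stand-in for two makam classes in a 2-D feature space
X = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 1.0],
              [-1.0, -2.0], [-2.0, -1.0], [-1.5, -1.5]])
y = np.array([1, 1, 1, -1, -1, -1])
w = pegasos_train(X, y)
print((np.sign(X @ w) == y).mean())  # training accuracy on separable data
```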


2014 ◽  
Vol 2014 ◽  
pp. 1-7 ◽  
Author(s):  
Xingyi Zhang ◽  
Bangju Wang ◽  
Zhuanlian Ding ◽  
Jin Tang ◽  
Juanjuan He

Membrane algorithms are a new class of parallel algorithms that attempt to incorporate components of membrane computing models, such as the structure of the models and the way cells communicate, into the design of efficient optimization algorithms. Although the importance of the parallelism of such algorithms is well recognized, membrane algorithms have usually been implemented on a serial computing device, the central processing unit (CPU), which prevents them from working efficiently. In this work, we consider the implementation of membrane algorithms on a parallel computing device, the graphics processing unit (GPU), where all cells of a membrane algorithm can work simultaneously. Experimental results on two classical intractable problems, the point set matching problem and the traveling salesman problem (TSP), show that the GPU implementation of membrane algorithms is much more efficient than the CPU implementation in terms of runtime, especially for problems of high complexity.
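A toy sketch of the idea, with vectorized NumPy operations standing in for GPU threads: every cell mutates a candidate solution to a simple minimization problem simultaneously, keeps improvements, and the cells communicate the current global best. This is an illustrative stand-in for the parallel-cell execution model, not the paper's algorithm:

```python
import numpy as np

def membrane_step(cells, rng, sigma=0.1):
    """One synchronous update of every cell: each cell mutates its candidate
    solution, keeps the better of old/new (minimizing sum of squares), then
    cells communicate by copying the global best into cell 0. Vectorized over
    all cells at once, mirroring a one-thread-per-cell GPU layout."""
    trial = cells + rng.normal(0.0, sigma, cells.shape)       # all cells mutate in parallel
    better = (trial ** 2).sum(axis=1) < (cells ** 2).sum(axis=1)
    cells = np.where(better[:, None], trial, cells)           # elementwise selection
    best = cells[(cells ** 2).sum(axis=1).argmin()]
    cells[0] = best                                           # inter-cell communication
    return cells

rng = np.random.default_rng(42)
cells = rng.uniform(-5, 5, size=(64, 3))     # 64 cells searching a 3-D space
start = (cells ** 2).sum(axis=1).min()
for _ in range(500):
    cells = membrane_step(cells, rng)
end = (cells ** 2).sum(axis=1).min()
print(start, "->", end)   # best objective value only ever improves
```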


Author(s):  
X S Pogorelskih ◽  
L V Loganova

In this paper, we research and compare different implementations of the cyclic reduction and sweep (Thomas) algorithms on a graphics processing unit (GPU) using different types of device memory. We found that the sweep algorithm should be used when solving a set of tridiagonal linear systems, whereas a parallel version of the cyclic reduction algorithm with partial use of shared memory shows the best results when solving a single linear system on the GPU.
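For reference, the sweep (Thomas) algorithm the paper compares against can be written in a few lines; cyclic reduction is its parallel counterpart. This serial NumPy version is a textbook implementation, not the authors' code:

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with the sweep (Thomas) algorithm:
    a = sub-diagonal (a[0] unused), b = main diagonal, c = super-diagonal
    (c[-1] unused), d = right-hand side. O(n) forward elimination plus
    back substitution, the serial baseline that cyclic reduction
    parallelizes on the GPU."""
    n = len(d)
    cp = np.empty(n)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # forward sweep
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# 4x4 example: a diagonally dominant system 4*x_i - x_{i-1} - x_{i+1} = 5
a = np.array([0.0, -1.0, -1.0, -1.0])
b = np.array([4.0, 4.0, 4.0, 4.0])
c = np.array([-1.0, -1.0, -1.0, 0.0])
d = np.array([5.0, 5.0, 5.0, 5.0])
x = thomas_solve(a, b, c, d)
print(x)
```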


Technologies ◽  
2020 ◽  
Vol 8 (1) ◽  
pp. 6 ◽  
Author(s):  
Vasileios Leon ◽  
Spyridon Mouselinos ◽  
Konstantina Koliogeorgi ◽  
Sotirios Xydis ◽  
Dimitrios Soudris ◽  
...  

The workloads of Convolutional Neural Networks (CNNs) exhibit a streaming nature that makes them attractive for reconfigurable architectures such as Field-Programmable Gate Arrays (FPGAs), while their increasing need for low power and speed has established Application-Specific Integrated Circuit (ASIC)-based accelerators as alternative efficient solutions. During the last five years, the development of Hardware Description Language (HDL)-based CNN accelerators, either for FPGA or ASIC, has seen huge academic interest due to their high performance and room for optimizations. Towards this direction, we propose a library-based framework, which extends TensorFlow, the well-established machine learning framework, and automatically generates high-throughput CNN inference engines for FPGAs and ASICs. The framework allows software developers to exploit the benefits of FPGA/ASIC acceleration without requiring any expertise in HDL development and low-level design. Moreover, it provides a set of optimization knobs concerning the model architecture and the inference engine generation, allowing the developer to tune the accelerator according to the requirements of the respective use case. Our framework is evaluated by optimizing the LeNet CNN model on the MNIST dataset, and implementing FPGA- and ASIC-based accelerators using the generated inference engine. The optimal FPGA-based accelerator on Zynq-7000 delivers 93% less memory footprint and 54% less Look-Up Table (LUT) utilization, and up to 10× speedup on the inference execution vs. different Graphics Processing Unit (GPU) and Central Processing Unit (CPU) implementations of the same model, in exchange for a negligible accuracy loss, i.e., 0.89%. For the same accuracy drop, the 45 nm standard-cell-based ASIC accelerator provides an implementation which operates at 520 MHz and occupies an area of 0.059 mm², while the power consumption is ∼7.5 mW.
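To make the optimization trade-offs concrete, the bookkeeping such a generator performs can be sketched by tallying per-layer parameters and multiply-accumulates for a LeNet-5-style model. The layer shapes below are the textbook LeNet-5 ones on MNIST padded to 32x32, an assumption, since the paper's tuned variants will differ:

```python
def conv_stats(cin, cout, k, out_hw):
    """Parameters and multiply-accumulates of a k x k convolution layer."""
    params = cout * (cin * k * k + 1)              # weights + one bias per filter
    macs = cin * k * k * cout * out_hw * out_hw    # one MAC per weight per output pixel
    return params, macs

def fc_stats(nin, nout):
    """Parameters and MACs of a fully connected layer."""
    return nout * (nin + 1), nin * nout

# Textbook LeNet-5 on 32x32 input: conv 1->6 5x5 (28x28 out), 2x2 pool,
# conv 6->16 5x5 (10x10 out), 2x2 pool, then FCs 400-120-84-10.
layers = [
    ("conv1", conv_stats(1, 6, 5, 28)),
    ("conv2", conv_stats(6, 16, 5, 10)),
    ("fc1",   fc_stats(16 * 5 * 5, 120)),
    ("fc2",   fc_stats(120, 84)),
    ("fc3",   fc_stats(84, 10)),
]
total_params = sum(p for _, (p, _) in layers)
total_macs = sum(m for _, (_, m) in layers)
for name, (p, m) in layers:
    print(f"{name:5s} params={p:6d} MACs={m:8d}")
print("total params:", total_params, "total MACs:", total_macs)
```

This yields the well-known figure of roughly 61.7k parameters for LeNet-5, the kind of quantity a generator's architecture knobs trade against accuracy and resource utilization.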


Electronics ◽  
2019 ◽  
Vol 8 (3) ◽  
pp. 281 ◽  
Author(s):  
Bing Liu ◽  
Danyin Zou ◽  
Lei Feng ◽  
Shou Feng ◽  
Ping Fu ◽  
...  

The Convolutional Neural Network (CNN) has been used in many fields and has achieved remarkable results in tasks such as image classification, face detection, and speech recognition. Compared to a GPU (graphics processing unit) or an ASIC, an FPGA (field programmable gate array)-based CNN accelerator has great advantages due to its low power consumption and reconfigurability. However, the FPGA's extremely limited resources and the CNN's huge number of parameters and computational complexity pose great challenges to the design. Based on the ZYNQ heterogeneous platform, and coordinating resource and bandwidth issues with the roofline model, the CNN accelerator we designed can accelerate both standard convolution and depthwise separable convolution with a high hardware resource utilization rate. The accelerator can handle network layers of different scales through parameter configuration, and it maximizes bandwidth and achieves a fully pipelined design by using a data stream interface and a ping-pong on-chip cache. The experimental results show that the accelerator designed in this paper can achieve 17.11 GOPS for 32-bit floating point while also accelerating depthwise separable convolution, which gives it obvious advantages over other designs.
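The appeal of supporting depthwise separable convolution in hardware follows from a simple operation count: a depthwise pass plus a 1x1 pointwise pass needs far fewer multiply-accumulates than one standard convolution. A quick sketch with illustrative layer sizes (not taken from the paper):

```python
def standard_conv_macs(cin, cout, k, h, w):
    """MACs of a standard k x k convolution over an h x w output map."""
    return k * k * cin * cout * h * w

def depthwise_separable_macs(cin, cout, k, h, w):
    """Depthwise k x k (one filter per input channel) + 1x1 pointwise conv."""
    return k * k * cin * h * w + cin * cout * h * w

# Example layer: 3x3 convolution, 32 -> 64 channels on a 56x56 map
std = standard_conv_macs(32, 64, 3, 56, 56)
sep = depthwise_separable_macs(32, 64, 3, 56, 56)
print(f"standard: {std:,}  separable: {sep:,}  reduction: {std / sep:.1f}x")
```

The reduction factor equals 1/cout + 1/k², so for a 3x3 kernel and 64 output channels the separable form needs roughly an eighth of the operations, which is why a single accelerator datapath that handles both layer types pays off.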


2007 ◽  
Author(s):  
Fredrick H. Rothganger ◽  
Kurt W. Larson ◽  
Antonio Ignacio Gonzales ◽  
Daniel S. Myers
