A TensorFlow Extension Framework for Optimized Generation of Hardware CNN Inference Engines

Technologies
2020
Vol 8 (1)
pp. 6
Author(s):  
Vasileios Leon ◽  
Spyridon Mouselinos ◽  
Konstantina Koliogeorgi ◽  
Sotirios Xydis ◽  
Dimitrios Soudris ◽  
...  

The workloads of Convolutional Neural Networks (CNNs) exhibit a streaming nature that makes them attractive for reconfigurable architectures such as Field-Programmable Gate Arrays (FPGAs), while their increased need for low power and speed has established Application-Specific Integrated Circuit (ASIC)-based accelerators as alternative efficient solutions. During the last five years, the development of Hardware Description Language (HDL)-based CNN accelerators, either for FPGA or ASIC, has seen huge academic interest due to their high performance and room for optimization. Towards this direction, we propose a library-based framework, which extends TensorFlow, the well-established machine learning framework, and automatically generates high-throughput CNN inference engines for FPGAs and ASICs. The framework allows software developers to exploit the benefits of FPGA/ASIC acceleration without requiring any expertise in HDL development and low-level design. Moreover, it provides a set of optimization knobs concerning the model architecture and the inference engine generation, allowing the developer to tune the accelerator according to the requirements of the respective use case. Our framework is evaluated by optimizing the LeNet CNN model on the MNIST dataset and implementing FPGA- and ASIC-based accelerators using the generated inference engine. The optimal FPGA-based accelerator on Zynq-7000 delivers 93% less memory footprint and 54% less Look-Up Table (LUT) utilization, and up to 10× speedup on the inference execution vs. different Graphics Processing Unit (GPU) and Central Processing Unit (CPU) implementations of the same model, in exchange for a negligible accuracy loss, i.e., 0.89%. For the same accuracy drop, the 45 nm standard-cell-based ASIC accelerator provides an implementation which operates at 520 MHz and occupies an area of 0.059 mm², while the power consumption is ∼7.5 mW.
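As a rough illustration of the workflow this abstract describes, the sketch below defines LeNet for MNIST with standard tf.keras; the commented-out generate_engine() call and its knobs (target device, weight bit-width, parallelism) are purely hypothetical stand-ins for the framework's HDL-generation step, whose real API is not given here.

```python
# Illustrative sketch only: the paper's actual extension API is not reproduced here.
# LeNet on MNIST is defined with standard tf.keras; the generate_engine() call
# below is a HYPOTHETICAL stand-in for the framework's inference-engine generation.
import tensorflow as tf

def build_lenet(num_classes=10):
    """Classic LeNet-style CNN for 28x28 grayscale MNIST digits."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(6, 5, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(16, 5, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120, activation="relu"),
        tf.keras.layers.Dense(84, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_lenet()

# Hypothetical optimization knobs a generator of this kind might expose
# (fixed-point width, parallelism, target device); names are illustrative only.
# engine = hdl_generator.generate_engine(model, target="zynq7000",
#                                        weight_bits=8, pe_parallelism=4)
```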

Electronics
2021
Vol 10 (8)
pp. 884
Author(s):  
Stefano Rossi ◽  
Enrico Boni

Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources, which allow massive exploitation of parallel computing, are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the integration of the embedded NVIDIA Jetson Xavier AGX module on board ULA-OP 256. The system architecture was revised to introduce a new Peripheral Component Interconnect Express (PCIe) communication channel while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system stand-alone, freeing the user from the need for an external controlling PC.
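For context only, the following NumPy sketch shows the kind of per-line echo processing (envelope detection via an FFT-based Hilbert transform) that a GPU module such as the Jetson Xavier AGX could offload; it is an illustrative assumption, not code from ULA-OP 256.

```python
# Minimal NumPy illustration (not ULA-OP 256 firmware): envelope detection of an
# RF echo line via the analytic signal, a typical per-line processing step that a
# GPU module could take over from FPGA/DSP resources.
import numpy as np

def envelope(rf_line):
    """Return the envelope of a real RF signal using an FFT-based Hilbert transform."""
    n = rf_line.size
    spectrum = np.fft.fft(rf_line)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[1:n // 2] = 2.0
        h[n // 2] = 1.0
    else:
        h[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(spectrum * h)
    return np.abs(analytic)

# Synthetic 5 MHz pulse sampled at 50 MHz (assumed parameters for the example)
fs, f0 = 50e6, 5e6
t = np.arange(0, 4e-6, 1 / fs)
rf = np.sin(2 * np.pi * f0 * t) * np.exp(-((t - 2e-6) ** 2) / (2 * (0.3e-6) ** 2))
env = envelope(rf)
```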


Author(s):  
Mini P. Varghese ◽  
A. Manjunatha ◽  
T. V. Snehaprabha

In the current digital environment, central processing units (CPUs), field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and peripherals are growing progressively complex. Multiphase buck regulators are now common on motherboards in many areas of computing, from laptops and tablets to servers and Ethernet switches, because of the higher power requirements. This study describes a four-stage buck converter with a phase shedding scheme that can be used to power processors in programmable logic controllers (PLCs). The proposed power supply is designed to generate a regulated voltage with minimal ripple. Because of the suggested phase shedding method, this power supply also offers better light-load efficiency. For this objective, a multiphase system with phase shedding is modeled in MATLAB Simulink, and the findings are validated.
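As a minimal sketch of the phase-shedding idea (not the authors' Simulink model), the snippet below picks how many of four phases to keep active as the load current varies; the 10 A per-phase figure is an assumed value for illustration.

```python
# Illustrative phase-shedding policy for a 4-phase buck converter: shed phases at
# light load so that fixed switching losses do not dominate, keeping per-phase
# current within an assumed limit. Not the authors' controller.
import math

def active_phases(load_current_a, max_phases=4, current_per_phase_a=10.0):
    """Return the number of phases to keep enabled for a given load current."""
    needed = max(1, math.ceil(load_current_a / current_per_phase_a))
    return min(max_phases, needed)

for i_load in (2, 8, 15, 25, 38):
    print(f"{i_load} A -> {active_phases(i_load)} phase(s)")
```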


2012
Vol 463-464
pp. 1073-1076
Author(s):  
Helmar Alvares ◽  
Eliana Prado Lopes Aude ◽  
Ernesto Prado Lopes

This work proposes a Web-based laboratory where researchers share the facilities of a simulation environment for parallel algorithms that solve the scheduling problem known as the Job Shop Problem (JSP). The environment supports multi-language platforms and uses a low-cost, high-performance Graphics Processing Unit (GPU) connected to a Java application server to help design more efficient solutions for the JSP. Within a single web environment one can analyze and compare different methods and meta-heuristics. Each newly developed method is stored in an environment library and made available to all other users of the environment. This growing collection of openly accessible solution methods allows for rapid convergence towards optimal JSP solutions. The algorithm uses the parallel architecture of the system to handle threads. Each thread represents a job operation, and the number of threads scales with the problem's size. The threads exchange information in order to find the best solution. This cooperation decreases response times by one or two orders of magnitude.
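For illustration, the following pure-Python routine decodes a candidate solution of a tiny JSP instance into its makespan; this is the kind of evaluation each cooperating worker would perform repeatedly, not the paper's actual GPU kernel.

```python
# Illustrative sketch (independent of the paper's Web environment): makespan
# evaluation for a small Job Shop instance, given an operation priority list.
def makespan(jobs, op_order):
    """
    jobs: list of jobs, each a list of (machine, duration) operations in fixed order.
    op_order: sequence of job indices; each occurrence schedules that job's next op.
    """
    next_op = [0] * len(jobs)          # next operation index per job
    job_ready = [0] * len(jobs)        # time at which each job is free
    machine_ready = {}                 # time at which each machine is free
    end = 0
    for j in op_order:
        machine, duration = jobs[j][next_op[j]]
        start = max(job_ready[j], machine_ready.get(machine, 0))
        finish = start + duration
        job_ready[j] = finish
        machine_ready[machine] = finish
        next_op[j] += 1
        end = max(end, finish)
    return end

# Two jobs, two machines (a tiny assumed instance)
jobs = [[(0, 3), (1, 2)], [(1, 4), (0, 1)]]
print(makespan(jobs, [0, 1, 0, 1]))   # -> 6
```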


Electronics
2019
Vol 8 (3)
pp. 295
Author(s):  
Min Zhang ◽  
Linpeng Li ◽  
Hai Wang ◽  
Yan Liu ◽  
Hongbo Qin ◽  
...  

The field programmable gate array (FPGA) is widely considered a promising platform for convolutional neural network (CNN) acceleration. However, the large number of parameters in CNNs causes heavy computing and memory burdens for FPGA-based CNN implementations. To solve this problem, this paper proposes an optimized compression strategy and realizes an FPGA-based accelerator for CNNs. Firstly, a reversed-pruning strategy is proposed which reduces the number of parameters of AlexNet by a factor of 13× without accuracy loss on the ImageNet dataset. Peak-pruning is further introduced to achieve better compressibility. Moreover, quantization gives another 4× reduction with negligible loss of accuracy. Secondly, an efficient storage technique is presented that reduces the caching overhead of both the convolutional layer and the fully connected layer. Finally, the effectiveness of the proposed strategy is verified by an accelerator implemented on a Xilinx ZCU104 evaluation board. By improving existing pruning techniques and the storage format of sparse data, we significantly reduce the size of AlexNet by 28×, from 243 MB to 8.7 MB. In addition, the overall performance of our accelerator achieves 9.73 fps for the compressed AlexNet. Compared with central processing unit (CPU) and graphics processing unit (GPU) platforms, our implementation achieves 182.3× and 1.1× improvements in latency and throughput, respectively, on the convolutional (CONV) layers of AlexNet, with 822.0× and 15.8× improvements in energy efficiency, respectively. This novel compression strategy provides a reference for other neural network applications, including CNNs, long short-term memory (LSTM), and recurrent neural networks (RNNs).
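The sketch below illustrates generic magnitude pruning and symmetric 8-bit quantization in NumPy; it is not the paper's reversed-pruning or peak-pruning algorithm, only a minimal example of the two compression steps the abstract combines.

```python
# Generic illustration (not the paper's reversed-/peak-pruning): magnitude pruning
# followed by symmetric 8-bit quantization of a weight matrix.
import numpy as np

def prune_by_magnitude(weights, keep_ratio):
    """Zero out all but the largest-|w| entries; keep_ratio in (0, 1]."""
    flat = np.abs(weights).ravel()
    k = max(1, int(keep_ratio * flat.size))
    threshold = np.partition(flat, flat.size - k)[flat.size - k]
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def quantize_int8(weights):
    """Symmetric linear quantization of float weights to int8 plus a scale factor."""
    scale = np.max(np.abs(weights)) / 127.0 or 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(256, 256).astype(np.float32)
w_sparse = prune_by_magnitude(w, keep_ratio=1 / 13)   # roughly a 13x pruning ratio
q, scale = quantize_int8(w_sparse)
print("nonzeros kept:", np.count_nonzero(w_sparse), "of", w.size)
```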


Sensors
2020
Vol 20 (14)
pp. 3969
Author(s):  
Hongzhi Huang ◽  
Yakun Wu ◽  
Mengqi Yu ◽  
Xuesong Shi ◽  
Fei Qiao ◽  
...  

Visual semantic segmentation, which is represented by the semantic segmentation network, has been widely used in many fields, such as intelligent robots, security, and autonomous driving. However, these Convolutional Neural Network (CNN)-based networks have high requirements for computing resources and programmability on hardware platforms. For embedded platforms and terminal devices in particular, Graphics Processing Unit (GPU)-based computing platforms cannot meet these requirements in terms of size and power consumption. In contrast, the Field Programmable Gate Array (FPGA)-based hardware system not only has flexible programmability and high embeddability, but can also meet lower power consumption requirements, which makes it an appropriate solution for semantic segmentation on terminal devices. In this paper, we demonstrate EDSSA, an Encoder-Decoder semantic segmentation network accelerator architecture that can be implemented with flexible parameter configurations and hardware resources on FPGA platforms that support Open Computing Language (OpenCL) development. We introduce the related technologies, architecture design, algorithm optimization, and hardware implementation, using the Encoder-Decoder semantic segmentation network SegNet as an example, and undertake a performance evaluation. Using an Intel Arria-10 GX1150 platform for evaluation, our work achieves a throughput higher than 432.8 GOP/s with a power consumption of about 20 W, which is a 1.2× improvement in energy-efficiency ratio compared to a high-performance GPU.
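A quick back-of-the-envelope check of the reported efficiency figures (the GPU baseline below is inferred from the stated 1.2× ratio, not quoted from the paper):

```python
# Values taken from the abstract; the GPU figure is an inferred estimate only.
fpga_throughput_gops = 432.8
fpga_power_w = 20.0
fpga_efficiency = fpga_throughput_gops / fpga_power_w   # ~21.6 GOP/s per watt
gpu_efficiency_est = fpga_efficiency / 1.2               # ~18.0 GOP/s per watt implied
print(round(fpga_efficiency, 1), round(gpu_efficiency_est, 1))
```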


2017
Vol 27 (03n04)
pp. 1750006
Author(s):  
Farhad Merchant ◽  
Anupam Chattopadhyay ◽  
Soumyendu Raha ◽  
S. K. Nandy ◽  
Ranjani Narayan

Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) form basic building blocks for several High Performance Computing (HPC) applications and hence dictate the performance of those applications. Performance in such tuned packages is attained through tuning of several algorithmic and architectural parameters, such as the number of parallel operations in the Directed Acyclic Graph of the BLAS/LAPACK routines, the sizes of the memories in the memory hierarchy of the underlying platform, the bandwidth of the memory, and the structure of the compute resources in the underlying platform. In this paper, we closely investigate the impact of the Floating Point Unit (FPU) micro-architecture on performance tuning of BLAS and LAPACK. We present a theoretical analysis of the pipeline depth of different floating point operators, such as the multiplier, adder, square root, and divider, followed by a characterization of BLAS and LAPACK to determine several parameters required in the theoretical framework for deciding the optimum pipeline depth of the floating point operators. A simple design of a Processing Element (PE) is presented, and the PE is shown to outperform the most recent custom realizations of BLAS and LAPACK by 1.1× to 1.5× in GFlops/W and 1.9× to 2.1× in GFlops/mm². Compared to multicore, General Purpose Graphics Processing Unit (GPGPU), Field Programmable Gate Array (FPGA), and ClearSpeed CSX700 platforms, a performance improvement of 1.8-80× is reported for the PE.
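To make the pipeline-depth trade-off concrete, here is a toy throughput model (an assumption for illustration, not the authors' theoretical framework): a dependent floating point operation stalls unless enough independent operations are available to cover the pipeline latency.

```python
# Toy analytical sketch: steady-state throughput of a pipelined FPU when consecutive
# dependent operations are separated by 'independent_ops' other operations.
def fpu_throughput(pipeline_depth, independent_ops):
    """Operations completed per cycle: a dependent op must wait for the full
    pipeline latency unless enough independent ops hide it."""
    ops_per_group = independent_ops + 1
    cycles_per_group = max(ops_per_group, pipeline_depth)
    return ops_per_group / cycles_per_group

# Deeper pipelines only pay off while the routine exposes enough independent work.
for depth in (2, 4, 8, 16):
    print(depth, fpu_throughput(depth, independent_ops=4))
```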


2020
Vol 16 (1)
pp. 19-29
Author(s):  
Caterina Travan ◽  
Francesca Vatta ◽  
Fulvio Babich

The behaviour of a transmission channel may be simulated using the capabilities of current-generation multiprocessing hardware, namely a multicore Central Processing Unit (CPU), a general purpose Graphics Processing Unit (GPU), or a Field Programmable Gate Array (FPGA). These were investigated by Cullinan et al. in a paper published in 2012, where the capabilities of these three devices were compared to determine which device is best suited to which specific task. In particular, it was shown that, for the application that is the objective of our work (i.e., a transmission channel simulation), the FPGA is 26.67 times faster than the GPU and 10.76 times faster than the CPU. Motivated by these results, in this paper we propose and present a direct hardware emulation. In particular, a Cyclone II FPGA architecture is implemented to simulate the behaviour of a burst error channel, in which errors are clustered together, and of a burst erasure channel, in which erasures are clustered together. The results presented in the paper are valid for any FPGA architecture that may be considered for this purpose.
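As a software reference for the clustered-error behaviour being emulated, the sketch below implements a two-state Gilbert-Elliott-style burst error channel in Python; the state machine and its probabilities are illustrative assumptions, not the Cyclone II design.

```python
# Software reference sketch of a bursty channel (two-state Gilbert-Elliott model):
# errors are rare in the GOOD state and frequent in the BAD state, so they cluster.
import random

def burst_error_channel(bits, p_good_to_bad=0.01, p_bad_to_good=0.2,
                        err_good=0.001, err_bad=0.3, seed=0):
    """Flip bits with a low error rate in the GOOD state and a high rate in the
    BAD state, producing bursts of errors."""
    rng = random.Random(seed)
    bad = False
    out = []
    for b in bits:
        # State transition: enter the bad state rarely, leave it with prob p_bad_to_good.
        bad = (rng.random() < p_good_to_bad) if not bad else (rng.random() >= p_bad_to_good)
        p_err = err_bad if bad else err_good
        out.append(b ^ (rng.random() < p_err))
    return out

received = burst_error_channel([0] * 10_000)
print("errors:", sum(received))
```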


Author(s):  
Anand Venkat ◽  
Tharindu Rusira ◽  
Raj Barik ◽  
Mary Hall ◽  
Leonard Truong

Deep neural networks (DNNs) have demonstrated effectiveness in many domains, including object recognition, speech recognition, natural language processing, and health care. Typically, the computations involved in DNN training and inference are time consuming and require efficient implementations. Existing frameworks such as TensorFlow, Theano, Torch, Cognitive Toolkit (CNTK), and Caffe treat Graphics Processing Units (GPUs) as the status quo devices for DNN execution, leaving Central Processing Units (CPUs) behind. Moreover, existing frameworks forgo or limit cross-layer optimization opportunities that have the potential to improve performance by significantly reducing data movement through the memory hierarchy. In this article, we describe an alternative approach called SWIRL, a compiler that provides high-performance CPU implementations for DNNs. SWIRL is built on top of the existing domain-specific language (DSL) for DNNs called LATTE. SWIRL separates the DNN specification from its schedule using predefined transformation recipes for tensors and layers commonly found in DNNs. These recipes synergize with DSL constructs to generate high-quality fused, vectorized, and parallelized code for CPUs. On an Intel Xeon Platinum 8180M CPU, SWIRL achieves performance comparable with TensorFlow integrated with MKL-DNN: on average 1.00× of TensorFlow inference and 0.99× of TensorFlow training. It also outperforms the original LATTE compiler on average by 1.22× and 1.30× on inference and training, respectively.
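The NumPy sketch below conveys the cross-layer fusion idea in miniature: the fused variant applies bias and ReLU while each output tile is still in cache instead of materializing the full intermediate result. The tiling and names are illustrative assumptions, not SWIRL's generated code.

```python
# Conceptual sketch of layer fusion: avoid writing and re-reading full intermediates
# by applying bias and ReLU per output tile. Not SWIRL's actual code generation.
import numpy as np

def dense_unfused(x, w, b):
    y = x @ w                       # full intermediate written to memory
    y = y + b                       # re-read, add bias
    return np.maximum(y, 0.0)       # re-read again, apply ReLU

def dense_fused(x, w, b, tile=64):
    out = np.empty((x.shape[0], w.shape[1]), dtype=x.dtype)
    for j0 in range(0, w.shape[1], tile):
        j1 = min(j0 + tile, w.shape[1])
        tile_out = x @ w[:, j0:j1]                              # one output tile
        out[:, j0:j1] = np.maximum(tile_out + b[j0:j1], 0.0)    # bias + ReLU while hot
    return out

x = np.random.randn(32, 128).astype(np.float32)
w = np.random.randn(128, 256).astype(np.float32)
b = np.random.randn(256).astype(np.float32)
assert np.allclose(dense_unfused(x, w, b), dense_fused(x, w, b), atol=1e-4)
```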


Sensors
2019
Vol 19 (4)
pp. 831
Author(s):  
Lei Yan ◽  
Suzhi Cao ◽  
Yongsheng Gong ◽  
Hao Han ◽  
Junyong Wei ◽  
...  

As outlined in 3GPP Release 16, 5G satellite access is important for future 5G network development. A terrestrial-satellite network integrated with 5G has the characteristics of low delay, high bandwidth, and ubiquitous coverage. A few researchers have proposed integration schemes for such a network; however, these schemes do not consider the possibility of optimizing the delay characteristic by changing the computing mode of the 5G satellite network. We propose a 5G satellite edge computing framework (5GsatEC), which aims to reduce delay and expand network coverage. This framework consists of embedded hardware platforms and edge computing microservices in satellites. To increase the flexibility of the framework in complex scenarios, we unify the resource management of the central processing unit (CPU), graphics processing unit (GPU), and field-programmable gate array (FPGA), and we divide the services into three types: system services, basic services, and user services. In order to verify the performance of the framework, we carried out a series of experiments. The results show that 5GsatEC has broader coverage than the ground 5G network. The results also show that 5GsatEC has lower delay, a lower packet loss rate, and lower bandwidth consumption than the 5G satellite network.
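Purely as an illustration of the unified resource management and the three service classes the abstract names, here is a minimal Python registry; all field names and capacities are assumptions, not the 5GsatEC API.

```python
# Illustrative data structures only: a minimal registry that admits services onto a
# unified CPU/GPU/FPGA resource pool, grouped by the three service classes.
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    kind: str        # "system", "basic", or "user"
    resource: str    # "cpu", "gpu", or "fpga"
    units: int       # assumed abstract capacity units

class SatelliteEdgeNode:
    def __init__(self, cpu_units=8, gpu_units=2, fpga_units=1):
        self.free = {"cpu": cpu_units, "gpu": gpu_units, "fpga": fpga_units}
        self.running = []

    def deploy(self, service: Service) -> bool:
        """Admit a service only if its resource pool still has capacity."""
        if self.free[service.resource] >= service.units:
            self.free[service.resource] -= service.units
            self.running.append(service)
            return True
        return False

node = SatelliteEdgeNode()
node.deploy(Service("telemetry", "system", "cpu", 1))
node.deploy(Service("image-segmentation", "user", "gpu", 1))
```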


Micromachines
2021
Vol 12 (7)
pp. 838
Author(s):  
Dong Hyun Hwang ◽  
Chang Yeop Han ◽  
Hyun Woo Oh ◽  
Seung Eun Lee

Artificial intelligence algorithms need an external computing device such as a graphics processing unit (GPU) due to their computational complexity. To run artificial intelligence algorithms on embedded devices, many studies have proposed lightweight algorithms and dedicated artificial intelligence accelerators. In this paper, we propose the ASimOV framework, which optimizes artificial intelligence algorithms and generates Verilog hardware description language (HDL) code for executing them on a field programmable gate array (FPGA). To verify ASimOV, we explore the performance space of k-NN algorithms and generate Verilog HDL code to demonstrate a k-NN accelerator on an FPGA. Our contribution is an end-to-end pipeline in which an artificial intelligence algorithm is optimized for a specific dataset through simulation, and an artificial intelligence accelerator is generated at the end.
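As a simulation-side sketch of the design-space exploration step (the Verilog generation itself is not reproduced), the snippet below sweeps k for a NumPy k-NN classifier on a toy dataset; the dataset and distance metric are assumptions for illustration.

```python
# Simulation-side sketch only: sweeping k for a NumPy k-NN classifier on a toy
# dataset, the kind of exploration performed before committing to an accelerator.
import numpy as np

def knn_predict(train_x, train_y, query_x, k):
    """Majority vote over the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(train_x[None, :, :] - query_x[:, None, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_y[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])

rng = np.random.default_rng(0)
train_x = rng.normal(size=(200, 8)); train_y = (train_x[:, 0] > 0).astype(int)
test_x = rng.normal(size=(50, 8));   test_y = (test_x[:, 0] > 0).astype(int)

for k in (1, 3, 5, 9):
    acc = np.mean(knn_predict(train_x, train_y, test_x, k) == test_y)
    print(f"k={k}: accuracy={acc:.2f}")
```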

