A Collaborative CPU Vector Offloader: Putting Idle Vector Resources to Work on Commodity Processors

Electronics ◽  
2021 ◽  
Vol 10 (23) ◽  
pp. 2960
Author(s):  
Youngbin Son ◽  
Seokwon Kang ◽  
Hongjun Um ◽  
Seokho Lee ◽  
Jonghyun Ham ◽  
...  

Most modern processors contain a vector accelerator or internal vector units for the fast computation of large target workloads. However, accelerating applications using vector units is difficult because the underlying data parallelism must be exposed explicitly using vector-specific instructions. Consequently, vector units are often underutilized or remain idle because of the challenges of vector code generation. To solve this underutilization problem, we propose the Vector Offloader for executing scalar programs, which treats the vector unit as a scalar operation unit. By using vector masking, an appropriate partition of the vector unit can be utilized to support scalar instructions. To efficiently utilize all execution units, including the vector unit, the Vector Offloader runs the target application concurrently on both the central processing unit (CPU) and the decoupled vector units, offloading parts of the program to the vector unit. Furthermore, a profile-guided optimization technique is employed to determine the optimal offloading ratio for balancing the load between the CPU and the vector unit. We implemented the Vector Offloader on a RISC-V infrastructure with a Hwacha vector unit and evaluated its performance using the PolyBench benchmark set. Experimental results showed that the proposed technique achieved performance improvements of up to 1.31× over simple, CPU-only execution in a field programmable gate array (FPGA)-level evaluation.
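The profile-guided load-balancing step can be illustrated with a small sketch. Assuming per-unit throughputs measured in a profiling run (the names `cpu_rate` and `vec_rate` are illustrative, not from the paper), the offloading ratio that makes both units finish at the same time follows directly:

```python
def balance_offload(total_work, cpu_rate, vec_rate):
    """Split work so the CPU and the vector unit finish together.

    cpu_rate / vec_rate: measured throughputs (items/s) from a
    profiling run; illustrative names, not the paper's API.
    """
    # Offloading ratio r sends r*total_work to the vector unit;
    # equal finish times require r/vec_rate == (1 - r)/cpu_rate.
    r = vec_rate / (cpu_rate + vec_rate)
    vec_items = round(total_work * r)
    return total_work - vec_items, vec_items

# e.g. a vector unit profiled at 3x the CPU's throughput
cpu_items, vec_items = balance_offload(1000, cpu_rate=200.0, vec_rate=600.0)
```

With a 3:1 throughput ratio, three quarters of the work is offloaded and the two units idle-wait on each other as little as possible.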

2020 ◽  
Vol 8 ◽  
Author(s):  
Daniel Enériz Orta ◽  
Nicolás Medrano Marqués ◽  
Belén Calvo López

The ability to approximate nonlinear functions makes neural networks one of the most widely used tools for sensor fusion, allowing the outputs of different sensors to be combined to obtain information that is not available a priori. Moreover, the parallel processing capability of FPGAs (Field-Programmable Gate Arrays) makes them well suited to implementing ubiquitous neural networks, allowing results to be inferred faster than on a CPU (Central Processing Unit) without requiring an active internet connection. Accordingly, this article proposes a workflow for designing, training, and implementing a neural network on a Xilinx PYNQ Z2 FPGA that uses fixed-point data types to perform sensor fusion. The workflow is validated by developing a neural network that combines the outputs of a 16-sensor artificial nose to estimate the concentrations of CH4 and C2H4.
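A minimal sketch of the fixed-point conversion such a workflow relies on. The Q8.8 format (16-bit word, 8 fractional bits) is chosen here purely for illustration; the paper's actual word lengths may differ:

```python
def to_fixed(x, frac_bits=8, total_bits=16):
    """Quantize a float to signed fixed-point with saturation."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))          # most negative code
    hi = (1 << (total_bits - 1)) - 1       # most positive code
    return max(lo, min(hi, round(x * scale)))

def from_fixed(q, frac_bits=8):
    """Recover the real value a fixed-point code represents."""
    return q / (1 << frac_bits)
```

Values representable in the format round-trip exactly (1.5 becomes code 384 and back), while out-of-range weights saturate; the choice of `frac_bits` trades range against precision.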


Electronics ◽  
2019 ◽  
Vol 8 (8) ◽  
pp. 866 ◽  
Author(s):  
Heoncheol Lee ◽  
Kipyo Kim

This paper addresses the real-time optimization problem of finding the most efficient and reliable message chain structure in data communications based on half-duplex command–response protocols such as MIL-STD-1553B communication systems. This paper proposes a real-time Monte Carlo optimization method implemented on field programmable gate arrays (FPGA), which not only can be conducted very quickly but also avoids conflicts with other tasks on a central processing unit (CPU). Evaluation results showed that the proposed method can consistently find the optimal message chain structure within a small, deterministic time, much faster than the conventional Monte Carlo optimization method on a CPU.
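The CPU baseline being outpaced here can be sketched as plain random search over candidate chain structures; function and parameter names are illustrative, not the paper's:

```python
import random

def monte_carlo_optimize(evaluate, candidates, trials=1000, seed=0):
    """Randomly sample candidates and keep the lowest-cost one.

    `evaluate` returns a cost (e.g. total chain latency); lower is
    better. The paper's FPGA version runs such trials in hardware,
    in parallel with whatever the CPU is doing.
    """
    rng = random.Random(seed)
    return min((rng.choice(candidates) for _ in range(trials)),
               key=evaluate)
```

With enough trials the sampled minimum converges on the true optimum; the FPGA implementation makes the per-trial cost small and, crucially, deterministic.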


2019 ◽  
Vol 20 (10) ◽  
pp. 1037-1046 ◽  
Author(s):  
Paul Mentink ◽  
Daniel Escobar-Valdivieso ◽  
Alexandru Forrai ◽  
Xander Seykens ◽  
Frank Willems

Motivated by automotive emission legislation, a Virtual [Formula: see text] sensor is developed. This virtual sensor consists of a real-time, phenomenological model that computes engine-out [Formula: see text] using the measured in-cylinder pressure signal from a single cylinder as its main input. The implementation is made on a Field Programmable Gate Array–Central Processing Unit architecture to ensure the [Formula: see text] computation is ready at the end of the combustion cycle. The Virtual [Formula: see text] sensor is tested and validated on a EURO-VI Heavy-Duty Diesel engine platform. It is proven to match the accuracy of a production [Formula: see text] sensor under steady-state conditions and has a better frequency response than the production [Formula: see text] sensor.


2016 ◽  
Vol 2016 ◽  
pp. 1-23 ◽  
Author(s):  
Rostam Affendi Hamzah ◽  
Haidi Ibrahim

This paper presents a literature survey on existing disparity map algorithms. It focuses on the four main stages of processing proposed by Scharstein and Szeliski in their 2002 taxonomy and evaluation of dense two-frame stereo correspondence algorithms. To assist future researchers in developing their own stereo matching algorithms, a summary of the existing algorithms developed for every stage of processing is also provided. The survey also notes the implementation of previous software-based and hardware-based algorithms. Generally, the main processing module for a software-based implementation uses only a central processing unit. By contrast, a hardware-based implementation requires one or more additional processors for its processing module, such as a graphics processing unit or a field programmable gate array. This literature survey also presents a method of qualitative measurement that is widely used by researchers in the area of stereo vision disparity mapping.
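As a toy instance of the taxonomy's matching-cost and optimization stages, here is a per-pixel winner-take-all matcher on 1-D scanlines. It is illustrative only: real algorithms aggregate costs over 2-D windows and apply disparity refinement afterwards.

```python
def sad_disparity(left, right, max_disp=4):
    """Winner-take-all matcher on two grayscale scanlines.

    For each left-image pixel x, picks the disparity d that
    minimizes the absolute-difference cost |left[x] - right[x-d]|.
    """
    disp = []
    for x in range(len(left)):
        costs = [(abs(left[x] - right[x - d]), d)
                 for d in range(min(max_disp, x) + 1)]  # stay in bounds
        disp.append(min(costs)[1])                      # lowest cost wins
    return disp
```

A bright feature shifted by two pixels between the views is recovered as disparity 2 at that pixel; everywhere the scene is textureless, winner-take-all degenerates, which is exactly why the taxonomy's cost-aggregation stage exists.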


Author(s):  
Mini P. Varghese ◽  
A. Manjunatha ◽  
T. V. Snehaprabha

In the current digital environment, central processing units (CPUs), field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and peripherals are growing progressively complex. On motherboards in many areas of computing, from laptops and tablets to servers and Ethernet switches, multiphase buck regulators are increasingly common because of the higher power requirements. This study describes a four-phase buck converter with a phase shedding scheme that can be used to power processors in programmable logic controllers (PLCs). The proposed power supply is designed to generate a regulated voltage with minimal ripple. Because of the suggested phase shedding method, this power supply also offers better light-load efficiency. For this objective, a multiphase system with phase shedding is modeled in MATLAB Simulink, and the findings are validated.
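The phase-shedding decision itself can be sketched as a simple threshold rule. The per-phase current limit below is an assumed value for illustration, not a figure from the study:

```python
import math

def active_phases(load_a, per_phase_limit_a=10.0, total_phases=4):
    """Pick how many converter phases to keep switching at a given load.

    Shedding idle phases removes their switching losses, which is why
    light-load efficiency improves; at least one phase always stays on
    to keep the output regulated.
    """
    needed = max(1, math.ceil(load_a / per_phase_limit_a))
    return min(total_phases, needed)
```

At a 3 A load only one phase switches; as the load approaches the full rating, all four phases share the current to keep conduction losses and ripple down.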


This work describes the implementation of a control unit, an important part of the Central Processing Unit (CPU), on a Field Programmable Gate Array (FPGA). A frequency-scaled, thermal-aware, energy-efficient control unit is designed on a 28-nanometer (nm) technology FPGA. The frequency is varied from 100 MHz to 5 GHz, and a rise in frequency also raises the power consumption of the control unit on the FPGA. The thermal output of the FPGA likewise increases with frequency. The experiment is carried out in the Xilinx 14.1 ISE Design Suite, and it is observed that the lower the frequency, the lower the power consumption of the FPGA.
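The observed frequency–power trend follows the standard CMOS dynamic-power relation P = a·C·V²·f; the capacitance, voltage, and activity values below are illustrative, not measurements from this work:

```python
def dynamic_power(c_eff_f, v_dd, freq_hz, activity=1.0):
    """CMOS dynamic power: activity * C_eff * Vdd^2 * f.

    Power scales linearly with clock frequency, which is the trend
    the FPGA measurements in the text exhibit.
    """
    return activity * c_eff_f * v_dd ** 2 * freq_hz

p_low = dynamic_power(1e-9, 1.0, 100e6)   # 100 MHz operating point
p_high = dynamic_power(1e-9, 1.0, 5e9)    # 5 GHz operating point
```

Sweeping from 100 MHz to 5 GHz at fixed voltage multiplies dynamic power by 50, consistent with the observation that the lowest frequency gives the lowest consumption.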


Electronics ◽  
2019 ◽  
Vol 8 (3) ◽  
pp. 295 ◽  
Author(s):  
Min Zhang ◽  
Linpeng Li ◽  
Hai Wang ◽  
Yan Liu ◽  
Hongbo Qin ◽  
...  

Field programmable gate array (FPGA) is widely considered a promising platform for convolutional neural network (CNN) acceleration. However, the large number of parameters of CNNs causes heavy computing and memory burdens for FPGA-based CNN implementation. To solve this problem, this paper proposes an optimized compression strategy and realizes an FPGA-based accelerator for CNNs. Firstly, a reversed-pruning strategy is proposed which reduces the number of parameters of AlexNet by a factor of 13× without accuracy loss on the ImageNet dataset. Peak-pruning is further introduced to achieve better compressibility. Moreover, quantization gives another 4× with negligible loss of accuracy. Secondly, efficient storage techniques, aimed at reducing the overall cache overhead of the convolutional layer and the fully connected layer respectively, are presented. Finally, the effectiveness of the proposed strategy is verified by an accelerator implemented on a Xilinx ZCU104 evaluation board. By improving existing pruning techniques and the storage format of sparse data, we significantly reduce the size of AlexNet by 28×, from 243 MB to 8.7 MB. In addition, the overall performance of our accelerator achieves 9.73 fps for the compressed AlexNet. Compared with the central processing unit (CPU) and graphics processing unit (GPU) platforms, our implementation achieves 182.3× and 1.1× improvements in latency and throughput, respectively, on the convolutional (CONV) layers of AlexNet, with 822.0× and 15.8× improvements in energy efficiency, respectively. This novel compression strategy provides a reference for other neural network applications, including CNNs, long short-term memory (LSTM), and recurrent neural networks (RNNs).
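A generic sketch of the two compression steps the paper combines: magnitude pruning followed by uniform quantization. The paper's reversed-pruning and peak-pruning variants differ in how the kept weight set is chosen; this sketch keeps the largest magnitudes, the textbook baseline:

```python
def prune_and_quantize(weights, keep_ratio=0.25, levels=16):
    """Zero out small weights, then snap survivors to a uniform grid.

    `weights` is a flat list of floats; `keep_ratio` is the fraction
    of weights retained, `levels` the number of quantization codes.
    """
    k = max(1, int(len(weights) * keep_ratio))
    thresh = sorted(abs(w) for w in weights)[-k]   # k-th largest magnitude
    pruned = [w if abs(w) >= thresh else 0.0 for w in weights]
    m = max(abs(w) for w in pruned) or 1.0         # symmetric range [-m, m]
    step = 2 * m / (levels - 1)
    return [round(w / step) * step for w in pruned]
```

Pruning makes the weight tensor sparse (cheap to store with a sparse format, as the paper does), and quantization shrinks each surviving weight to a few bits; the two multiply into the overall compression factor.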


2020 ◽  
Vol 16 (1) ◽  
pp. 19-29
Author(s):  
Caterina Travan ◽  
Francesca Vatta ◽  
Fulvio Babich

The behaviour of a transmission channel may be simulated using the performance abilities of current-generation multiprocessing hardware, namely a multicore Central Processing Unit (CPU), a general-purpose Graphics Processing Unit (GPU), or a Field Programmable Gate Array (FPGA). These were investigated by Cullinan et al. in a paper published in 2012, where the three devices' capabilities were compared to determine which device is best suited to which specific task. In particular, it was shown that, for the application that is the objective of our work (i.e., transmission channel simulation), the FPGA is 26.67 times faster than the GPU and 10.76 times faster than the CPU. Motivated by these results, in this paper we propose and present a direct hardware emulation. In particular, a Cyclone II FPGA architecture is implemented to simulate a burst error channel, in which errors are clustered together, and a burst erasure channel, in which erasures are clustered together. The results presented in the paper are valid for any FPGA architecture that may be considered for this purpose.
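The burst-error behaviour being emulated is commonly modeled with a two-state Gilbert channel; a software sketch with illustrative transition probabilities (not the paper's parameters) shows where the clustering comes from:

```python
import random

def gilbert_errors(n, p_gb=0.05, p_bg=0.3, p_err_bad=0.5, seed=1):
    """Two-state Gilbert model producing clustered (burst) errors.

    The 'good' state is error-free; in the 'bad' state each bit errs
    with probability p_err_bad. Because leaving the bad state is
    unlikely per step (p_bg), errors arrive in bursts. The paper
    realizes such behaviour directly in FPGA logic.
    """
    rng = random.Random(seed)
    bad, out = False, []
    for _ in range(n):
        out.append(1 if bad and rng.random() < p_err_bad else 0)
        # State transition: good -> bad with p_gb, bad -> good with p_bg.
        bad = (rng.random() < p_gb) if not bad else (rng.random() >= p_bg)
    return out
```

A burst erasure channel is the same machine with erasures instead of bit flips in the bad state.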


Author(s):  
Anand Venkat ◽  
Tharindu Rusira ◽  
Raj Barik ◽  
Mary Hall ◽  
Leonard Truong

Deep neural networks (DNNs) have demonstrated effectiveness in many domains including object recognition, speech recognition, natural language processing, and health care. Typically, the computations involved in DNN training and inference are time consuming and require efficient implementations. Existing frameworks such as TensorFlow, Theano, Torch, Cognitive Toolkit (CNTK), and Caffe treat Graphics Processing Units (GPUs) as the status quo devices for DNN execution, leaving Central Processing Units (CPUs) behind. Moreover, existing frameworks forgo or limit cross-layer optimization opportunities that have the potential to improve performance by significantly reducing data movement through the memory hierarchy. In this article, we describe an alternative approach called SWIRL, a compiler that provides high-performance CPU implementations for DNNs. SWIRL is built on top of the existing domain-specific language (DSL) for DNNs called LATTE. SWIRL separates the DNN specification from its schedule using predefined transformation recipes for tensors and layers commonly found in DNNs. These recipes synergize with DSL constructs to generate high-quality fused, vectorized, and parallelized code for CPUs. On an Intel Xeon Platinum 8180M CPU, SWIRL achieves performance comparable with TensorFlow integrated with MKL-DNN: on average 1.00× of TensorFlow inference and 0.99× of TensorFlow training. It also outperforms the original LATTE compiler on average by 1.22× and 1.30× on inference and training, respectively.


Sensors ◽  
2019 ◽  
Vol 19 (4) ◽  
pp. 831 ◽  
Author(s):  
Lei Yan ◽  
Suzhi Cao ◽  
Yongsheng Gong ◽  
Hao Han ◽  
Junyong Wei ◽  
...  

As outlined in 3GPP Release 16, 5G satellite access is important for future 5G network development. A terrestrial-satellite network integrated with 5G has the characteristics of low delay, high bandwidth, and ubiquitous coverage. A few researchers have proposed integration schemes for such a network; however, these schemes do not consider the possibility of optimizing the delay characteristic by changing the computing mode of the 5G satellite network. We propose a 5G satellite edge computing framework (5GsatEC), which aims to reduce delay and expand network coverage. This framework consists of embedded hardware platforms and edge computing microservices in satellites. To increase the flexibility of the framework in complex scenarios, we unify the resource management of the central processing unit (CPU), graphics processing unit (GPU), and field-programmable gate array (FPGA), and we divide the services into three types: system services, basic services, and user services. To verify the performance of the framework, we carried out a series of experiments. The results show that 5GsatEC has broader coverage than the terrestrial 5G network, as well as lower delay, a lower packet loss rate, and lower bandwidth consumption than the 5G satellite network.

