Algorithm-hardware Co-design of Attention Mechanism on FPGA Devices

Xinyi Zhang; Yawen Wu; Peipei Zhou; Xulong Tang; Jingtong Hu

doi:10.1145/3477002

ACM Transactions on Embedded Computing Systems ◽

10.1145/3477002 ◽

2021 ◽

Vol 20 (5s) ◽

pp. 1-24

Author(s):

Xinyi Zhang ◽

Yawen Wu ◽

Peipei Zhou ◽

Xulong Tang ◽

Jingtong Hu

Keyword(s):

Maximum Energy ◽

Attention Mechanism ◽

Model Parameters ◽

Sequential Data ◽

Buffer Allocation ◽

Model Compression ◽

Field Programmable ◽

Model Size ◽

Run Time ◽

On Chip

Multi-head self-attention (the attention mechanism) has been employed in a variety of fields, such as machine translation, language modeling, and image processing, owing to its strength in feature extraction and sequential data analysis. This strength stems from the large number of parameters and the sophisticated model architecture behind the attention mechanism. To deploy the attention mechanism efficiently on resource-constrained devices, existing works propose reducing the model size either by building a customized smaller model or by compressing a large standard model. A customized smaller model is usually optimized for a specific task and requires effort to explore model parameters. Model compression reduces model size without hurting the robustness of the model architecture, so it can be applied efficiently to different tasks. The compressed weights in the model are usually regularly shaped (e.g., rectangular), but their dimensions vary (e.g., the rectangles differ in height and width). Such a compressed attention mechanism can be deployed efficiently on CPU/GPU platforms because their memory and computing resources can be assigned flexibly on demand. However, for Field Programmable Gate Arrays (FPGAs), the data buffer allocation and computing kernel are fixed at run time to achieve maximum energy efficiency. After compression, the weights are much smaller and differ in size, which leads to inefficient utilization of the FPGA on-chip buffer. Moreover, the differing weight heights and widths may lead to inefficient execution of the FPGA computing kernel. Because the attention mechanism has a large number of weights, building a unique buffer and computing kernel for each compressed weight on the FPGA is not feasible. In this work, we jointly consider the impact of compression on buffer allocation and on the required computing kernel while compressing the attention mechanism. We propose a novel structural pruning method with memory footprint awareness and design the associated FPGA accelerator. The experimental results show that our work can compress the Transformer (an attention-mechanism-based model) by 95x. The developed accelerator can fully utilize the FPGA resources, processing the sparse attention mechanism with a run-time throughput of 1.87 Tops on the ZCU102 FPGA.
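To make the buffer argument in this abstract concrete, the following is a minimal sketch of block-wise structured pruning, in which every surviving weight block has the same shape and could therefore share one fixed-size buffer. It is not the authors' method: the function name `prune_blocks`, the L2-norm ranking criterion, and the toy matrix are all assumptions made for illustration.

```python
import math

def prune_blocks(weight, block_rows, block_cols, keep):
    """Zero all but the `keep` blocks with the largest L2 norm.

    Hypothetical helper: every surviving block has the same
    (block_rows x block_cols) shape, so one buffer size fits all.
    """
    rows, cols = len(weight), len(weight[0])
    assert rows % block_rows == 0 and cols % block_cols == 0
    norms = []
    for r in range(0, rows, block_rows):
        for c in range(0, cols, block_cols):
            sq = sum(weight[i][j] ** 2
                     for i in range(r, r + block_rows)
                     for j in range(c, c + block_cols))
            norms.append((math.sqrt(sq), r, c))
    # Rank blocks by norm and remember the top-left corner of each kept block.
    norms.sort(key=lambda t: t[0], reverse=True)
    kept = sorted((r, c) for _, r, c in norms[:keep])
    kept_set = set(kept)
    pruned = [
        [weight[i][j]
         if (i - i % block_rows, j - j % block_cols) in kept_set else 0.0
         for j in range(cols)]
        for i in range(rows)
    ]
    return pruned, kept

# Toy 4x4 weight matrix (made-up values), pruned to two 2x2 blocks.
weight = [
    [0.9, -0.8, 0.01, 0.02],
    [0.7, 0.6, -0.03, 0.01],
    [0.02, 0.01, 0.5, -0.4],
    [-0.01, 0.03, 0.3, 0.6],
]
pruned, kept = prune_blocks(weight, 2, 2, keep=2)
```

Because only block corners need to be stored alongside uniformly shaped blocks, a fixed kernel can iterate over the kept blocks without per-weight buffer sizing, which is the kind of regularity the abstract argues FPGAs need.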

Reconfigurable field‐programmable gate array‐based on‐chip learning neuromorphic digital implementation for nonlinear function approximation

International Journal of Circuit Theory and Applications ◽

10.1002/cta.3075 ◽

2021 ◽

Author(s):

Morteza Gholami ◽

Edris Zaman Farsa ◽

Gholamreza Karimi

Keyword(s):

Field Programmable Gate Array ◽

Function Approximation ◽

Nonlinear Function ◽

Digital Implementation ◽

Field Programmable ◽

Gate Array ◽

On Chip ◽

Nonlinear Function Approximation

Harmonious Coexistence of Structured Weight Pruning and Ternarization for Deep Neural Networks

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6138 ◽

2020 ◽

Vol 34 (04) ◽

pp. 6623-6630

Author(s):

Li Yang ◽

Zhezhi He ◽

Deliang Fan

Keyword(s):

Embedded System ◽

Processing Elements ◽

Computing Platform ◽

Model Compression ◽

Computing Unit ◽

Resource Limited ◽

Adopted Model ◽

Weight Penalty ◽

Model Size ◽

Weight Pruning

Deep convolutional neural networks (DNNs) have demonstrated phenomenal success and are widely used in many computer vision tasks. However, their enormous model size and high computing complexity prohibit wide deployment on resource-limited embedded systems such as FPGAs and mGPUs. As the two most widely adopted model compression techniques, weight pruning and quantization compress a DNN model by introducing weight sparsity (i.e., forcing some weights to zero) and by quantizing weights into limited bit-width values, respectively. Although some works attempt to combine weight pruning and quantization, we still observe disharmony between the two, especially when more aggressive compression schemes (e.g., structured pruning and low bit-width quantization) are used. In this work, taking the FPGA as the test computing platform and Processing Elements (PEs) as the basic parallel computing unit, we first propose a PE-wise structured pruning scheme, which introduces weight sparsification that takes the PE architecture into account. In addition, we integrate it with an optimized weight ternarization approach that quantizes weights into ternary values ({-1,0,+1}), thus converting the dominant convolution operations in the DNN from multiplication-and-accumulation (MAC) to addition-only, as well as compressing the original model (from 32-bit floating point to 2-bit ternary representation) by at least 16 times. Then, we investigate and solve the coexistence issue between PE-wise structured pruning and ternarization by proposing a Weight Penalty Clipping (WPC) technique with a self-adapting threshold. Our experiments show that the fusion of our proposed techniques achieves a state-of-the-art ∼21× PE-wise structured compression rate with merely 1.74%/0.94% (top-1/top-5) accuracy degradation for ResNet-18 on the ImageNet dataset.
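The two quantitative claims in this abstract, that ternary weights turn multiply-accumulate into addition only and that moving from 32-bit to 2-bit storage gives a 16-fold reduction, can be illustrated with a short sketch. The threshold rule used here (0.7 times the mean absolute weight) is a commonly used heuristic and an assumption, not the paper's optimized ternarization or its WPC technique; the per-layer scaling factor that real schemes apply is omitted, and the function names are hypothetical.

```python
def ternarize(weights, scale=0.7):
    """Map each weight to -1, 0, or +1.

    Assumed heuristic: weights whose magnitude is at most
    scale * mean(|w|) become 0; the rest keep only their sign.
    """
    threshold = scale * sum(abs(w) for w in weights) / len(weights)
    return [0 if abs(w) <= threshold else (1 if w > 0 else -1)
            for w in weights]

def ternary_dot(ternary, activations):
    """Dot product with ternary weights: additions and subtractions only."""
    acc = 0.0
    for t, a in zip(ternary, activations):
        if t == 1:
            acc += a
        elif t == -1:
            acc -= a
        # t == 0: the term is skipped entirely (this is the sparsity).
    return acc

weights = [0.8, -0.05, -0.6, 0.1, 0.02, -0.9]      # made-up example values
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ternary = ternarize(weights)
result = ternary_dot(ternary, activations)

# Storage ratio from bit widths alone: 32-bit float versus 2-bit ternary code.
compression_ratio = 32 // 2
```

The bit-width ratio accounts only for the quantization step; any further reduction from pruning comes on top of it, which is consistent with the abstract's "at least 16 times" wording.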

Comparative analysis of soft and hard on-chip interconnects for field-programmable gate arrays

IET Computers & Digital Techniques ◽

10.1049/iet-cdt.2011.0169 ◽

2012 ◽

Vol 6 (6) ◽

pp. 396-405 ◽

Cited By ~ 2

Author(s):

J.Y. Hur ◽

M.A. Wahlah ◽

L. Mhamdi ◽

K. Goossens

Keyword(s):

Comparative Analysis ◽

Field Programmable Gate Arrays ◽

Gate Arrays ◽

Field Programmable ◽

Programmable Gate Arrays ◽

On Chip

A System-On-Chip Approach in Designing a Dedicated RISC Microcontroller Unit Using the Field-Programmable Gate Array

2010 Fifth International Conference on Systems ◽

10.1109/icons.2010.40 ◽

2010 ◽

Author(s):

Elena Roxana Buhus ◽

Alexandru Lazar ◽

Adriano Tavares

Keyword(s):

Field Programmable Gate Array ◽

System On Chip ◽

Field Programmable ◽

Gate Array ◽

On Chip ◽

Microcontroller Unit

AC_ICAP: A Flexible High Speed ICAP Controller

International Journal of Reconfigurable Computing ◽

10.1155/2015/314358 ◽

2015 ◽

Vol 2015 ◽

pp. 1-15 ◽

Cited By ~ 6

Author(s):

Luis Andres Cardona ◽

Carles Ferrer

Keyword(s):

High Speed ◽

Access Port ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Speed Up ◽

Run Time ◽

IP Cores

EFFICIENT QRS COMPLEX DETECTION ALGORITHM IMPLEMENTATION ON SOC-BASED EMBEDDED SYSTEM

Jurnal Teknologi ◽

10.11113/jt.v78.9450 ◽

2016 ◽

Vol 78 (7-5) ◽

Cited By ~ 1

Author(s):

Muhammad Amin Hashim ◽

Yuan Wen Hau ◽

Rabia Baktheri

Keyword(s):

Embedded System ◽

Detection Algorithm ◽

Detection Accuracy ◽

QRS Complex ◽

QRS Detection ◽

QRS Complex Detection ◽

Moving Windows ◽

Field Programmable ◽

Complex Detection ◽

On Chip

This paper studies two Electrocardiography (ECG) preprocessing algorithms, namely the Pan and Tompkins (PT) algorithm and the Derivative Based (DB) algorithm, which are crucial for QRS complex detection in cardiovascular disease detection. The two algorithms are compared in terms of QRS detection accuracy and computation time, implemented as embedded software on a System-on-Chip (SoC) based embedded system prototyped on the Altera DE2-115 Field Programmable Gate Array (FPGA) platform. Both algorithms are tested with 30 minutes of ECG data from each of 48 different patient records obtained from the MIT-BIH arrhythmia database. Results show that the PT algorithm achieves 98.15% accuracy with 56.33 seconds of computation, while the DB algorithm achieves 96.74% with only 22.14 seconds of processing time. Based on the study, an optimized PT algorithm with an improved Moving Windows Integrator (MWI) is proposed to accelerate its computation. Results show that the proposed optimized Moving Windows Integrator algorithm achieves a 9.5-times speedup over the original MWI while retaining its QRS detection accuracy.
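The abstract does not say how the Moving Windows Integrator was optimized. As an illustration only, the sketch below contrasts a naive moving-window integrator, which re-sums the whole window for every sample, with a running-sum variant that does constant work per sample. The running-sum approach is an assumption about one common way to speed up such a filter, not the paper's technique, and both function names are hypothetical.

```python
def mwi_naive(signal, window):
    """Re-sum the last `window` samples for every output: O(N * window)."""
    out = []
    for n in range(len(signal)):
        start = max(0, n - window + 1)
        out.append(sum(signal[start:n + 1]) / window)
    return out

def mwi_running(signal, window):
    """Keep a running sum; add the new sample, drop the oldest: O(N)."""
    out = []
    running = 0
    for n, x in enumerate(signal):
        running += x
        if n >= window:
            running -= signal[n - window]  # sample that just left the window
        out.append(running / window)
    return out

# Made-up integer samples standing in for a squared ECG derivative.
signal = [0, 1, 4, 9, 4, 1, 0, 0]
naive = mwi_naive(signal, 3)
fast = mwi_running(signal, 3)
```

Both versions produce the same output on integer samples; with floating-point input the running sum can accumulate small rounding differences, which a real implementation would need to consider.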

Argus CNN Accelerator Based on Kernel Clustering and Resource-Aware Pruning

Elektronika ir Elektrotechnika ◽

10.5755/j02.eie.28922 ◽

2021 ◽

Vol 27 (3) ◽

pp. 57-70

Author(s):

Damjan M. Rakanovic ◽

Vuk Vranjkovic ◽

Rastislav J. R. Struharik

Keyword(s):

Digital Signal Processor ◽

State Of The Art ◽

Digital Signal ◽

Pruning Algorithm ◽

Kernel Clustering ◽

Field Programmable ◽

Comparable Performance ◽

On Chip ◽

Resource Characteristics ◽

Resource Aware

This paper proposes a two-step Convolutional Neural Network (CNN) pruning algorithm and a resource-efficient Field-Programmable Gate Array (FPGA) CNN accelerator named “Argus”. The proposed CNN pruning algorithm first combines similar kernels into clusters, which are then pruned using the same regular pruning pattern. The pruning algorithm is carefully tailored for FPGAs, considering their resource characteristics. Regular sparsity results in high Multiply-Accumulate (MAC) efficiency, reducing the amount of logic required to balance workloads among different MAC units. As a result, the Argus accelerator requires about 170 Look-Up Tables (LUTs) per Digital Signal Processor (DSP) block. This number is close to the average LUT/DSP ratio for various FPGA families, enabling balanced resource utilization when implementing Argus. Benchmarks conducted using a Xilinx Zynq UltraScale+ Multi-Processor System-on-Chip (MPSoC) indicate that Argus achieves up to 25 times higher frames per second than NullHop, 2 and 2.5 times higher than NEURAghe and Snowflake, respectively, and 2 times higher than NVDLA. Argus shows comparable performance to MIT’s Eyeriss v2 and Caffeine, requiring up to 3 times less memory bandwidth and utilizing 4 times fewer DSP blocks, respectively. Besides the absolute performance, Argus has at least 1.3 and 2 times better GOP/s/DSP and GOP/s/Block-RAM (BRAM) ratios, while being competitive in terms of GOP/s/LUT, compared to some of the state-of-the-art solutions.
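The second step described in this abstract, pruning every kernel in a cluster with the same regular pattern, might look roughly like the sketch below. It assumes the clustering step has already grouped the kernels, and it picks the shared pattern by summed magnitude per position; both the criterion and the names `shared_mask` and `apply_mask` are illustrative assumptions rather than the Argus algorithm.

```python
def shared_mask(cluster, keep):
    """Build one 0/1 mask for a whole cluster of flattened kernels.

    Assumed criterion: keep the `keep` positions whose absolute values,
    summed over all kernels in the cluster, are largest.
    """
    size = len(cluster[0])
    importance = [sum(abs(k[i]) for k in cluster) for i in range(size)]
    top = sorted(range(size), key=lambda i: importance[i], reverse=True)[:keep]
    keep_set = set(top)
    return [1 if i in keep_set else 0 for i in range(size)]

def apply_mask(kernel, mask):
    """Zero the pruned positions of one kernel."""
    return [w * m for w, m in zip(kernel, mask)]

# Two similar 3x3 kernels, flattened row by row (made-up values),
# assumed to have been placed in the same cluster already.
cluster = [
    [0.9, 0.1, -0.8, 0.0, 0.2, 0.05, 0.7, -0.1, 0.0],
    [0.8, 0.0, -0.9, 0.1, 0.1, 0.00, 0.6, -0.2, 0.1],
]
mask = shared_mask(cluster, keep=3)
pruned = [apply_mask(k, mask) for k in cluster]
```

Since every kernel in the cluster has non-zeros at the same positions, each MAC unit would do the same amount of work per kernel, which is the workload-balancing benefit the abstract attributes to regular sparsity.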

An Efficient FPGA-Based Convolutional Neural Network for Classification: Ad-MobileNet

Electronics ◽

10.3390/electronics10182272 ◽

2021 ◽

Vol 10 (18) ◽

pp. 2272

Author(s):

Safa Bouguezzi ◽

Hana Ben Fredj ◽

Tarek Belabed ◽

Carlos Valderrama ◽

Hassene Faiedh ◽

...

Keyword(s):

Recognition Rate ◽

Hardware Acceleration ◽

Implementation Model ◽

Gate Arrays ◽

Proposed Model ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Computer Vision Applications ◽

On Chip ◽

Segmentation Image

Convolutional Neural Networks (CNNs) continue to dominate research on hardware acceleration using Field Programmable Gate Arrays (FPGAs), proving their effectiveness in a variety of computer vision applications such as object segmentation, image classification, face detection, and traffic sign recognition, among others. However, there are numerous constraints on deploying CNNs on FPGAs, including limited on-chip memory, CNN size, and configuration parameters. This paper introduces Ad-MobileNet, an advanced CNN model inspired by the baseline MobileNet model. The proposed model uses an Ad-depth engine, which is an improved version of the depth-wise separable convolution unit. Moreover, we propose an FPGA-based implementation model that supports the Mish, TanhExp, and ReLU activation functions. The experimental results on the CIFAR-10 dataset show that our Ad-MobileNet achieves a classification accuracy of 88.76% while requiring few computational hardware resources. Compared to state-of-the-art methods, our proposed method has a fairly high recognition rate while using fewer computational hardware resources. Indeed, the proposed model reduces hardware resources by more than 41% compared to the baseline model.
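For reference, the three activation functions named in this abstract can be written in a few lines using their standard definitions: ReLU(x) = max(0, x), Mish(x) = x · tanh(softplus(x)), and TanhExp(x) = x · tanh(exp(x)). This is a floating-point software sketch, not the paper's FPGA implementation; the numerically stable softplus and the clamp on the exponent are added assumptions to avoid overflow.

```python
import math

def relu(x):
    return max(0.0, x)

def softplus(x):
    # Stable form of log(1 + exp(x)) that avoids overflow for large x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    return x * math.tanh(softplus(x))

def tanhexp(x):
    # tanh saturates at 1.0 well before exp(20), so clamping the exponent
    # changes nothing numerically but prevents math.exp from overflowing.
    return x * math.tanh(math.exp(min(x, 20.0)))

# Evaluate all three on a few sample inputs.
samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
table = [(x, relu(x), mish(x), tanhexp(x)) for x in samples]
```

Unlike ReLU, both Mish and TanhExp are smooth and allow small negative outputs for negative inputs, at the cost of needing exponential and hyperbolic-tangent evaluation, which hardware designs typically have to approximate.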

Predicting Outcomes of Business Process Executions Based on LSTM Neural Networks and Attention Mechanism

10.21203/rs.3.rs-260970/v1 ◽

2021 ◽

Author(s):

Jiaojiao Wang ◽

Dongjin Yu ◽

Chengfei Liu ◽

Xiaoxiao Sun

Keyword(s):

Short Term Memory ◽

Attention Mechanism ◽

Sequential Data ◽

Time Prediction ◽

Prediction Time ◽

Highly Sensitive ◽

Long Short Term Memory ◽

Early Decision ◽

LSTM Network ◽

Hidden Layer

Effectively predicting the outcome of an on-going process instance helps in making early decisions, which plays an important role in so-called predictive process monitoring. Existing methods in this field are tailor-made for some empirical operations such as prefix extraction, clustering, and encoding, so their relative accuracy is highly sensitive to the dataset. Moreover, they have limitations in real-time prediction applications due to their lengthy prediction time. Since the Long Short-Term Memory (LSTM) neural network provides high precision in the prediction of sequential data in several areas, this paper investigates LSTM and its enhancements and proposes three different approaches to build more effective and efficient models for outcome prediction. The first enhancement is that we combine the original LSTM network in two directions, forward and backward, to capture more features from the completed cases. The second enhancement is that we add an attention mechanism after extracting features in the hidden layer of the LSTM network to distinguish them by their attention weights. A series of extensive experiments is conducted on twelve real datasets to compare with other approaches. The results show that our approaches outperform the state-of-the-art ones in terms of prediction effectiveness and time performance.
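The second enhancement in this abstract, weighting LSTM hidden states by attention, can be sketched as a softmax over per-step scores followed by a weighted sum. The dot-product scoring against a learned `query` vector, the function names, and the toy hidden states are assumptions for illustration; the abstract does not specify the scoring function, and the LSTM itself is not modeled here.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1 (max-shifted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Score each time step against `query`, then return the attention
    weights and the weighted sum of hidden states (the context vector)."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query))
              for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context

# Made-up hidden states for three time steps; in a bidirectional LSTM each
# vector would be the concatenation of the forward and backward states.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
query = [1.0, 0.0]  # hypothetical learned parameter
weights, context = attention_pool(hidden_states, query)
```

The context vector would then feed the outcome classifier, so that time steps with higher attention weights contribute more to the prediction than a plain last-hidden-state readout would allow.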
