Algorithm-hardware Co-design of Attention Mechanism on FPGA Devices

2021 ◽  
Vol 20 (5s) ◽  
pp. 1-24
Author(s):  
Xinyi Zhang ◽  
Yawen Wu ◽  
Peipei Zhou ◽  
Xulong Tang ◽  
Jingtong Hu

Multi-head self-attention (attention mechanism) has been employed in a variety of fields such as machine translation, language modeling, and image processing due to its superiority in feature extraction and sequential data analysis. This is benefited from a large number of parameters and sophisticated model architecture behind the attention mechanism. To efficiently deploy attention mechanism on resource-constrained devices, existing works propose to reduce the model size by building a customized smaller model or compressing a big standard model. A customized smaller model is usually optimized for the specific task and needs effort in model parameters exploration. Model compression reduces model size without hurting the model architecture robustness, which can be efficiently applied to different tasks. The compressed weights in the model are usually regularly shaped (e.g. rectangle) but the dimension sizes vary (e.g. differs in rectangle height and width). Such compressed attention mechanism can be efficiently deployed on CPU/GPU platforms as their memory and computing resources can be flexibly assigned with demand. However, for Field Programmable Gate Arrays (FPGAs), the data buffer allocation and computing kernel are fixed at run time to achieve maximum energy efficiency. After compression, weights are much smaller and different in size, which leads to inefficient utilization of FPGA on-chip buffer. Moreover, the different weight heights and widths may lead to inefficient FPGA computing kernel execution. Due to the large number of weights in the attention mechanism, building a unique buffer and computing kernel for each compressed weight on FPGA is not feasible. In this work, we jointly consider the compression impact on buffer allocation and the required computing kernel during the attention mechanism compressing. A novel structural pruning method with memory footprint awareness is proposed and the associated accelerator on FPGA is designed. The experimental results show that our work can compress Transformer (an attention mechanism based model) by 95x. The developed accelerator can fully utilize the FPGA resource, processing the sparse attention mechanism with the run-time throughput performance of 1.87 Tops in ZCU102 FPGA.

2020 ◽  
Vol 34 (04) ◽  
pp. 6623-6630
Author(s):  
Li Yang ◽  
Zhezhi He ◽  
Deliang Fan

Deep convolutional neural network (DNN) has demonstrated phenomenal success and been widely used in many computer vision tasks. However, its enormous model size and high computing complexity prohibits its wide deployment into resource limited embedded system, such as FPGA and mGPU. As the two most widely adopted model compression techniques, weight pruning and quantization compress DNN model through introducing weight sparsity (i.e., forcing partial weights as zeros) and quantizing weights into limited bit-width values, respectively. Although there are works attempting to combine the weight pruning and quantization, we still observe disharmony between weight pruning and quantization, especially when more aggressive compression schemes (e.g., Structured pruning and low bit-width quantization) are used. In this work, taking FPGA as the test computing platform and Processing Elements (PE) as the basic parallel computing unit, we first propose a PE-wise structured pruning scheme, which introduces weight sparsification with considering of the architecture of PE. In addition, we integrate it with an optimized weight ternarization approach which quantizes weights into ternary values ({-1,0,+1}), thus converting the dominant convolution operations in DNN from multiplication-and-accumulation (MAC) to addition-only, as well as compressing the original model (from 32-bit floating point to 2-bit ternary representation) by at least 16 times. Then, we investigate and solve the coexistence issue between PE-wise Structured pruning and ternarization, through proposing a Weight Penalty Clipping (WPC) technique with self-adapting threshold. Our experiment shows that the fusion of our proposed techniques can achieve the best state-of-the-art ∼21× PE-wise structured compression rate with merely 1.74%/0.94% (top-1/top-5) accuracy degradation of ResNet-18 on ImageNet dataset.


2015 ◽  
Vol 2015 ◽  
pp. 1-15 ◽  
Author(s):  
Luis Andres Cardona ◽  
Carles Ferrer

The Internal Configuration Access Port (ICAP) is the core component of any dynamic partial reconfigurable system implemented in Xilinx SRAM-based Field Programmable Gate Arrays (FPGAs). We developed a new high speed ICAP controller, named AC_ICAP, completely implemented in hardware. In addition to similar solutions to accelerate the management of partial bitstreams and frames, AC_ICAP also supports run-time reconfiguration of LUTs without requiring precomputed partial bitstreams. This last characteristic was possible by performing reverse engineering on the bitstream. Besides, we adapted this hardware-based solution to provide IP cores accessible from the MicroBlaze processor. To this end, the controller was extended and three versions were implemented to evaluate its performance when connected to Peripheral Local Bus (PLB), Fast Simplex Link (FSL), and AXI interfaces of the processor. In consequence, the controller can exploit the flexibility that the processor offers but taking advantage of the hardware speed-up. It was implemented in both Virtex-5 and Kintex7 FPGAs. Results of reconfiguration time showed that run-time reconfiguration of single LUTs in Virtex-5 devices was performed in less than 5 μs which implies a speed-up of more than 380x compared to the Xilinx XPS_HWICAP controller.


2016 ◽  
Vol 78 (7-5) ◽  
Author(s):  
Muhammad Amin Hashim ◽  
Yuan Wen Hau ◽  
Rabia Baktheri

This paper studies two different Electrocardiography (ECG) preprocessing algorithms, namely Pan and Tompkins (PT) and Derivative Based (DB) algorithm, which is crucial of QRS complex detection in cardiovascular disease detection. Both algorithms are compared in terms of QRS detection accuracy and computation timing performance, with implementation on System-on-Chip (SoC) based embedded system that prototype on Altera DE2-115 Field Programmable Gate Array (FPGA) platform as embedded software. Both algorithms are tested with 30 minutes ECG data from each of 48 different patient records obtain from MIT-BIH arrhythmia database. Results show that PT algorithm achieve 98.15% accuracy with 56.33 seconds computation while DB algorithm achieve 96.74% with only 22.14 seconds processing time. Based on the study, an optimized PT algorithm with improvement on Moving Windows Integrator (MWI) has been proposed to accelerate its computation. Result shows that the proposed optimized Moving Windows Integrator algorithm achieves 9.5 times speed up than original MWI while retaining its QRS detection accuracy. 


2021 ◽  
Vol 27 (3) ◽  
pp. 57-70
Author(s):  
Damjan M. Rakanovic ◽  
Vuk Vranjkovic ◽  
Rastislav J. R. Struharik

Paper proposes a two-step Convolutional Neural Network (CNN) pruning algorithm and resource-efficient Field-programmable gate array (FPGA) CNN accelerator named “Argus”. The proposed CNN pruning algorithm first combines similar kernels into clusters, which are then pruned using the same regular pruning pattern. The pruning algorithm is carefully tailored for FPGAs, considering their resource characteristics. Regular sparsity results in high Multiply-accumulate (MAC) efficiency, reducing the amount of logic required to balance workloads among different MAC units. As a result, the Argus accelerator requires about 170 Look-up tables (LUTs) per Digital Signal Processor (DSP) block. This number is close to the average LUT/DPS ratio for various FPGA families, enabling balanced resource utilization when implementing Argus. Benchmarks conducted using Xilinx Zynq Ultrascale + Multi-Processor System-on-Chip (MPSoC) indicate that Argus is achieving up to 25 times higher frames per second than NullHop, 2 and 2.5 times higher than NEURAghe and Snowflake, respectively, and 2 times higher than NVDLA. Argus shows comparable performance to MIT’s Eyeriss v2 and Caffeine, requiring up to 3 times less memory bandwidth and utilizing 4 times fewer DSP blocks, respectively. Besides the absolute performance, Argus has at least 1.3 and 2 times better GOP/s/DSP and GOP/s/Block-RAM (BRAM) ratios, while being competitive in terms of GOP/s/LUT, compared to some of the state-of-the-art solutions.


Electronics ◽  
2021 ◽  
Vol 10 (18) ◽  
pp. 2272
Author(s):  
Safa Bouguezzi ◽  
Hana Ben Fredj ◽  
Tarek Belabed ◽  
Carlos Valderrama ◽  
Hassene Faiedh ◽  
...  

Convolutional Neural Networks (CNN) continue to dominate research in the area of hardware acceleration using Field Programmable Gate Arrays (FPGA), proving its effectiveness in a variety of computer vision applications such as object segmentation, image classification, face detection, and traffic signs recognition, among others. However, there are numerous constraints for deploying CNNs on FPGA, including limited on-chip memory, CNN size, and configuration parameters. This paper introduces Ad-MobileNet, an advanced CNN model inspired by the baseline MobileNet model. The proposed model uses an Ad-depth engine, which is an improved version of the depth-wise separable convolution unit. Moreover, we propose an FPGA-based implementation model that supports the Mish, TanhExp, and ReLU activation functions. The experimental results using the CIFAR-10 dataset show that our Ad-MobileNet has a classification accuracy of 88.76% while requiring little computational hardware resources. Compared to state-of-the-art methods, our proposed method has a fairly high recognition rate while using fewer computational hardware resources. Indeed, the proposed model helps to reduce hardware resources by more than 41% compared to that of the baseline model.


2021 ◽  
Author(s):  
Jiaojiao Wang ◽  
Dongjin Yu ◽  
Chengfei Liu ◽  
Xiaoxiao Sun

Abstract To effectively predict the outcome of an on-going process instance helps make an early decision, which plays an important role in so-called predictive process monitoring. Existing methods in this field are tailor-made for some empirical operations such as the prefix extraction, clustering, and encoding, leading that their relative accuracy is highly sensitive to the dataset. Moreover, they have limitations in real-time prediction applications due to the lengthy prediction time. Since Long Short-term Memory (LSTM) neural network provides a high precision in the prediction of sequential data in several areas, this paper investigates LSTM and its enhancements and proposes three different approaches to build more effective and efficient models for outcome prediction. The first move on enhancement is that we combine the original LSTM network from two directions, forward and backward, to capture more features from the completed cases. The second move on enhancement is that we add attention mechanism after extracting features in the hidden layer of LSTM network to distinct them from their attention weight. A series of extensive experiments are evaluated on twelve real datasets when comparing with other approaches. The results show that our approaches outperform the state-of-the-art ones in terms of prediction effectiveness and time performance.


Sign in / Sign up

Export Citation Format

Share Document