Improving Power of DSP and CNN Hardware Accelerators Using Approximate Floating-point Multipliers

Vasileios Leon; Theodora Paparouni; Evangelos Petrongonas; Dimitrios Soudris; Kiamal Pekmestzi

doi:10.1145/3448980

ACM Transactions on Embedded Computing Systems ◽

10.1145/3448980 ◽

2021 ◽

Vol 20 (5) ◽

pp. 1-21

Author(s):

Vasileios Leon ◽

Theodora Paparouni ◽

Evangelos Petrongonas ◽

Dimitrios Soudris ◽

Kiamal Pekmestzi

Keyword(s):

Floating Point ◽

Approximate Computing ◽

Dynamic Configuration ◽

Embedded Computing ◽

Power Efficient ◽

Approximation Techniques ◽

Energy Efficient Computing ◽

Error Resiliency ◽

Inherent Error ◽

Energy Gains

Approximate computing has emerged as a promising design alternative for delivering power-efficient systems and circuits by exploiting the inherent error resiliency of numerous applications. This article tackles the high hardware cost of floating-point multiplication units, which prohibits their use in embedded computing. We introduce AFMU (Approximate Floating-point MUltiplier), an area/power-efficient family of multipliers that applies two approximation techniques in the resource-hungry mantissa multiplication and can be seamlessly extended to support dynamic configuration of the approximation levels via gating signals. AFMU offers large accuracy configuration margins, incurs negligible logic overhead for dynamic configuration, and detects unexpected results that may arise due to the approximations. Our evaluation shows that AFMU delivers energy gains in the range 3.6%–53.5% for half-precision and 37.2%–82.4% for single-precision, in exchange for a mean relative error of around 0.05%–3.33% and 0.01%–2.20%, respectively. In comparison with state-of-the-art multipliers, AFMU exhibits up to 4–6× smaller error on average while delivering more energy-efficient computing. The evaluation in image processing shows that AFMU provides sufficient quality of service, i.e., more than 50 dB PSNR and SSIM values near 1, and up to 57.4% power reduction. When used in floating-point CNNs, the accuracy loss is small (or zero), i.e., up to 5.4% for MNIST and CIFAR-10, in exchange for up to 63.8% power gain.
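
The abstract does not say which two approximation techniques AFMU applies, so the sketch below is not AFMU's design. It illustrates the general idea of trading mantissa precision for accuracy by using plain mantissa truncation on IEEE 754 single-precision operands, and it measures the mean relative error metric the abstract reports. The names `truncate_mantissa`, `approx_mul`, and `mean_relative_error`, the operand range [1, 2), and the sample count are all assumptions made for illustration.

```python
import random
import struct


def truncate_mantissa(x: float, kept_bits: int) -> float:
    """Zero the low-order mantissa bits of x, viewed as an IEEE 754 single.

    kept_bits is the number of explicit mantissa bits retained (0 to 23).
    This is a generic truncation, assumed for illustration only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - kept_bits
    mask = (0xFFFFFFFF << drop) & 0xFFFFFFFF  # keeps sign, exponent, top bits
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]


def approx_mul(a: float, b: float, kept_bits: int) -> float:
    """Multiply two operands after truncating their mantissas."""
    return truncate_mantissa(a, kept_bits) * truncate_mantissa(b, kept_bits)


def mean_relative_error(kept_bits: int, samples: int = 10000, seed: int = 0) -> float:
    """Average |approx - exact| / |exact| over random operands in [1, 2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        a = rng.uniform(1.0, 2.0)
        b = rng.uniform(1.0, 2.0)
        exact = a * b
        total += abs(approx_mul(a, b, kept_bits) - exact) / exact
    return total / samples


if __name__ == "__main__":
    # Fewer retained mantissa bits should give a larger mean relative error.
    for kept in (4, 8, 12, 16):
        print(kept, mean_relative_error(kept))
```

Sweeping `kept_bits` in this way mimics, in software only, the kind of accuracy configuration margin the abstract describes; it says nothing about the energy or area of a hardware implementation.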

Design of area and power efficient Radix-4 DIT FFT butterfly unit using floating point fused arithmetic

Journal of Central South University ◽

10.1007/s11771-016-3221-y ◽

2016 ◽

Vol 23 (7) ◽

pp. 1669-1681 ◽

Cited By ~ 9

Author(s):

E. Prabhu ◽

H. Mangalam ◽

S. Karthick

Keyword(s):

Floating Point ◽

Power Efficient

FPGA Implementation of Power Efficient Floating Point Fused Multiply-Add Unit

2021 10th IEEE International Conference on Communication Systems and Network Technologies (CSNT) ◽

10.1109/csnt51715.2021.9509678 ◽

2021 ◽

Author(s):

K. Mounika ◽

P.V. Ramana

Keyword(s):

Fpga Implementation ◽

Floating Point ◽

Power Efficient

FPGA-Based Scalable and Power-Efficient Fluid Simulation using Floating-Point DSP Blocks

IEEE Transactions on Parallel and Distributed Systems ◽

10.1109/tpds.2017.2691770 ◽

2017 ◽

Vol 28 (10) ◽

pp. 2823-2837 ◽

Cited By ~ 16

Author(s):

Kentaro Sano ◽

Satoru Yamamoto

Keyword(s):

Fluid Simulation ◽

Floating Point ◽

Power Efficient

Power-Efficient Computing: Experiences from the COSA Project

Scientific Programming ◽

10.1155/2017/7206595 ◽

2017 ◽

Vol 2017 ◽

pp. 1-14 ◽

Cited By ~ 5

Author(s):

Daniele Cesini ◽

Elena Corni ◽

Antonio Falabella ◽

Andrea Ferraro ◽

Lucia Morganti ◽

...

Keyword(s):

Power Systems ◽

Energy Efficient ◽

Nuclear Physics ◽

Energy Performance ◽

High Energy ◽

High Ratio ◽

Scientific Applications ◽

Computing Systems ◽

Power Efficient ◽

Energy Efficient Computing

Energy consumption is today one of the most relevant issues in operating HPC systems for scientific applications. The use of unconventional computing systems is therefore of great interest for several scientific communities looking for a better tradeoff between time-to-solution and energy-to-solution. In this context, assessing the performance of processors with a high performance-per-watt ratio is necessary to understand how to build energy-efficient computing systems for scientific applications using this class of processors. Computing On SOC Architecture (COSA) is a three-year project (2015–2017) funded by the Scientific Commission V of the Italian Institute for Nuclear Physics (INFN), which aims to investigate the performance and the total cost of ownership offered by computing systems based on commodity low-power Systems on Chip (SoCs) and highly energy-efficient systems based on GP-GPUs. In this work, we present the results of the project, analyzing the performance of several scientific applications on several GPU- and SoC-based systems. We also describe the methodology we have used to measure energy performance and the tools we have implemented to monitor the power drawn by applications while running.

Approximate Computing: An Energy-Efficient Computing Technique for Error Resilient Applications

2015 IEEE Computer Society Annual Symposium on VLSI ◽

10.1109/isvlsi.2015.130 ◽

2015 ◽

Cited By ~ 14

Author(s):

Kaushik Roy ◽

Anand Raghunathan

Keyword(s):

Energy Efficient ◽

Approximate Computing ◽

Computing Technique ◽

Error Resilient ◽

Energy Efficient Computing

A unified flagged prefix constant addition-subtraction scheme for design of area and power efficient binary floating-point and constant integer arithmetic circuits

2014 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS) ◽

10.1109/apccas.2014.7032721 ◽

2014 ◽

Author(s):

Soumya Ganguly ◽

Abhishek Mittal ◽

Syed Ershad Ahmed ◽

M.B. Srinivas

Keyword(s):

Arithmetic Circuits ◽

Floating Point ◽

Subtraction Scheme ◽

Power Efficient ◽

Integer Arithmetic ◽

Constant Addition

Implementation of Embedded Floating Point Arithmetic Units on FPGA

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.550.126 ◽

2014 ◽

Vol 550 ◽

pp. 126-136

Author(s):

N. Ramya Rani

Keyword(s):

High Speed ◽

High Performance ◽

Floating Point ◽

Double Precision ◽

Embedded Computing ◽

Floating Point Arithmetic ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Arithmetic Units ◽

Point Arithmetic

Floating point arithmetic plays a major role in scientific and embedded computing applications. However, the performance of field programmable gate arrays (FPGAs) used for floating point applications is poor due to the complexity of floating point arithmetic. Implementing floating point units on FPGAs consumes a large amount of resources, which has led to the development of embedded floating point units in FPGAs. Embedded applications such as multimedia, communication and DSP algorithms use floating point arithmetic in processing graphics, Fourier transformation, coding, etc. In this paper, methodologies are presented for the implementation of embedded floating point units on FPGA. The work aims to achieve high computation speed and to reduce the power consumed in evaluating expressions. An application that demands high performance floating point computation can achieve better speed and density by incorporating embedded floating point units. Additionally, this paper describes a comparative study of the design of single precision and double precision pipelined floating point arithmetic units for evaluating expressions. The modules are designed using VHDL simulation in Xilinx software and implemented on VIRTEX and SPARTAN FPGAs.

Area‐ and power‐efficient iterative single/double‐precision merged floating‐point multiplier on FPGA

IET Computers & Digital Techniques ◽

10.1049/iet-cdt.2016.0100 ◽

2017 ◽

Vol 11 (4) ◽

pp. 149-158 ◽

Cited By ~ 3

Author(s):

Hao Zhang ◽

Dongdong Chen ◽

Seok‐Bum Ko

Keyword(s):

Floating Point ◽

Double Precision ◽

Power Efficient

A novel power efficient 0.64-GFlops fused 32-bit reversible floating point arithmetic unit architecture for digital signal processing applications

Microprocessors and Microsystems ◽

10.1016/j.micpro.2017.01.002 ◽

2017 ◽

Vol 51 ◽

pp. 366-385 ◽

Cited By ~ 3

Author(s):

A.V. AnanthaLakshmi ◽

Gnanou Florence Sudha

Keyword(s):

Signal Processing ◽

Digital Signal Processing ◽

Digital Signal ◽

Floating Point ◽

Arithmetic Unit ◽

Power Efficient ◽

Floating Point Arithmetic ◽

Point Arithmetic

Design Space Exploration on High-Order QAM Demodulation Circuits: Algorithms, Arithmetic and Approximation Techniques

Electronics ◽

10.3390/electronics11010039 ◽

2021 ◽

Vol 11 (1) ◽

pp. 39

Author(s):

Ioannis Stratakos ◽

Vasileios Leon ◽

Giorgos Armeniakos ◽

George Lentaris ◽

Dimitrios Soudris

Keyword(s):

Fixed Point ◽

Orthogonal Frequency Division Multiplexing ◽

Design Space Exploration ◽

Performance Metrics ◽

Circuit Complexity ◽

Error Rates ◽

High Order ◽

Approximate Computing ◽

Clock Frequency ◽

Approximation Techniques

Every new generation of wireless communication standard aims to improve the overall performance and quality of service (QoS), compared to the previous generations. Increased data rates, numbers and capabilities of connected devices, new applications, and higher data volume transfers are some of the key parameters that are of interest. To satisfy these increased requirements, the synergy between wireless technologies and optical transport will dominate the 5G network topologies. This work focuses on a fundamental digital function in an orthogonal frequency-division multiplexing (OFDM) baseband transceiver architecture and aims at improving the throughput and circuit complexity of this function. Specifically, we consider the high-order QAM demodulation and apply approximation techniques to achieve our goals. We adopt approximate computing as a design strategy to exploit the error resiliency of the QAM function and deliver significant gains in terms of critical performance metrics. Particularly, we take into consideration and explore four demodulation algorithms and develop accurate floating- and fixed-point circuits in VHDL. In addition, we further explore the effects of introducing approximate arithmetic components. For our test case, we consider 64-QAM demodulators, and the results suggest that, in terms of accuracy, the most promising design provides bit error rates (BER) ranging from 10^-1 to 10^-4 for SNR 0–14 dB. Targeting a Xilinx Zynq Ultrascale+ ZCU106 (XCZU7EV) FPGA device, the approximate circuits achieve up to 98% reduction in LUT utilization, compared to the accurate floating-point model of the same algorithm, and up to a 122% increase in operating frequency. In terms of power consumption, our most efficient circuit configurations consume 0.6–1.1 W when operating at their maximum clock frequency. Our results show that if the objective is to achieve high accuracy in terms of BER, the prevailing solution is the approximate LLR algorithm configured with fixed-point arithmetic and 8-bit truncation, which provides an 81% decrease in LUTs and a 13% increase in frequency and sustains a throughput of 323 Msamples/s.
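
To make the idea of fixed-point truncation in a demodulator concrete, here is a minimal software sketch under stated assumptions. It is not the paper's approximate LLR algorithm or its VHDL circuits: it performs a simple nearest-level hard decision on one axis of a 64-QAM constellation (eight levels), truncates each received sample to a fixed-point grid, and reports a symbol error rate rather than a BER. The function names, the level set, the Gaussian noise model, and the parameter values are all hypothetical.

```python
import math
import random

# One axis (in-phase or quadrature) of a 64-QAM constellation: 8 levels.
LEVELS = [-7, -5, -3, -1, 1, 3, 5, 7]


def to_fixed(x: float, frac_bits: int) -> float:
    """Truncate x (round toward minus infinity) to frac_bits fractional bits."""
    scale = 1 << frac_bits
    return math.floor(x * scale) / scale


def hard_decision(x: float) -> int:
    """Return the constellation level closest to x."""
    return min(LEVELS, key=lambda level: abs(level - x))


def symbol_error_rate(noise_std: float, frac_bits: int,
                      symbols: int = 20000, seed: int = 1) -> float:
    """Fraction of symbols decided wrongly after noise and truncation."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(symbols):
        sent = rng.choice(LEVELS)
        received = to_fixed(sent + rng.gauss(0.0, noise_std), frac_bits)
        if hard_decision(received) != sent:
            errors += 1
    return errors / symbols


if __name__ == "__main__":
    # Compare coarse and fine fixed-point grids at one noise level.
    for frac_bits in (1, 2, 4, 8):
        print(frac_bits, symbol_error_rate(0.5, frac_bits))
```

Varying `frac_bits` at a fixed noise level gives a rough feel for how coarser arithmetic can degrade decisions, which is the accuracy side of the trade-off the abstract weighs against LUT count and clock frequency.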
