Improving Power of DSP and CNN Hardware Accelerators Using Approximate Floating-point Multipliers

2021 ◽  
Vol 20 (5) ◽  
pp. 1-21
Author(s):  
Vasileios Leon ◽  
Theodora Paparouni ◽  
Evangelos Petrongonas ◽  
Dimitrios Soudris ◽  
Kiamal Pekmestzi

Approximate computing has emerged as a promising design alternative for delivering power-efficient systems and circuits by exploiting the inherent error resiliency of numerous applications. The current article aims to tackle the increased hardware cost of floating-point multiplication units, which prohibits their usage in embedded computing. We introduce AFMU (Approximate Floating-point MUltiplier), an area/power-efficient family of multipliers, which apply two approximation techniques in the resource-hungry mantissa multiplication and can be seamlessly extended to support dynamic configuration of the approximation levels via gating signals. AFMU offers large accuracy configuration margins, provides negligible logic overhead for dynamic configuration, and detects unexpected results that may arise due to the approximations. Our evaluation shows that AFMU delivers energy gains in the range 3.6%–53.5% for half-precision and 37.2%–82.4% for single-precision, in exchange for mean relative error around 0.05%–3.33% and 0.01%–2.20%, respectively. In comparison with state-of-the-art multipliers, AFMU exhibits up to 4–6× smaller error on average while delivering more energy-efficient computing. The evaluation in image processing shows that AFMU provides sufficient quality of service, i.e., more than 50 db PSNR and near 1 SSIM values, and up to 57.4% power reduction. When used in floating-point CNNs, the accuracy loss is small (or zero), i.e., up to 5.4% for MNIST and CIFAR-10, in exchange for up to 63.8% power gain.

2017 ◽  
Vol 2017 ◽  
pp. 1-14 ◽  
Author(s):  
Daniele Cesini ◽  
Elena Corni ◽  
Antonio Falabella ◽  
Andrea Ferraro ◽  
Lucia Morganti ◽  
...  

Energy consumption is today one of the most relevant issues in operating HPC systems for scientific applications. The use of unconventional computing systems is therefore of great interest for several scientific communities looking for a better tradeoff between time-to-solution and energy-to-solution. In this context, the performance assessment of processors with a high ratio of performance per watt is necessary to understand how to realize energy-efficient computing systems for scientific applications, using this class of processors. Computing On SOC Architecture (COSA) is a three-year project (2015–2017) funded by the Scientific Commission V of the Italian Institute for Nuclear Physics (INFN), which aims to investigate the performance and the total cost of ownership offered by computing systems based on commodity low-power Systems on Chip (SoCs) and high energy-efficient systems based on GP-GPUs. In this work, we present the results of the project analyzing the performance of several scientific applications on several GPU- and SoC-based systems. We also describe the methodology we have used to measure energy performance and the tools we have implemented to monitor the power drained by applications while running.


2014 ◽  
Vol 550 ◽  
pp. 126-136
Author(s):  
N. Ramya Rani

:Floating point arithmetic plays a major role in scientific and embedded computing applications. But the performance of field programmable gate arrays (FPGAs) used for floating point applications is poor due to the complexity of floating point arithmetic. The implementation of floating point units on FPGAs consumes a large amount of resources and that leads to the development of embedded floating point units in FPGAs. Embedded applications like multimedia, communication and DSP algorithms use floating point arithmetic in processing graphics, Fourier transformation, coding, etc. In this paper, methodologies are presented for the implementation of embedded floating point units on FPGA. The work is focused with the aim of achieving high speed of computations and to reduce the power for evaluating expressions. An application that demands high performance floating point computation can achieve better speed and density by incorporating embedded floating point units. Additionally this paper describes a comparative study of the design of single precision and double precision pipelined floating point arithmetic units for evaluating expressions. The modules are designed using VHDL simulation in Xilinx software and implemented on VIRTEX and SPARTAN FPGAs.


Electronics ◽  
2021 ◽  
Vol 11 (1) ◽  
pp. 39
Author(s):  
Ioannis Stratakos ◽  
Vasileios Leon ◽  
Giorgos Armeniakos ◽  
George Lentaris ◽  
Dimitrios Soudris

Every new generation of wireless communication standard aims to improve the overall performance and quality of service (QoS), compared to the previous generations. Increased data rates, numbers and capabilities of connected devices, new applications, and higher data volume transfers are some of the key parameters that are of interest. To satisfy these increased requirements, the synergy between wireless technologies and optical transport will dominate the 5G network topologies. This work focuses on a fundamental digital function in an orthogonal frequency-division multiplexing (OFDM) baseband transceiver architecture and aims at improving the throughput and circuit complexity of this function. Specifically, we consider the high-order QAM demodulation and apply approximation techniques to achieve our goals. We adopt approximate computing as a design strategy to exploit the error resiliency of the QAM function and deliver significant gains in terms of critical performance metrics. Particularly, we take into consideration and explore four demodulation algorithms and develop accurate floating- and fixed-point circuits in VHDL. In addition, we further explore the effects of introducing approximate arithmetic components. For our test case, we consider 64-QAM demodulators, and the results suggest that the most promising design provides bit error rates (BER) ranging from 10−1 to 10−4 for SNR 0–14 dB in terms of accuracy. Targeting a Xilinx Zynq Ultrascale+ ZCU106 (XCZU7EV) FPGA device, the approximate circuits achieve up to 98% reduction in LUT utilization, compared to the accurate floating-point model of the same algorithm, and up to a 122% increase in operating frequency. In terms of power consumption, our most efficient circuit configurations consume 0.6–1.1 W when operating at their maximum clock frequency. Our results show that if the objective is to achieve high accuracy in terms of BER, the prevailing solution is the approximate LLR algorithm configured with fixed-point arithmetic and 8-bit truncation, providing 81% decrease in LUTs and 13% increase in frequency and sustains a throughput of 323 Msamples/s.


Sign in / Sign up

Export Citation Format

Share Document