An Automated Fixed-Point Optimization Tool in MATLAB XSG/SynDSP Environment

ISRN Signal Processing ◽

10.5402/2011/414293 ◽

2011 ◽

Vol 2011 ◽

pp. 1-17 ◽

Cited By ~ 10

Author(s):

Cheng C. Wang ◽

Changchun Shi ◽

Robert W. Brodersen ◽

Dejan Marković

Keyword(s):

Fixed Point ◽

Low Power ◽

High Performance ◽

Floating Point ◽

View Point ◽

Hardware Cost ◽

System Generator ◽

Area Estimation ◽

Complex Optimization ◽

Automated Tool

This paper presents an automated tool for floating-point to fixed-point conversion. The tool builds on previous work developed in the MATLAB/Simulink environment with Xilinx System Generator support, and is now extended to include Synplify DSP blocksets in a way that is seamless from the user's point of view. In addition to FPGA area estimation, the tool now includes ASIC area estimation for users who choose the ASIC flow. The tool minimizes hardware cost subject to mean-squared quantization error (MSE) constraints. To make ASIC area estimates agree more closely with synthesized results, three performance levels are available, suited to high-performance, typical, or low-power applications. The tool is first illustrated on an FIR filter, where it achieves over 50% area savings for an MSE specification of 10⁻⁶ compared with an all-16-bit realization. More complex optimization results for chip-level designs are also demonstrated.
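The trade-off between word length and an MSE constraint can be illustrated with a small Python sketch. It is not the tool described in the abstract: it assumes a single uniform number of fractional bits for coefficients, inputs, and outputs, and simply searches for the smallest value that meets an MSE target. The function names, the 8-tap coefficient set, and the random test signal are hypothetical choices made for illustration.

```python
import random


def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (no saturation modeled)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def fir(coeffs, samples):
    """Direct-form FIR filter with zero initial state."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out


def output_mse(coeffs, samples, frac_bits):
    """MSE between the floating-point output and a uniformly quantized version."""
    ref = fir(coeffs, samples)
    q_coeffs = [quantize(c, frac_bits) for c in coeffs]
    q_samples = [quantize(s, frac_bits) for s in samples]
    test = [quantize(y, frac_bits) for y in fir(q_coeffs, q_samples)]
    return sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)


def smallest_frac_bits(coeffs, samples, mse_target, max_bits=24):
    """Smallest uniform fractional word length meeting the MSE target, or None."""
    for bits in range(1, max_bits + 1):
        if output_mse(coeffs, samples, bits) <= mse_target:
            return bits
    return None


if __name__ == "__main__":
    rng = random.Random(1)
    samples = [rng.uniform(-1.0, 1.0) for _ in range(200)]
    # Hypothetical symmetric low-pass taps that sum to 1.0.
    coeffs = [0.02, 0.08, 0.17, 0.23, 0.23, 0.17, 0.08, 0.02]
    print(smallest_frac_bits(coeffs, samples, 1e-6))
```

A per-signal optimizer would assign a different word length to each node and weigh each bit by an area model; the uniform search above only shows how an MSE specification bounds the precision from below.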

Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations

Philosophical Transactions of The Royal Society A Mathematical Physical and Engineering Sciences ◽

10.1098/rsta.2019.0052 ◽

2020 ◽

Vol 378 (2166) ◽

pp. 20190052 ◽

Cited By ~ 4

Author(s):

Michael Hopkins ◽

Mantas Mikaitis ◽

Dave R. Lester ◽

Steve Furber

Keyword(s):

Fixed Point ◽

Differential Equations ◽

Ordinary Differential Equations ◽

High Performance ◽

Floating Point ◽

Double Precision ◽

Least Significant Bit ◽

Fixed Point Arithmetic ◽

Solution Algorithms ◽

Point Arithmetic

Although double-precision floating-point arithmetic currently dominates high-performance computing, there is increasing interest in smaller and simpler arithmetic types. The main reasons are potential improvements in energy efficiency and memory footprint and bandwidth. However, simply switching to lower-precision types typically results in increased numerical errors. We investigate approaches to improving the accuracy of reduced-precision fixed-point arithmetic types, using examples in an important domain for numerical computation in neuroscience: the solution of ordinary differential equations (ODEs). The Izhikevich neuron model is used to demonstrate that rounding has an important role in producing accurate spike timings from explicit ODE solution algorithms. In particular, fixed-point arithmetic with stochastic rounding consistently results in smaller errors compared to single-precision floating-point and fixed-point arithmetic with round-to-nearest across a range of neuron behaviours and ODE solvers. A computationally much cheaper alternative is also investigated, inspired by the concept of dither that is a widely understood mechanism for providing resolution below the least significant bit in digital signal processing. These results will have implications for the solution of ODEs in other subject areas, and should also be directly relevant to the huge range of practical problems that are represented by partial differential equations. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
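One plausible intuition for why rounding matters in explicit ODE solvers is that each step adds a small increment to the state, and an increment below half the least significant bit is lost entirely under round-to-nearest. The Python sketch below is a toy accumulation, not the Izhikevich model or the authors' implementation; the function names, the step size, and the 8 fractional bits are assumptions chosen for illustration.

```python
import math
import random


def round_nearest(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (ties rounded up)."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


def round_stochastic(x, frac_bits, rng):
    """Round x down or up to the fixed-point grid at random.

    The probability of rounding up equals the discarded fraction,
    so the result is unbiased in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    frac = scaled - lower
    return (lower + (1 if rng.random() < frac else 0)) / scale


def accumulate(step, n, frac_bits, rounder):
    """Add `step` to a running total n times, rounding after every addition."""
    total = 0.0
    for _ in range(n):
        total = rounder(total + step, frac_bits)
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    # The exact sum is 10.0; the step is below half of the LSB (1/256).
    print(accumulate(0.001, 10000, 8, round_nearest))
    print(accumulate(0.001, 10000, 8, lambda x, b: round_stochastic(x, b, rng)))
```

With these settings round-to-nearest never leaves zero, while the stochastically rounded total stays close to the exact sum on average, at the cost of one random draw per operation.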

Low-precision Floating-point Arithmetic for High-performance FPGA-based CNN Acceleration

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3474597 ◽

2022 ◽

Vol 15 (1) ◽

pp. 1-21

Author(s):

Chen Wu ◽

Mingyu Wang ◽

Xinyuan Chu ◽

Kun Wang ◽

Lei He

Keyword(s):

Fixed Point ◽

High Performance ◽

Good Accuracy ◽

Data Representation ◽

Floating Point ◽

Average Throughput ◽

Precision Data ◽

Content Type ◽

Point Arithmetic ◽

Better Than

Low-precision data representation is important to reduce storage size and memory access for convolutional neural networks (CNNs). Yet, existing methods have two major limitations: (1) requiring re-training to maintain accuracy for deep CNNs and (2) needing 16-bit floating-point or 8-bit fixed-point for good accuracy. In this article, we propose a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration to overcome the above limitations. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication by one 4-bit multiply-adder and one 3-bit adder, and therefore implement four 8-bit LPFP multiplications using one DSP48E1 of Xilinx Kintex-7 family or DSP48E2 of Xilinx Ultrascale/Ultrascale+ family, whereas one DSP can implement only two 8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that on average, we improve throughput by over existing FPGA accelerators. Particularly for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput by 3.5× and 27.5× and average throughput per DSP by 4.1× and 5×, respectively.

High Performance and Low Power Fixed-point Special Function Unit for Mobile Vertex Processors

JOURNAL OF ELECTRONICS INFORMATION TECHNOLOGY ◽

10.3724/sp.j.1146.2011.00480 ◽

2011 ◽

Vol 33 (11) ◽

pp. 2764-2770 ◽

Cited By ~ 1

Author(s):

Ji-ye Jiao ◽

Rong Mu ◽

Yue Hao ◽

You-yao Liu

Keyword(s):

Fixed Point ◽

Low Power ◽

High Performance ◽

Special Function ◽

Function Unit

A low-power carry cut-back approximate adder with fixed-point implementation and floating-point precision

Proceedings of the 53rd Annual Design Automation Conference on - DAC '16 ◽

10.1145/2897937.2897964 ◽

2016 ◽

Cited By ~ 6

Author(s):

Vincent Camus ◽

Jeremy Schlachter ◽

Christian Enz

Keyword(s):

Fixed Point ◽

Low Power ◽

Floating Point

Reduced Area and Low Power Implementation of FFT/IFFT Processor

Iraqi Journal for Electrical And Electronic Engineering ◽

10.37917/ijeee.14.2.3 ◽

2018 ◽

Vol 14 (2) ◽

pp. 108-119 ◽

Cited By ~ 1

Author(s):

Shefa Dawwd ◽

Suha Nori

Keyword(s):

Fixed Point ◽

Shared Memory ◽

Digital Signal ◽

Processing Unit ◽

Memory Location ◽

Hardware Cost ◽

System Generator ◽

Place Memory ◽

Input Sample ◽

Memory Scheme

The Fast Fourier Transform (FFT) and Inverse FFT (IFFT) are used in most digital signal processing applications, and many of these applications require real-time implementation. In this paper, a reconfigurable fixed-point FPGA implementation of the FFT/IFFT is presented. The proposed FFT/IFFT processor is modeled in hand-written VHDL code. Two CORDIC-based FFT/IFFT processors, based on radix-2 and radix-4 architectures, are designed. Each has one butterfly processing unit. An efficient in-place memory assignment and addressing scheme for the shared memory of the FFT/IFFT processors is proposed to reduce the complexity of the memory scheme. With the "in-place" strategy, the outputs of a butterfly operation are stored back to the same memory locations as its inputs. Because a DIF FFT is used, the output appears in reverse order. To solve this issue, the block RAM used for storing the input samples is reused as a reordering unit, which reduces the hardware cost of the proposed processor. A Spartan-3E FPGA of 500,000 gates is used to synthesize and implement the proposed architecture. The CORDIC-based processors can save 40% of power consumption compared with the Xilinx logic core architectures of System Generator.
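The in-place strategy and the reverse-order output can be sketched in a few lines of Python. This is a floating-point software model under stated assumptions, not the VHDL design from the abstract: twiddle factors come from `cmath.exp` rather than a CORDIC rotator, only the radix-2 case is shown, and the function names are hypothetical.

```python
import cmath


def fft_dif_inplace(x):
    """In-place radix-2 decimation-in-frequency FFT.

    Each butterfly writes its two results back to the two slots it read from.
    The spectrum is left in bit-reversed order.
    """
    n = len(x)
    assert n >= 1 and n & (n - 1) == 0, "length must be a power of two"
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))
                a = x[start + k]
                b = x[start + k + span]
                x[start + k] = a + b
                x[start + k + span] = (a - b) * w
        span //= 2


def bit_reverse_reorder(x):
    """Swap elements in place so a bit-reversed sequence becomes natural order."""
    n = len(x)
    bits = n.bit_length() - 1
    for i in range(n):
        j = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        if i < j:
            x[i], x[j] = x[j], x[i]


if __name__ == "__main__":
    data = [complex(v) for v in (1, 2, 3, 4, 0, 0, 0, 0)]
    fft_dif_inplace(data)
    bit_reverse_reorder(data)
    print(data)
```

The reordering pass here works on the same list as the transform, which loosely mirrors the idea of reusing an existing memory as the reordering unit instead of allocating a second buffer.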

A self control strategy for a delta inverter fed BDCM drive using Xilinx system generator with fixed point/floating point mode

2016 17th International Conference on Sciences and Techniques of Automatic Control and Computer Engineering (STA) ◽

10.1109/sta.2016.7952033 ◽

2016 ◽

Cited By ~ 1

Author(s):

Asma Alouane ◽

Asma Ben Rhouma ◽

Adel Khedher

Keyword(s):

Fixed Point ◽

Control Strategy ◽

Self Control ◽

Floating Point ◽

System Generator ◽

Xilinx System Generator

Design of a Low Power, High Performance BICMOS Current-limiting Circuit for DC-DC Converter Application

PIERS Online ◽

10.2529/piers060817034009 ◽

2007 ◽

Vol 3 (4) ◽

pp. 368-373 ◽

Cited By ~ 5

Author(s):

Hongbo Ma ◽

Quanyuan Feng

Keyword(s):

Low Power ◽

High Performance ◽

Current Limiting

Efficient Instruction and Data Caching for High Performance Embedded Processors

Jornada de Jóvenes Investigadores del I3A ◽

10.26754/jji-i3a.201201788 ◽

1970 ◽

pp. 9

Author(s):

A. Ferrerón Labari ◽

D. Suárez Gracia ◽

V. Viñals Yúfera

Keyword(s):

Embedded Systems ◽

Power Consumption ◽

Low Power ◽

Interconnection Networks ◽

High Performance ◽

Critical Issue ◽

Content Management ◽

Structure Design ◽

Portable Devices ◽

On Chip

In recent years, embedded systems have evolved to offer capabilities previously found only in high-performance systems. Portable devices already have on-chip multiprocessors (such as the PowerPC 476FP or ARM Cortex A9 MP), usually multi-threaded, and a powerful multi-level on-chip cache memory hierarchy. As most of these systems are battery-powered, power consumption becomes a critical issue. Achieving both high performance and low power consumption is a highly complex challenge for which some proposals have already been made. Suarez et al. proposed a new on-chip cache hierarchy, the LP-NUCA (Low Power NUCA), which reduces access latency by taking advantage of the properties of NUCA (Non-Uniform Cache Architectures). The key points are decoupling the functionality and using three specialized on-chip networks. This structure has proved efficient for data hierarchies, achieving good performance and reducing energy consumption. Instruction caches, on the other hand, have different requirements and characteristics from data caches, and these conflict with the requirements of low-power embedded systems, especially in SMT (simultaneous multi-threading) environments. We want to study the benefits of using small tiled caches for the instruction hierarchy, so we propose a new design, ID-LP-NUCAs. This requires a complete re-evaluation of our previous design in terms of structure design, interconnection networks (including topologies, flow control and routing), content management (with special interest in hardware/software content allocation policies), and structure sharing. In CMP (chip multiprocessor) environments with parallel workloads, coherence plays an important role and must be taken into consideration.

High-Performance and Low-Power Full Color Reflective LCD for New Applications

Proceedings of the International Display Workshops ◽

10.36463/idw.2019.1411 ◽

2019 ◽

pp. 1411

Author(s):

Hiroyuki Hakoi ◽

Ming Ni ◽

Junichi Hashimoto ◽

Takashi Sato ◽

Shinji Shimada ◽

...

Keyword(s):

Low Power ◽

High Performance ◽

Full Color ◽

New Applications

Performance Analysis of Various Multipliers Using 8T-full Adder with 180nm Technology

Recent Advances in Electrical & Electronic Engineering (Formerly Recent Patents on Electrical & Electronic Engineering) ◽

10.2174/2352096513666200107091932 ◽

2020 ◽

Vol 13 (6) ◽

pp. 864-870

Author(s):

Sai Venkatramana Prasada G.S ◽

G. Seshikala ◽

S. Niranjana

Keyword(s):

Low Power ◽

Power Dissipation ◽

High Speed ◽

High Performance ◽

Full Adder ◽

Fundamental Operation ◽

Wallace Tree ◽

Power Delay Product ◽

The Comparative Study ◽

Wallace Tree Multiplier

Background: This paper presents a comparative study of the power dissipation, delay and power delay product (PDP) of different full adder and multiplier designs. Methods: The full adder is a fundamental building block of processors, DSP architectures and VLSI systems. Ten different full adder structures were analyzed for their best performance using a Mentor Graphics tool with 180nm technology. Results: From the analysis, the best-performing full adder is selected for higher-level designs. The 8T full adder exhibits high speed, low power delay and low power delay product, and hence it is used to construct four different multiplier designs: the Array multiplier, Baugh-Wooley multiplier, Braun multiplier and Wallace Tree multiplier. These multiplier structures were designed using the 8T full adder and simulated using the Mentor Graphics tool with a constant W/L aspect ratio. Conclusion: The analysis shows that the Wallace Tree multiplier is the fastest multiplier but dissipates comparatively high power. The Baugh-Wooley multiplier dissipates less power but exhibits a longer delay and a low PDP.
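How full adders compose into a multiplier can be shown with a bit-level functional model in Python. This is a logic-only sketch: it says nothing about power, delay, or the 8T transistor-level circuit, and it uses a simple row-by-row shift-and-add arrangement rather than the exact Braun, Baugh-Wooley, or Wallace Tree wiring. All function names are hypothetical.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out) for input bits a, b, cin."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add(x_bits, y_bits):
    """Add two equal-length little-endian bit lists with a chain of full adders."""
    carry = 0
    out = []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    out.append(carry)
    return out


def array_multiply(x, y, width):
    """Unsigned multiply of two `width`-bit integers using AND gates and full adders."""
    x_bits = [(x >> i) & 1 for i in range(width)]
    y_bits = [(y >> i) & 1 for i in range(width)]
    acc = [0] * (2 * width)
    for j, yb in enumerate(y_bits):
        # Partial-product row: x ANDed with bit j of y, shifted left by j places.
        row = [0] * j + [xb & yb for xb in x_bits] + [0] * (width - j)
        acc = ripple_add(acc, row)[: 2 * width]
    return sum(bit << i for i, bit in enumerate(acc))


if __name__ == "__main__":
    print(array_multiply(13, 11, 4))
```

The multiplier architectures compared in the abstract differ mainly in how the partial-product rows are summed, which is what drives their different speed and power figures; the model above only fixes the arithmetic they all must reproduce.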
