scholarly journals An Automated Fixed-Point Optimization Tool in MATLAB XSG/SynDSP Environment

2011 ◽  
Vol 2011 ◽  
pp. 1-17 ◽  
Author(s):  
Cheng C. Wang ◽  
Changchun Shi ◽  
Robert W. Brodersen ◽  
Dejan Marković

This paper presents an automated tool for floating-point to fixed-point conversion. The tool is based on previous work that was built in MATLAB/Simulink environment and Xilinx System Generator support. The tool is now extended to include Synplify DSP blocksets in a seamless way from the users' view point. In addition to FPGA area estimation, the tool now also includes ASIC area estimation for end-users who choose the ASIC flow. The tool minimizes hardware cost subject to mean-squared quantization error (MSE) constraints. To obtain more accurate ASIC area estimations with synthesized results, 3 performance levels are available to choose from, suitable for high-performance, typical, or low-power applications. The use of the tool is first illustrated on an FIR filter to achieve over 50% area savings for MSE specification of 10−6 as compared to all 16-bit realization. More complex optimization results for chip-level designs are also demonstrated.

Author(s):  
Michael Hopkins ◽  
Mantas Mikaitis ◽  
Dave R. Lester ◽  
Steve Furber

Although double-precision floating-point arithmetic currently dominates high-performance computing, there is increasing interest in smaller and simpler arithmetic types. The main reasons are potential improvements in energy efficiency and memory footprint and bandwidth. However, simply switching to lower-precision types typically results in increased numerical errors. We investigate approaches to improving the accuracy of reduced-precision fixed-point arithmetic types, using examples in an important domain for numerical computation in neuroscience: the solution of ordinary differential equations (ODEs). The Izhikevich neuron model is used to demonstrate that rounding has an important role in producing accurate spike timings from explicit ODE solution algorithms. In particular, fixed-point arithmetic with stochastic rounding consistently results in smaller errors compared to single-precision floating-point and fixed-point arithmetic with round-to-nearest across a range of neuron behaviours and ODE solvers. A computationally much cheaper alternative is also investigated, inspired by the concept of dither that is a widely understood mechanism for providing resolution below the least significant bit in digital signal processing. These results will have implications for the solution of ODEs in other subject areas, and should also be directly relevant to the huge range of practical problems that are represented by partial differential equations. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.


2022 ◽  
Vol 15 (1) ◽  
pp. 1-21
Author(s):  
Chen Wu ◽  
Mingyu Wang ◽  
Xinyuan Chu ◽  
Kun Wang ◽  
Lei He

Low-precision data representation is important to reduce storage size and memory access for convolutional neural networks (CNNs). Yet, existing methods have two major limitations: (1) requiring re-training to maintain accuracy for deep CNNs and (2) needing 16-bit floating-point or 8-bit fixed-point for a good accuracy. In this article, we propose a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration to overcome the above limitations. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication by one 4-bit multiply-adder and one 3-bit adder, and therefore implement four 8-bit LPFP multiplications using one DSP48E1 of Xilinx Kintex-7 family or DSP48E2 of Xilinx Ultrascale/Ultrascale+ family, whereas one DSP can implement only two 8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that on average, we improve throughput by over existing FPGA accelerators. Particularly for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput by 3.5 and 27.5 and average throughput per DSP by 4.1 and 5 , respectively.


2018 ◽  
Vol 14 (2) ◽  
pp. 108-119 ◽  
Author(s):  
Shefa Dawwd ◽  
Suha Nori

The Fast Fourier Transform (FFT) and Inverse FFT(IFFT) are used in most of the digital signal processing applications. Real time implementation of FFT/IFFT is required in many of these applications. In this paper, an FPGA reconfigurable fixed point implementation of FFT/IFFT is presented. A manually VHDL codes are written to model the proposed FFT/IFFT processor. Two CORDIC-based FFT/IFFT processors based on radix-2and radix-4 architecture are designed. They have one butterfly processing unit. An efficient In-place memory assignment and addressing for the shared memory of FFT/IFFT processors are proposed to reduce the complexity of memory scheme. With "in-place" strategy, the outputs of butterfly operation are stored back to the same memory location of the inputs. Because of using DIF FFT, the output was to be in reverse order. To solve this issue, we have re-use the block RAM that used for storing the input sample as reordering unit to reduce hardware cost of the proposed processor. The Spartan-3E FPGA of 500,000 gates is employed to synthesize and implement the proposed architecture. The CORDIC based processors can save 40% of power consumption as compared with Xilinx logic core architectures of system generator.


Author(s):  
A. Ferrerón Labari ◽  
D. Suárez Gracia ◽  
V. Viñals Yúfera

In the last years, embedded systems have evolved so that they offer capabilities we could only find before in high performance systems. Portable devices already have multiprocessors on-chip (such as PowerPC 476FP or ARM Cortex A9 MP), usually multi-threaded, and a powerful multi-level cache memory hierarchy on-chip. As most of these systems are battery-powered, the power consumption becomes a critical issue. Achieving high performance and low power consumption is a high complexity challenge where some proposals have been already made. Suarez et al. proposed a new cache hierarchy on-chip, the LP-NUCA (Low Power NUCA), which is able to reduce the access latency taking advantage of NUCA (Non-Uniform Cache Architectures) properties. The key points are decoupling the functionality, and utilizing three specialized networks on-chip. This structure has been proved to be efficient for data hierarchies, achieving a good performance and reducing the energy consumption. On the other hand, instruction caches have different requirements and characteristics than data caches, contradicting the low-power embedded systems requirements, especially in SMT (simultaneous multi-threading) environments. We want to study the benefits of utilizing small tiled caches for the instruction hierarchy, so we propose a new design, ID-LP-NUCAs. Thus, we need to re-evaluate completely our previous design in terms of structure design, interconnection networks (including topologies, flow control and routing), content management (with special interest in hardware/software content allocation policies), and structure sharing. In CMP environments (chip multiprocessors) with parallel workloads, coherence plays an important role, and must be taken into consideration.


Author(s):  
Hiroyuki Hakoi ◽  
Ming Ni ◽  
Junichi Hashimoto ◽  
Takashi Sato ◽  
Shinji Shimada ◽  
...  

Author(s):  
Sai Venkatramana Prasada G.S ◽  
G. Seshikala ◽  
S. Niranjana

Background: This paper presents the comparative study of power dissipation, delay and power delay product (PDP) of different full adders and multiplier designs. Methods: Full adder is the fundamental operation for any processors, DSP architectures and VLSI systems. Here ten different full adder structures were analyzed for their best performance using a Mentor Graphics tool with 180nm technology. Results: From the analysis result high performance full adder is extracted for further higher level designs. 8T full adder exhibits high speed, low power delay and low power delay product and hence it is considered to construct four different multiplier designs, such as Array multiplier, Baugh Wooley multiplier, Braun multiplier and Wallace Tree multiplier. These different structures of multipliers were designed using 8T full adder and simulated using Mentor Graphics tool in a constant W/L aspect ratio. Conclusion: From the analysis, it is concluded that Wallace Tree multiplier is the high speed multiplier but dissipates comparatively high power. Baugh Wooley multiplier dissipates less power but exhibits more time delay and low PDP.


Sign in / Sign up

Export Citation Format

Share Document