Architecture Exploration of High-Performance Floating-Point Fused Multiply-Add Units and their Automatic Use in High-Level Synthesis

Due to performance and energy requirements, FPGA-based accelerators have become a promising solution for high-performance computations. Meanwhile, with the help of high-level synthesis (HLS) compilers, FPGAs can be programmed using common programming languages such as C, C++, or OpenCL, thereby improving design efficiency and portability. Stencil computations are significant kernels in various scientific applications. In this paper, we introduce an architecture design for implementing stencil kernels on state-of-the-art FPGAs with high bandwidth memory (HBM). Traditional FPGAs are usually equipped with external memory, e.g., DDR3 or DDR4, which limits the design space exploration in the spatial domain of stencil kernels. Therefore, many previous studies relied mainly on exploiting parallelism in the temporal domain to work around the bandwidth limitations. In our approach, we scale up the design performance by giving equal consideration to the spatial and temporal parallelism of the stencil kernel. We also discuss the design portability among different HLS compilers. We use typical stencil kernels to evaluate our design on a Xilinx U280 FPGA board and compare the results with other existing studies. By adopting our method, developers can choose from a broad range of parallelization strategies based on the available FPGA resources to improve performance.
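As a rough illustration of the two kinds of parallelism the abstract refers to, the sketch below runs a 3-point averaging stencil on a 1-D grid. It is a software model only, not the architecture from the paper: the function names, the `lanes` parameter (cells updated together, standing in for spatial parallelism), and the `stages` parameter (time steps chained back to back, standing in for temporal parallelism) are all assumptions made for this example.

```python
def stencil_step(grid):
    """One time step of a 3-point averaging stencil; boundary cells stay fixed."""
    out = list(grid)
    for i in range(1, len(grid) - 1):
        out[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return out


def stencil_step_spatial(grid, lanes):
    """Same time step, with the interior processed in groups of `lanes` cells.

    In hardware, the cells of one group would be updated in the same cycle;
    here the grouping only changes the loop structure, not the result.
    """
    out = list(grid)
    n = len(grid)
    for base in range(1, n - 1, lanes):
        for i in range(base, min(base + lanes, n - 1)):
            out[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return out


def run_temporal(grid, stages, lanes):
    """Chain `stages` time steps, each feeding the next (temporal parallelism)."""
    for _ in range(stages):
        grid = stencil_step_spatial(grid, lanes)
    return grid


grid = [0.0, 0.0, 3.0, 0.0, 0.0]
one_step = stencil_step(grid)
two_steps = run_temporal(grid, stages=2, lanes=2)
```

Both variants compute the same values as the plain loop; what differs on an FPGA is how many cells (`lanes`) and how many time steps (`stages`) are in flight at once, and therefore how much memory bandwidth and logic each choice consumes.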

Buffer Placement and Sizing for High-Performance Dataflow Circuits

ACM Transactions on Reconfigurable Technology and Systems ◽ 10.1145/3477053 ◽ 2022 ◽ Vol 15 (1) ◽ pp. 1-32

Author(s): Lana Josipović ◽ Shabnam Sheikhha ◽ Andrea Guerrieri ◽ Paolo Ienne ◽ Jordi Cortadella

Keyword(s): Performance Optimization ◽ Optimization Model ◽ High Performance ◽ Control Flow ◽ High Level Synthesis ◽ Software Applications ◽ Marked Graphs ◽ Variable Latency ◽ High Level ◽ Strong Contrast

Commercial high-level synthesis tools typically produce statically scheduled circuits. Yet, effective C-to-circuit conversion of arbitrary software applications calls for dataflow circuits, as they can efficiently handle variable latencies (e.g., caches), unpredictable memory dependencies, and irregular control flow. Dataflow circuits exhibit an unconventional property: registers (usually referred to as “buffers”) can be placed anywhere in the circuit without changing its semantics, in strong contrast to what happens in traditional datapaths. Yet, although functionally irrelevant, this placement has a significant impact on the circuit’s timing and throughput. In this work, we show how to strategically place buffers into a dataflow circuit to optimize its performance. Our approach extracts a set of choice-free critical loops from arbitrary dataflow circuits and relies on the theory of marked graphs to optimize the buffer placement and sizing. Our performance optimization model supports important high-level synthesis features such as pipelined computational units, units with variable latency and throughput, and if-conversion. We demonstrate the performance benefits of our approach on a set of dataflow circuits obtained from imperative code.
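To make the marked-graph reasoning concrete, the sketch below uses the textbook bound for marked graphs: each cycle sustains at most (tokens on the cycle) / (total latency of the cycle), and the slowest cycle limits the whole circuit. This is not the optimization model from the paper, which the abstract does not spell out. The data layout (a cycle as a list of `(latency, tokens)` pairs) and the function names are assumptions for this example, and every cycle is assumed to have positive total latency.

```python
from fractions import Fraction


def cycle_throughput(cycle):
    """Token-to-latency ratio of one cycle.

    `cycle` is a list of (latency, tokens) pairs, one per unit or channel
    on the loop. Assumes the total latency is positive.
    """
    latency = sum(lat for lat, _ in cycle)
    tokens = sum(tok for _, tok in cycle)
    return Fraction(tokens, latency)


def circuit_throughput(cycles):
    """Throughput bound of the circuit: the minimum ratio over all cycles."""
    return min(cycle_throughput(c) for c in cycles)


# Two hypothetical loops, each carrying one token.
loop_a = [(1, 1), (2, 0), (1, 0)]   # total latency 4 -> ratio 1/4
loop_b = [(1, 1), (1, 0)]           # total latency 2 -> ratio 1/2

before = circuit_throughput([loop_a, loop_b])

# A register that adds one cycle of latency to loop_b lowers only that loop's ratio.
loop_b_buffered = loop_b + [(1, 0)]  # total latency 3 -> ratio 1/3
after = circuit_throughput([loop_a, loop_b_buffered])
```

In this toy case the extra register on `loop_b` leaves the bound unchanged, because `loop_a` is still the critical loop. Placing the same register on `loop_a` would lower the bound, which is the kind of trade-off between timing and throughput that the abstract describes.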

Area Optimization of Combined Integer and Floating Point Circuits in High-Level Synthesis

2008 4th Southern Conference on Programmable Logic ◽ 10.1109/spl.2008.4547764 ◽ 2008 ◽ Cited By ~ 2

Author(s): Esther Andres ◽ Maria C. Molina ◽ Guillermo Botella ◽ Alberto del Barrio ◽ Jose M. Mendias

Keyword(s): High Level Synthesis ◽ Floating Point ◽ Area Optimization ◽ High Level

Templatised Soft Floating-Point for High-Level Synthesis

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) ◽ 10.1109/fccm.2019.00038 ◽ 2019 ◽ Cited By ~ 4

Author(s): David B. Thomas

Keyword(s): High Level Synthesis ◽ Floating Point ◽ High Level

Apply high-level synthesis design and verification methodology on floating-point unit implementation

Technical Papers of 2014 International Symposium on VLSI Design, Automation and Test ◽ 10.1109/vlsi-dat.2014.6834921 ◽ 2014 ◽ Cited By ~ 2

Author(s): Chia-I Chen ◽ Chin-Yeh Yu ◽ Yen-Ju Lu ◽ Chi-Feng Wu

Keyword(s): High Level Synthesis ◽ Floating Point ◽ Synthesis Design ◽ Floating Point Unit ◽ High Level ◽ Verification Methodology

High-Level Synthesis under Fixed-Point Accuracy Constraint

Journal of Electrical and Computer Engineering ◽ 10.1155/2012/906350 ◽ 2012 ◽ Vol 2012 ◽ pp. 1-14 ◽ Cited By ~ 6

Author(s): Daniel Menard ◽ Nicolas Herve ◽ Olivier Sentieys ◽ Hai-Nam Nguyen

Keyword(s): Signal Processing ◽ Fixed Point ◽ Word Length ◽ High Level Synthesis ◽ Floating Point ◽ Fixed Point Arithmetic ◽ Resource Binding ◽ High Level ◽ Point Arithmetic ◽ Length Optimization

Implementing signal processing applications in embedded systems generally requires the use of fixed-point arithmetic. The main problem slowing down the hardware implementation flow is the lack of high-level development tools to target these architectures from an algorithmic specification language that uses floating-point data types. In this paper, a new method to automatically implement a floating-point algorithm on an FPGA or an ASIC using fixed-point arithmetic is proposed. An iterative process combining high-level synthesis and data word-length optimization is used to improve both of these interdependent processes. Indeed, high-level synthesis requires operator word-length knowledge to correctly execute its allocation, scheduling, and resource binding steps. Moreover, the word-length optimization requires resource binding and scheduling information to correctly group operations. To dramatically reduce the optimization time compared to fixed-point simulation-based methods, the accuracy evaluation is done through an analytical method. Different experiments on signal processing algorithms are presented to show the efficiency of the proposed method. Compared to classical methods, the average architecture area reduction is between 10% and 28%.
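The link between word length and accuracy can be illustrated with a small sketch. The abstract does not describe the analytical accuracy model used in the paper, so the example below falls back on the textbook uniform-noise approximation, in which rounding to a step of q = 2^-f contributes a noise power of q^2 / 12. The function names and the `frac_bits` parameter are assumptions for this example; overflow and saturation are ignored.

```python
def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (no overflow handling)."""
    scale = 2 ** frac_bits
    return round(x * scale) / scale


def rounding_noise_power(frac_bits):
    """Textbook estimate of rounding noise power: q**2 / 12 with q = 2**-frac_bits."""
    q = 2.0 ** -frac_bits
    return q * q / 12.0


coarse = quantize(0.3, 2)   # step 1/4
fine = quantize(0.3, 8)     # step 1/256

# Each extra fractional bit halves q, so the estimated noise power drops by 4x.
ratio = rounding_noise_power(4) / rounding_noise_power(5)
```

A closed-form estimate of this kind can be evaluated without running a fixed-point simulation for every candidate word length, which is the general motivation the abstract gives for evaluating accuracy analytically.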
