High-Performance Mixed-Precision Linear Solver for FPGAs

Junqing Sun; G.D. Peterson; O.O. Storaasli

doi:10.1109/tc.2008.89

Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software

ACM Transactions on Mathematical Software ◽

10.1145/3441850 ◽

2021 ◽

Vol 47 (2) ◽

pp. 1-28

Author(s):

Goran Flegar ◽

Hartwig Anzt ◽

Terry Cojean ◽

Enrique S. Quintana-Ortí

Keyword(s):

Linear Algebra ◽

Graphics Processing Units ◽

High Performance ◽

Numerical Algorithms ◽

Mixed Precision ◽

Before And After ◽

Memory Accesses ◽

Specialized Hardware ◽

The Individual ◽

Graphics Processing

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts to carefully reduce the working precision in order to speed up computations. For algorithms whose performance is bound by memory bandwidth, the idea of compressing their data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator, such as a preconditioner, in lower than working precision, ideally without impacting the algorithm's output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner that selects the precision format used to store the preconditioner data on the fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats that tailor the length of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.
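The per-block precision selection described in this abstract can be illustrated with a small sketch. It is not Ginkgo's implementation. It assumes the "numerical property" of a block is its 1-norm condition number, and that a format is acceptable when the condition number times the format's unit roundoff stays below a tolerance. The function names and the tolerance `tol` are hypothetical, and the customized exponent/significand formats mentioned in the abstract are not covered.

```python
# Unit roundoff of each candidate IEEE storage format (half, single, double).
UNIT_ROUNDOFF = {"fp16": 2.0**-11, "fp32": 2.0**-24, "fp64": 2.0**-53}


def invert(block):
    """Gauss-Jordan inverse with partial pivoting for a small dense block.

    Assumes the block is nonsingular; a zero pivot raises ZeroDivisionError.
    """
    n = len(block)
    # Augment the block with the identity matrix.
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(block)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def norm1(m):
    """Matrix 1-norm: the largest absolute column sum."""
    return max(sum(abs(row[j]) for row in m) for j in range(len(m)))


def choose_format(block, tol=1e-2):
    """Return the cheapest format with cond_1(block) * unit_roundoff <= tol.

    The criterion and the default tolerance are assumptions for illustration.
    """
    cond = norm1(block) * norm1(invert(block))
    for name in ("fp16", "fp32", "fp64"):
        if cond * UNIT_ROUNDOFF[name] <= tol:
            return name
    return "fp64"  # fall back to full working precision
```

Under these assumptions, a well-conditioned block such as `[[4.0, 1.0], [1.0, 3.0]]` would be stored in half precision, while nearly singular blocks are kept in single or double precision.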


BISWSRBS: A Winograd-based CNN Accelerator with a Fine-grained Regular Sparsity Pattern and Mixed Precision Quantization

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3467476 ◽

2021 ◽

Vol 14 (4) ◽

pp. 1-28

Author(s):

Tao Yang ◽

Zhezhi He ◽

Tengchuan Kou ◽

Qingzheng Li ◽

Qi Han ◽

...

Keyword(s):

High Performance ◽

State Of The Art ◽

The State ◽

Optimization Approach ◽

Quantization Scheme ◽

Model Accuracy ◽

Sparsity Pattern ◽

Computing Platform ◽

Energy Efficiency Improvement ◽

Mixed Precision

Field-programmable gate arrays (FPGAs) are a high-performance computing platform for convolutional neural network (CNN) inference. The Winograd algorithm, weight pruning, and quantization are widely adopted to reduce the storage and arithmetic overhead of CNNs on FPGAs. Recent studies prune the weights in the Winograd domain; however, this results in irregular sparse patterns, leading to low parallelism and reduced resource utilization. Moreover, few works discuss a suitable quantization scheme for Winograd. In this article, we propose a regular sparse pruning pattern for Winograd-based CNNs, namely the Sub-row-balanced Sparsity (SRBS) pattern, to overcome the challenge of the irregular sparse pattern. Then, we develop a two-step hardware co-optimization approach to improve the model accuracy using the SRBS pattern. Based on the pruned model, we implement a mixed precision quantization to further reduce the computational complexity of bit operations. Finally, we design an FPGA accelerator that takes advantage of both the SRBS pattern, to eliminate low-parallelism computation and irregular memory accesses, and the mixed precision quantization, to obtain a layer-wise bit width. Experimental results on VGG16/VGG-nagadomi with CIFAR-10 and ResNet-18/34/50 with ImageNet show up to 11.8×/8.67× and 8.17×/8.31×/10.6× speedup, and 12.74×/9.19× and 8.75×/8.81×/11.1× energy efficiency improvement, respectively, compared with the state-of-the-art dense Winograd accelerator [20], with negligible loss of model accuracy. We also show that our design achieves a 4.11× speedup over the state-of-the-art sparse Winograd accelerator [19] on VGG16.


DrMP: Mixed Precision-Aware DRAM for High Performance Approximate and Precise Computing

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT) ◽

10.1109/pact.2017.34 ◽

2017 ◽

Cited By ~ 5

Author(s):

Xianwei Zhang ◽

Youtao Zhang ◽

Bruce R. Childers ◽

Jun Yang

Keyword(s):

High Performance ◽

Mixed Precision


Fast reconstruction tools for ptychography at Sirius, the fourth-generation Brazilian synchrotron

Journal of Applied Crystallography ◽

10.1107/s1600576720013886 ◽

2020 ◽

Vol 53 (6) ◽

pp. 1550-1558

Author(s):

Giovanni L. Baraldi ◽

Carlos S. B. Dias ◽

Francisco M. C. Silva ◽

Hélio C. N. Tolentino ◽

Eduardo X. Miqueles

Keyword(s):

High Speed ◽

High Performance ◽

Phase Retrieval ◽

Scattering Data ◽

Fourth Generation ◽

Acquisition Time ◽

Optimization Strategy ◽

X Ray ◽

Mixed Precision ◽

Synchrotron Light

Described here are image reconstruction optimizations for ptychographic coherent X-ray scattering data and X-ray fluorescence, which have been developed for the new fourth-generation synchrotron light source, Sirius, at the Brazilian Synchrotron Light Laboratory. The optimization strategy has been applied to the standard experimental strategy for ptychographic and fluorescence experiments on the Carnaúba beamline, which involves the use of high-speed continuous scans (fly scans) for a fast acquisition time over large areas through a newly proposed trajectory named the alternating linear trajectory. The scientific computing developments presented here target an efficient use of graphical processing units (GPUs), to the point where large fly-scan acquisitions can be processed in real time on a local high-performance computer. Some optimizations, involving a custom fast Fourier transform implementation and the use of mixed precision, can be applied to other algorithms and phase-retrieval techniques, and therefore this work provides a general optimization scheme. Finally, the optimization strategy presented here has improved performance by a factor of ∼2.5 compared with non-optimized GPU implementations.


Supporting Mixed-domain Mixed-precision Matrix Multiplication within the BLIS Framework

ACM Transactions on Mathematical Software ◽

10.1145/3402225 ◽

2021 ◽

Vol 47 (2) ◽

pp. 1-26

Author(s):

Field G. Van Zee ◽

Devangi N. Parikh ◽

Robert A. Van De Geijn

Keyword(s):

High Performance ◽

Matrix Multiplication ◽

Software Framework ◽

Matrix Product ◽

Double Precision ◽

Precision Matrix ◽

Implementation Approach ◽

Mixed Precision ◽

The Matrix ◽

Performance Results

We approach the problem of implementing mixed-datatype support within the general matrix multiplication (gemm) operation of the BLAS-like Library Instantiation Software (BLIS) framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or complex values. Another factor of complexity, whereby the matrix product and accumulation are allowed to take place in a precision different from the storage precisions of either A or B, is also discussed. We first break the problem into orthogonal dimensions, considering the mixing of domains separately from the mixing of precisions. Support for all combinations of matrix operands stored in either the real or complex domain is mapped out by enumerating the cases and describing an implementation approach for each. Supporting all combinations of storage and computation precisions is handled by typecasting the matrices at key stages of the computation (during packing and/or accumulation, as needed). Several optional optimizations are also documented. Performance results gathered on a 56-core Marvell ThunderX2 and a 52-core Intel Xeon Platinum demonstrate that high performance is mostly preserved, with modest slowdowns incurred from unavoidable typecast instructions. The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
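The typecasting idea in this abstract (cast operands during packing, accumulate in a chosen computation precision, cast the result to the storage precision of C) can be sketched for the real domain as follows. This is an illustrative toy, not the BLIS API. The function names and the one-letter precision codes `"s"` and `"d"` are hypothetical, and single precision is simulated by rounding Python doubles through the `struct` module.

```python
import struct


def cast(x, prec):
    """Round x to single precision ('s') or keep it in double precision ('d')."""
    if prec == "s":
        return struct.unpack("f", struct.pack("f", x))[0]
    return float(x)


def gemm_mixed(a, b, c, comp_prec, c_prec):
    """Compute C := C + A*B, multiplying and accumulating in comp_prec.

    a, b, c are lists of rows; the result is returned in c_prec storage.
    """
    k, n = len(b), len(b[0])
    # "Packing": typecast both operands to the computation precision once.
    a_packed = [[cast(v, comp_prec) for v in row] for row in a]
    b_packed = [[cast(v, comp_prec) for v in row] for row in b]
    out = []
    for i, a_row in enumerate(a_packed):
        row = []
        for j in range(n):
            acc = cast(c[i][j], comp_prec)
            for p in range(k):
                prod = cast(a_row[p] * b_packed[p][j], comp_prec)
                acc = cast(acc + prod, comp_prec)
            # Typecast the accumulated value to the storage precision of C.
            row.append(cast(acc, c_prec))
        out.append(row)
    return out
```

For example, `gemm_mixed([[0.1]], [[1.0]], [[0.0]], "s", "d")` computes in simulated single precision and returns a value that differs from the double-precision `0.1`, which makes the effect of the computation precision visible independently of how C is stored.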


Fine-grained floating-point precision analysis

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016652462 ◽

2016 ◽

Vol 32 (2) ◽

pp. 231-245 ◽

Cited By ~ 10

Author(s):

Michael O Lam ◽

Jeffrey K Hollingsworth

Keyword(s):

High Performance ◽

Floating Point ◽

Data Sets ◽

Double Precision ◽

Precision Analysis ◽

Memory Space ◽

Incremental Search ◽

Fine Grained ◽

Mixed Precision ◽

Application Developers

Floating-point computation is ubiquitous in high-performance scientific computing, but rounding error can compromise the results of extended calculations, especially at large scales. In this paper, we present new techniques that use binary instrumentation and modification to do fine-grained floating-point precision analysis, simulating any level of precision less than or equal to the precision of the original program. These techniques have an average of 40–70% lower overhead and provide more fine-grained insights into a program’s sensitivity than previous mixed-precision analyses. We also present a novel histogram-based visualization of a program’s floating-point precision sensitivity, as well as an incremental search technique that allows developers to incrementally trade off analysis time for detail, including the ability to restart analyses from where they left off. We present results from several case studies and experiments that show the efficacy of these techniques. Using our tool and its novel visualization, application developers can more quickly determine for specific data sets whether their application could be run using fewer double precision variables, saving both time and memory space.
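The idea of simulating any precision at or below that of the original program can be mimicked at source level with a short sketch. The paper works through binary instrumentation, which this example does not attempt. Here, reduced precision is assumed to mean truncating the stored significand of a double to a chosen number of bits, and the function names are hypothetical.

```python
import struct


def reduce_precision(x, bits):
    """Keep the top `bits` of the 52 stored significand bits of a double.

    The remaining low-order bits are zeroed (truncation toward zero).
    """
    if not 0 <= bits <= 52:
        raise ValueError("bits must be between 0 and 52")
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    mask = ~((1 << (52 - bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    (y,) = struct.unpack("<d", struct.pack("<Q", raw & mask))
    return y


def simulated_sum(values, bits):
    """Sum values, truncating every input and partial sum to `bits` bits."""
    acc = 0.0
    for v in values:
        acc = reduce_precision(acc + reduce_precision(v, bits), bits)
    return acc
```

Running `simulated_sum` on the same data with decreasing values of `bits` gives a rough picture of how sensitive a computation is to precision, in the spirit of the per-data-set analysis the abstract describes.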
