Supporting Mixed-domain Mixed-precision Matrix Multiplication within the BLIS Framework

2021 ◽  
Vol 47 (2) ◽  
pp. 1-26
Author(s):  
Field G. Van Zee ◽  
Devangi N. Parikh ◽  
Robert A. van de Geijn

We approach the problem of implementing mixed-datatype support within the general matrix multiplication (gemm) operation of the BLAS-like Library Instantiation Software (BLIS) framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or complex values. Another factor of complexity, whereby the matrix product and accumulation are allowed to take place in a precision different from the storage precisions of either A or B, is also discussed. We first break the problem into orthogonal dimensions, considering the mixing of domains separately from the mixing of precisions. Support for all combinations of matrix operands stored in either the real or complex domain is mapped out by enumerating the cases and describing an implementation approach for each. Supporting all combinations of storage and computation precisions is handled by typecasting the matrices at key stages of the computation (during packing and/or accumulation, as needed). Several optional optimizations are also documented. Performance results gathered on a 56-core Marvell ThunderX2 and a 52-core Intel Xeon Platinum demonstrate that high performance is mostly preserved, with modest slowdowns incurred from unavoidable typecast instructions. The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
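The typecast-at-key-stages idea can be illustrated with a minimal NumPy sketch (this is an illustration of the concept only, not the BLIS implementation; the function name and structure are our own):

```python
import numpy as np

def gemm_mixed(A, B, C, comp_dtype=np.float64):
    """Illustrative mixed-precision gemm: C := A @ B + C.

    Each operand may be stored in a different precision.  Operands are
    typecast to the computation precision when "packed", and the product
    is cast back to C's storage precision on accumulation -- the same
    stages at which the BLIS framework inserts its typecasts.
    """
    A_packed = A.astype(comp_dtype)    # cast during packing of A
    B_packed = B.astype(comp_dtype)    # cast during packing of B
    acc = A_packed @ B_packed          # product in the computation precision
    C += acc.astype(C.dtype)           # cast back during accumulation
    return C

# Storage precisions of A, B, and C may all differ:
A = np.random.rand(4, 3).astype(np.float32)
B = np.random.rand(3, 5).astype(np.float64)
C = np.zeros((4, 5), dtype=np.float32)
gemm_mixed(A, B, C)
```

Because only the packing and accumulation steps change, the inner product itself can remain a single kernel per computation precision, which is how the combinatorial explosion of datatype cases is avoided.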

2010 ◽  
Vol 38 (3-4) ◽  
pp. 322-338 ◽  
Author(s):  
Vinay B. Y. Kumar ◽  
Siddharth Joshi ◽  
Sachin B. Patkar ◽  
H. Narayanan

Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1984
Author(s):  
Wei Zhang ◽  
Zihao Jiang ◽  
Zhiguang Chen ◽  
Nong Xiao ◽  
Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based systems-on-chip (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency, so it is worthwhile to design a high-performance DGEMM for ARMv8-based SoCs. However, as these SoCs integrate ever more cores, modern CPUs adopt non-uniform memory access (NUMA), which restricts the performance and scalability of DGEMM when many threads access remote NUMA domains. This poses a challenge for developing high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to exploit two levels of parallelism, between and within nodes, in a purely threaded implementation, which preserves the task independence and data locality of the NUMA nodes. We implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng 920 architecture. The results show that NUMA-aware DGEMM effectively reduces cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with the most remarkable improvement being 21.9%.
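The two-level partitioning can be sketched as follows (a sketch of the decomposition logic only, with names of our own choosing; a real implementation would additionally pin threads and allocate each panel in its node's local memory):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def numa_aware_dgemm(A, B, num_nodes=2, chunks_per_node=4):
    """Sketch of two-level parallel DGEMM.

    Level 1: the row panels of A (and the corresponding rows of C) are
    partitioned across NUMA nodes, so each node's threads touch only a
    node-local panel.  Level 2: within a node, the panel is subdivided
    into chunks for the node's own threads (shown here as a simple loop).
    """
    m = A.shape[0]
    C = np.empty((m, B.shape[1]))
    node_panels = np.array_split(np.arange(m), num_nodes)  # level 1: across nodes

    def node_task(rows):
        # level 2: within a node, process the panel chunk by chunk
        for chunk in np.array_split(rows, chunks_per_node):
            C[chunk] = A[chunk] @ B
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        list(pool.map(node_task, node_panels))
    return C
```

The point of the split is that, once a panel is placed in a node's local memory, all of that node's accesses to A and C stay on-node; only B is shared read-only across nodes.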


Author(s):  
P. Raghu ◽  
K. Sriram

Grid computing is a special type of parallel computing that allows us to unite pools of servers, storage systems, and networks into a single large virtual supercomputer. Grid computing has the advantages of solving complex problems in a shorter time and making better use of existing hardware: it can exploit underutilized resources to meet business requirements while minimizing additional costs. Many grid setup tools are available; in this paper, the Globus Toolkit, an open-source tool for grid-enabled applications, is considered. Initially, a grid is established between two systems running Linux using the Globus Toolkit. A simple matrix multiplication program, capable of running both on the grid and on stand-alone systems, is developed. The application is executed on a single system while varying the order of the matrices; the same application is then split into two sub-jobs and run on two grid machines with different orders. Finally, the results of the executions are compared and presented in graphs. The work can be extended further to determine the type of parallelization best suited to the application developed. Similarly, the FP-tree algorithm is taken and its data sets are fed both to different grid machines and to a stand-alone system, and a suitable load-balancing mechanism for grid applications is discussed. The sections of the paper are arranged as follows: introduction to grids, grid setup using the Globus Toolkit, splitting of the matrix application, the FP-tree algorithm, performance results, future work, conclusion, and references.
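The row-wise split into independent sub-jobs can be sketched as below (a sketch of the decomposition only, under our own naming; job submission to the grid machines via the Globus Toolkit is omitted):

```python
import numpy as np

def split_into_subjobs(A, B, n_jobs=2):
    """Split C = A @ B into independent sub-jobs for grid execution.

    Each sub-job receives one horizontal panel of A together with all
    of B, computes its panel of C independently (e.g. on a separate
    grid machine), and the partial results are stacked to form C.
    """
    panels = np.array_split(A, n_jobs, axis=0)
    partial_results = [panel @ B for panel in panels]  # one job per machine
    return np.vstack(partial_results)
```

Because the panels share no data except the read-only B, the sub-jobs need no communication until the final gather, which is what makes this decomposition a natural fit for a loosely coupled grid.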


Author(s):  
Michael O Lam ◽  
Jeffrey K Hollingsworth

Floating-point computation is ubiquitous in high-performance scientific computing, but rounding error can compromise the results of extended calculations, especially at large scales. In this paper, we present new techniques that use binary instrumentation and modification to do fine-grained floating-point precision analysis, simulating any level of precision less than or equal to the precision of the original program. These techniques have an average of 40–70% lower overhead and provide more fine-grained insights into a program’s sensitivity than previous mixed-precision analyses. We also present a novel histogram-based visualization of a program’s floating-point precision sensitivity, as well as an incremental search technique that allows developers to incrementally trade off analysis time for detail, including the ability to restart analyses from where they left off. We present results from several case studies and experiments that show the efficacy of these techniques. Using our tool and its novel visualization, application developers can more quickly determine for specific data sets whether their application could be run using fewer double precision variables, saving both time and memory space.
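The core simulation trick, replacing a double with a value that carries fewer significand bits, can be sketched in a few lines (an illustration of the idea only; the tool described operates on compiled binaries via instrumentation, not on Python floats, and this helper is our own):

```python
import struct

def truncate_precision(x, bits):
    """Simulate a binary64 value restricted to `bits` significand bits
    (0 <= bits <= 52) by zeroing the low-order mantissa bits in place.

    Reinterpret the double as its 64-bit pattern, mask off the discarded
    mantissa bits, and reinterpret back.
    """
    (pattern,) = struct.unpack('<Q', struct.pack('<d', x))
    mask = ~((1 << (52 - bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    (y,) = struct.unpack('<d', struct.pack('<Q', pattern & mask))
    return y
```

Running a computation with every store filtered through such a truncation, at progressively fewer bits, reveals how much precision each variable actually needs, which is the kind of fine-grained sensitivity question the analysis answers.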


2021 ◽  
Vol 7 ◽  
pp. e330
Author(s):  
Massimiliano Fasi ◽  
Nicholas J. Higham ◽  
Mantas Mikaitis ◽  
Srikara Pranesh

We explore the floating-point arithmetic implemented in the NVIDIA tensor cores, which are hardware accelerators for mixed-precision matrix multiplication available on the Volta, Turing, and Ampere microarchitectures. Using Volta V100, Turing T4, and Ampere A100 graphics cards, we determine what precision is used for the intermediate results, whether subnormal numbers are supported, what rounding mode is used, in which order the operations underlying the matrix multiplication are performed, and whether partial sums are normalized. These aspects are not documented by NVIDIA, and we gain insight by running carefully designed numerical experiments on these hardware units. Knowing the answers to these questions is important if one wishes to: (1) accurately simulate NVIDIA tensor cores on conventional hardware; (2) understand the differences between results produced by code that utilizes tensor cores and code that uses only IEEE 754-compliant arithmetic operations; and (3) build custom hardware whose behavior matches that of NVIDIA tensor cores. As part of this work we provide a test suite that can be easily adapted to test newer versions of the NVIDIA tensor cores as well as similar accelerators from other vendors, as they become available. Moreover, we identify a non-monotonicity issue affecting floating point multi-operand adders if the intermediate results are not normalized after each step.
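The style of probe involved can be illustrated on conventional hardware: choose inputs whose dot product differs depending on whether partial sums are held in binary16 or binary32, and observe which answer comes back. The sketch below uses NumPy on a CPU to show the method; it is not the paper's test suite, and the function name is our own.

```python
import numpy as np

def probe_accumulator_precision():
    """Compare a dot product accumulated entirely in binary16 with one
    accumulated in binary32.

    2**-11 is below half the binary16 ulp of 1.0 (which is 2**-10), so
    adding it to 1.0 in binary16 rounds back to 1.0 (ties-to-even),
    while a binary32 accumulator retains it.  Feeding such inputs to a
    black-box unit reveals the precision of its internal accumulation.
    """
    a = np.array([1.0, 2**-11], dtype=np.float16)
    b = np.array([1.0, 1.0], dtype=np.float16)
    sum16 = np.float16(np.float16(a[0] * b[0]) + np.float16(a[1] * b[1]))
    sum32 = np.float32(a[0]) * np.float32(b[0]) + np.float32(a[1]) * np.float32(b[1])
    return sum16, sum32
```

A hardware unit returning the first value accumulates in binary16; one returning the second keeps binary32 partial sums. Analogous constructions distinguish rounding modes, subnormal handling, and summation order.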


1969 ◽  
Vol 10 (1) ◽  
pp. 60-65 ◽  
Author(s):  
J. H. Bevis ◽  
C. K. Martin

In this paper we consider mappings induced by matrix multiplication which are defined on lattices of matrices whose coordinates come from a fixed orthomodular lattice L (i.e. a lattice with an orthocomplementation, denoted by ′, in which a ≦ b ⇒ a ∨ (a′ ∧ b) = b). The set of all m × n matrices over L, with partial order and lattice operations defined coordinatewise, is itself an orthomodular lattice. For conformal matrices A and B, the (i, j)th coordinate of the matrix product AB is defined to be (AB)ij = ∨k (Aik ∧ Bkj). We assume familiarity with the notation and results of [1]; in particular, the (lattice) centre of the matrix lattice consists of those matrices A that commute with every matrix B, written A C B, the commutativity relation being that of [1]. In § 1 it is shown that the mappings characterized by right multiplication X → XP are residuated if and only if P belongs to the centre. (Similarly for left multiplication.) This result is used to show the existence of residuated pairs. Hence, in § 2 we are able to extend a result of Blyth [3] which relates invertible and cancellable matrices (see Theorem 3 and its corollaries). Finally, for right (left) multiplication mappings, characterizations are given in § 3 for closure operators, quantifiers, range closed mappings, and Sasaki projections.

