Supporting Mixed-domain Mixed-precision Matrix Multiplication within the BLIS Framework

2021 ◽  
Vol 47 (2) ◽  
pp. 1-26
Author(s):  
Field G. Van Zee ◽  
Devangi N. Parikh ◽  
Robert A. van de Geijn

We approach the problem of implementing mixed-datatype support within the general matrix multiplication (gemm) operation of the BLAS-like Library Instantiation Software (BLIS) framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or complex values. Another factor of complexity, whereby the matrix product and accumulation are allowed to take place in a precision different from the storage precisions of either A or B, is also discussed. We first break the problem into orthogonal dimensions, considering the mixing of domains separately from the mixing of precisions. Support for all combinations of matrix operands stored in either the real or complex domain is mapped out by enumerating the cases and describing an implementation approach for each. Supporting all combinations of storage and computation precisions is handled by typecasting the matrices at key stages of the computation (during packing and/or accumulation, as needed). Several optional optimizations are also documented. Performance results gathered on a 56-core Marvell ThunderX2 and a 52-core Intel Xeon Platinum demonstrate that high performance is mostly preserved, with modest slowdowns incurred from unavoidable typecast instructions. The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
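The typecast-at-key-stages idea can be illustrated with a minimal NumPy sketch (this is an illustration of the concept only, not the BLIS implementation; the function name and structure are our own):

```python
import numpy as np

def gemm_mixed(A, B, C, comp_dtype=np.float64):
    """Illustrative mixed-precision gemm: C := A @ B + C.

    Each operand may be stored in a different precision.  Operands are
    typecast to the computation precision when "packed", and the product
    is cast back to C's storage precision on accumulation -- the same
    stages at which the BLIS framework inserts its typecasts.
    """
    A_packed = A.astype(comp_dtype)    # cast during packing of A
    B_packed = B.astype(comp_dtype)    # cast during packing of B
    acc = A_packed @ B_packed          # product in the computation precision
    C += acc.astype(C.dtype)           # cast back during accumulation
    return C

# Storage precisions of A, B, and C may all differ:
A = np.random.rand(4, 3).astype(np.float32)
B = np.random.rand(3, 5).astype(np.float64)
C = np.zeros((4, 5), dtype=np.float32)
gemm_mixed(A, B, C)
```

Because only the packing and accumulation steps change, the inner product itself can remain a single kernel per computation precision, which is how the combinatorial explosion of datatype cases is avoided.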

2010 ◽  
Vol 38 (3-4) ◽  
pp. 322-338 ◽  
Author(s):  
Vinay B. Y. Kumar ◽  
Siddharth Joshi ◽  
Sachin B. Patkar ◽  
H. Narayanan

Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1984
Author(s):  
Wei Zhang ◽  
Zihao Jiang ◽  
Zhiguang Chen ◽  
Nong Xiao ◽  
Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based systems-on-chip (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency, so it is worthwhile to design a high-performance DGEMM for ARMv8-based SoCs. However, as these SoCs integrate ever more cores, modern CPUs adopt non-uniform memory access (NUMA), which restricts the performance and scalability of DGEMM when many threads access remote NUMA domains. This poses a challenge for developing high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to exploit two levels of parallelism, between and within nodes, in a purely threaded implementation, which preserves the task independence and data locality of the NUMA nodes. We implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng 920 architecture. The results show that NUMA-aware DGEMM effectively reduces cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with the most remarkable improvement being 21.9%.
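The two-level partitioning can be sketched as follows (a sketch of the decomposition logic only, with names of our own choosing; a real implementation would additionally pin threads and allocate each panel in its node's local memory):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def numa_aware_dgemm(A, B, num_nodes=2, chunks_per_node=4):
    """Sketch of two-level parallel DGEMM.

    Level 1: the row panels of A (and the corresponding rows of C) are
    partitioned across NUMA nodes, so each node's threads touch only a
    node-local panel.  Level 2: within a node, the panel is subdivided
    into chunks for the node's own threads (shown here as a simple loop).
    """
    m = A.shape[0]
    C = np.empty((m, B.shape[1]))
    node_panels = np.array_split(np.arange(m), num_nodes)  # level 1: across nodes

    def node_task(rows):
        # level 2: within a node, process the panel chunk by chunk
        for chunk in np.array_split(rows, chunks_per_node):
            C[chunk] = A[chunk] @ B
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        list(pool.map(node_task, node_panels))
    return C
```

The point of the split is that, once a panel is placed in a node's local memory, all of that node's accesses to A and C stay on-node; only B is shared read-only across nodes.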


Author(s):  
P. Raghu ◽  
K. Sriram

Grid computing is a special type of parallel computing that allows us to unite pools of servers, storage systems, and networks into a single large virtual supercomputer. Grid computing has the advantages of solving complex problems in a shorter time and making better use of existing hardware: it can exploit underutilized resources to meet business requirements while minimizing additional costs. Many grid setup tools are available; in this paper, the Globus Toolkit, an open-source tool for grid-enabled applications, is considered. Initially, a grid is established between two systems running Linux using the Globus Toolkit. A simple matrix multiplication program, capable of running both on the grid and on stand-alone systems, is developed. The application is executed on a single system while varying the order of the matrices; the same application is then split into two sub-jobs and run on two grid machines with different orders. Finally, the results of the executions are compared and presented in graphs. The work can be extended further to determine the type of parallelization best suited to the application developed. Similarly, the FP-tree algorithm is taken and its data sets are fed both to different grid machines and to a stand-alone system, and a suitable load-balancing mechanism for grid applications is discussed. The sections of the paper are arranged as follows: introduction to grids, grid setup using the Globus Toolkit, splitting of the matrix application, the FP-tree algorithm, performance results, future work, conclusion, and references.
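The row-wise split into independent sub-jobs can be sketched as below (a sketch of the decomposition only, under our own naming; job submission to the grid machines via the Globus Toolkit is omitted):

```python
import numpy as np

def split_into_subjobs(A, B, n_jobs=2):
    """Split C = A @ B into independent sub-jobs for grid execution.

    Each sub-job receives one horizontal panel of A together with all
    of B, computes its panel of C independently (e.g. on a separate
    grid machine), and the partial results are stacked to form C.
    """
    panels = np.array_split(A, n_jobs, axis=0)
    partial_results = [panel @ B for panel in panels]  # one job per machine
    return np.vstack(partial_results)
```

Because the panels share no data except the read-only B, the sub-jobs need no communication until the final gather, which is what makes this decomposition a natural fit for a loosely coupled grid.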


Author(s):  
Michael O Lam ◽  
Jeffrey K Hollingsworth

Floating-point computation is ubiquitous in high-performance scientific computing, but rounding error can compromise the results of extended calculations, especially at large scales. In this paper, we present new techniques that use binary instrumentation and modification to do fine-grained floating-point precision analysis, simulating any level of precision less than or equal to the precision of the original program. These techniques have an average of 40–70% lower overhead and provide more fine-grained insights into a program’s sensitivity than previous mixed-precision analyses. We also present a novel histogram-based visualization of a program’s floating-point precision sensitivity, as well as an incremental search technique that allows developers to incrementally trade off analysis time for detail, including the ability to restart analyses from where they left off. We present results from several case studies and experiments that show the efficacy of these techniques. Using our tool and its novel visualization, application developers can more quickly determine for specific data sets whether their application could be run using fewer double precision variables, saving both time and memory space.
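The core simulation trick, replacing a double with a value that carries fewer significand bits, can be sketched in a few lines (an illustration of the idea only; the tool described operates on compiled binaries via instrumentation, not on Python floats, and this helper is our own):

```python
import struct

def truncate_precision(x, bits):
    """Simulate a binary64 value restricted to `bits` significand bits
    (0 <= bits <= 52) by zeroing the low-order mantissa bits in place.

    Reinterpret the double as its 64-bit pattern, mask off the discarded
    mantissa bits, and reinterpret back.
    """
    (pattern,) = struct.unpack('<Q', struct.pack('<d', x))
    mask = ~((1 << (52 - bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    (y,) = struct.unpack('<d', struct.pack('<Q', pattern & mask))
    return y
```

Running a computation with every store filtered through such a truncation, at progressively fewer bits, reveals how much precision each variable actually needs, which is the kind of fine-grained sensitivity question the analysis answers.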


2021 ◽  
Vol 7 ◽  
pp. e330
Author(s):  
Massimiliano Fasi ◽  
Nicholas J. Higham ◽  
Mantas Mikaitis ◽  
Srikara Pranesh

We explore the floating-point arithmetic implemented in the NVIDIA tensor cores, which are hardware accelerators for mixed-precision matrix multiplication available on the Volta, Turing, and Ampere microarchitectures. Using Volta V100, Turing T4, and Ampere A100 graphics cards, we determine what precision is used for the intermediate results, whether subnormal numbers are supported, what rounding mode is used, in which order the operations underlying the matrix multiplication are performed, and whether partial sums are normalized. These aspects are not documented by NVIDIA, and we gain insight by running carefully designed numerical experiments on these hardware units. Knowing the answers to these questions is important if one wishes to: (1) accurately simulate NVIDIA tensor cores on conventional hardware; (2) understand the differences between results produced by code that utilizes tensor cores and code that uses only IEEE 754-compliant arithmetic operations; and (3) build custom hardware whose behavior matches that of NVIDIA tensor cores. As part of this work we provide a test suite that can be easily adapted to test newer versions of the NVIDIA tensor cores as well as similar accelerators from other vendors, as they become available. Moreover, we identify a non-monotonicity issue affecting floating point multi-operand adders if the intermediate results are not normalized after each step.
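The style of probe involved can be illustrated on conventional hardware: choose inputs whose dot product differs depending on whether partial sums are held in binary16 or binary32, and observe which answer comes back. The sketch below uses NumPy on a CPU to show the method; it is not the paper's test suite, and the function name is our own.

```python
import numpy as np

def probe_accumulator_precision():
    """Compare a dot product accumulated entirely in binary16 with one
    accumulated in binary32.

    2**-11 is below half the binary16 ulp of 1.0 (which is 2**-10), so
    adding it to 1.0 in binary16 rounds back to 1.0 (ties-to-even),
    while a binary32 accumulator retains it.  Feeding such inputs to a
    black-box unit reveals the precision of its internal accumulation.
    """
    a = np.array([1.0, 2**-11], dtype=np.float16)
    b = np.array([1.0, 1.0], dtype=np.float16)
    sum16 = np.float16(np.float16(a[0] * b[0]) + np.float16(a[1] * b[1]))
    sum32 = np.float32(a[0]) * np.float32(b[0]) + np.float32(a[1]) * np.float32(b[1])
    return sum16, sum32
```

A hardware unit returning the first value accumulates in binary16; one returning the second keeps binary32 partial sums. Analogous constructions distinguish rounding modes, subnormal handling, and summation order.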


1969 ◽  
Vol 10 (1) ◽  
pp. 60-65 ◽  
Author(s):  
J. H. Bevis ◽  
C. K. Martin

In this paper we consider mappings induced by matrix multiplication which are defined on lattices of matrices whose coordinates come from a fixed orthomodular lattice L (i.e. a lattice with an orthocomplementation, denoted by ′, in which a ≦ b ⇒ a ∨ (a′ ∧ b) = b). The set of all m × n matrices over L, with partial order and lattice operations defined coordinatewise, is itself an orthomodular lattice. For conformal matrices A and B, the (i, j)th coordinate of the matrix product AB is defined to be (AB)ij = ∨k (Aik ∧ Bkj). We assume familiarity with the notation and results of [1]; in particular, the (lattice) centre of the matrix lattice consists of those matrices A that commute with every matrix B, written A C B, the commutativity relation being that of [1]. In § 1 it is shown that the mappings characterized by right multiplication X → XP are residuated if and only if P belongs to the centre. (Similarly for left multiplication.) This result is used to show the existence of residuated pairs. Hence, in § 2 we are able to extend a result of Blyth [3] which relates invertible and cancellable matrices (see Theorem 3 and its corollaries). Finally, for right (left) multiplication mappings, characterizations are given in § 3 for closure operators, quantifiers, range closed mappings, and Sasaki projections.

