Energy and performance tradeoffs for matrix multiplication on multicore machines

SYSTEMC IMPLEMENTATION AND PERFORMANCE EVALUATION OF A DECOUPLED GENERAL-PURPOSE MATRIX PROCESSOR

Parallel Processing Letters ◽

10.1142/s0129626410000090 ◽

2010 ◽

Vol 20 (02) ◽

pp. 103-121 ◽

Cited By ~ 1

Author(s):

MOSTAFA I. SOLIMAN ◽

ABDULMAJID F. Al-JUNAID

Keyword(s):

Performance Evaluation ◽

Matrix Multiplication ◽

General Purpose ◽

System Level ◽

Memory Latency ◽

Single Chip ◽

Wide Range ◽

Matrix Unit ◽

And Performance ◽

Vector Matrix

Technological advances in IC manufacturing make it possible to integrate more and more functionality into a single chip. Today's processors have nearly one billion transistors on a single chip. With the increasing complexity of today's systems, designs have to be modeled at a high level of abstraction before being partitioned into hardware and software components for final implementation. This paper explains in detail the implementation and performance evaluation of a matrix processor called Mat-Core with SystemC (a system-level modeling language). Mat-Core is a research processor that aims to exploit the increasing number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit. To hide memory latency, the extended matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. Like vector architectures, the data computation unit is organized in parallel lanes. However, on parallel lanes, Mat-Core can execute matrix-scalar, matrix-vector, and matrix-matrix instructions in addition to vector-scalar and vector-vector instructions. To control the execution of vector/matrix instructions on the matrix core, this paper extends the well-known scoreboard technique. Furthermore, the performance of Mat-Core is evaluated on vector and matrix kernels. Our results show that a four-lane Mat-Core with matrix registers of 4 × 4 (16) elements each, a queue size of 10, a start-up time of 6 clock cycles, and a memory latency of 10 clock cycles achieves about 0.94, 1.3, 2.3, 1.6, 2.3, and 5.5 FLOPs per clock cycle on scalar-vector multiplication, SAXPY, Givens, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication, respectively.
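To put the reported FLOPs-per-cycle figures in context, the following minimal sketch tabulates conventional floating-point operation counts for the six kernels and turns a reported rate into a rough cycle estimate. The operation counts are textbook conventions assumed here and may differ from the counts the paper uses. The names `FLOP_COUNTS`, `REPORTED_RATE`, and `estimated_cycles` are hypothetical, and applying a sustained rate to an arbitrary problem size is only an approximation.

```python
# Conventional FLOP counts per kernel (assumed textbook conventions, not
# figures from the paper). n is the vector length or the matrix dimension.
FLOP_COUNTS = {
    "scalar-vector multiplication": lambda n: n,           # one multiply per element
    "SAXPY": lambda n: 2 * n,                              # y = a*x + y
    "Givens": lambda n: 6 * n,                             # rotate two length-n vectors
    "rank-1 update": lambda n: 2 * n * n,                  # A = A + x*y^T
    "vector-matrix multiplication": lambda n: 2 * n * n,   # n dot products of length n
    "matrix-matrix multiplication": lambda n: 2 * n ** 3,  # n*n dot products of length n
}

# FLOPs per clock cycle quoted in the abstract for the four-lane configuration.
REPORTED_RATE = {
    "scalar-vector multiplication": 0.94,
    "SAXPY": 1.3,
    "Givens": 2.3,
    "rank-1 update": 1.6,
    "vector-matrix multiplication": 2.3,
    "matrix-matrix multiplication": 5.5,
}


def estimated_cycles(kernel, n):
    """Rough cycle estimate: assumed FLOP count divided by the reported rate."""
    return FLOP_COUNTS[kernel](n) / REPORTED_RATE[kernel]


if __name__ == "__main__":
    for name in FLOP_COUNTS:
        print(f"{name}: ~{estimated_cycles(name, 64):.0f} cycles for n = 64")
```

The table makes one pattern in the results easy to see: the kernels with more arithmetic per loaded element, matrix-matrix multiplication above all, are the ones that reach the highest rates.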

Mitigating State-Drift in Memristor Crossbar Arrays for Vector Matrix Multiplication

10.5772/intechopen.100246 ◽

2021 ◽

Author(s):

Amirali Amirsoleimani ◽

Tony Liu ◽

Fabien Alibart ◽

Serge Eccofey ◽

Yao-Feng Chang ◽

...

Keyword(s):

Matrix Multiplication ◽

Optimization Techniques ◽

Performance Improvements ◽

Network Applications ◽

Network Layers ◽

Adaptive Inference ◽

Computing Platforms ◽

And Performance ◽

Memristor Crossbar ◽

Vector Matrix

In this chapter, we review recent progress on resistance drift mitigation techniques for resistive switching memory devices (specifically memristors) and their impact on accuracy in deep neural network applications. In the first section of the chapter, we investigate the importance of soft errors and their detrimental impact on the performance of memristor-based vector–matrix multiplication (VMM) platforms, especially the memristance state-drift induced by long-term recurring inference operations with sub-threshold stress voltage. We also briefly review some currently developed state-drift mitigation methods. In the next section of the chapter, we discuss an adaptive inference technique with low hardware overhead that mitigates memristance drift in memristive VMM platforms by using optimization techniques to adjust the inference voltage characteristic associated with different network layers. We also present simulation results and the performance improvements achieved by applying the proposed inference technique, taking non-idealities into account, for various deep network applications on memristor crossbar arrays. This chapter suggests that a simple low-overhead inference technique can revive the functionality and enhance the performance of memristor-based VMM arrays and significantly increase their lifetime, which can be a very important factor in making this technology a mainstream player in future in-memory computing platforms.

Study on Dense Matrix Multiplication Algorithms and Performance Evaluation of HPCC in 81 Nodes IBM Power 8 Architecture

10.9734/bpi/ramrcs/v5/14371d ◽

2021 ◽

pp. 105-125

Author(s):

Eduardo Patricio Estévez Ruiz ◽

Giovanny Eduardo Caluña Chicaiza ◽

Fabian Rodolfo Jiménez Patiño ◽

Joaquín Cayetano López Lago ◽

Saravana Prakash Thirumuruganandham

Keyword(s):

Performance Evaluation ◽

Matrix Multiplication ◽

Dense Matrix ◽

And Performance

Dealing with performance/portability and performance/accuracy trade-offs in heterogeneous computing systems: a case study with matrix multiplication modulo primes

10.1117/12.919323 ◽

2012 ◽

Cited By ~ 1

Author(s):

Matthew Wezowicz ◽

B. David Saunder ◽

Michela Taufer

Keyword(s):

Heterogeneous Computing ◽

Matrix Multiplication ◽

Computing Systems ◽

Performance Portability ◽

Performance Accuracy ◽

Trade Offs ◽

And Performance ◽

Heterogeneous Computing Systems

Performance Evaluation of Compiler Optimizations in FPGA Accelerators

10.5753/wscad.2019.8681 ◽

2019 ◽

Author(s):

Gustavo Leite ◽

Alexandro Baldassin ◽

Guido Araujo ◽

José Nelson Amaral

Keyword(s):

Data Transfer ◽

Matrix Multiplication ◽

Compiler Optimizations ◽

Performance Effect ◽

Code Transformations ◽

Performance Engineering ◽

Design Engineers ◽

Heterogeneous Architectures ◽

Comparable Performance ◽

And Performance

With the increasing power wall in microprocessor design, engineers shifted their attention to heterogeneous architectures, wherein several classes of devices are used for computation. Among them are FPGAs which offer comparable performance to CPUs while consuming only a fraction of energy. Despite the increasing interest in these devices, programmability and performance engineering in FPGAs remain hard. This work presents an evaluation of the most prominent code transformations targeting FPGAs. More specifically, it studies the performance effect of unrolling loops, replicating compute units and transferring data using DMA in a matrix multiplication OpenCL kernel through an Intel® FPGA. The results indicate that these optimizations can achieve speedups up to 3.78× for a matrix multiplication application, and 412.5× speedup in data transfer.
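Of the transformations named above, loop unrolling is the easiest to show in isolation. The sketch below is a plain Python illustration of unrolling the inner dot-product loop of a matrix multiplication by a factor of four. It is not the paper's OpenCL kernel, the function names are hypothetical, and the unroll factor of four is an arbitrary choice. In Python the transformation brings no hardware benefit. It only shows the shape of the rewritten loop, including the remainder loop needed when the inner dimension is not a multiple of the unroll factor.

```python
def matmul_rolled(a, b):
    """Reference matrix multiplication with a plain inner loop."""
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def matmul_unrolled4(a, b):
    """Same computation with the inner loop unrolled by a factor of four."""
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        row = a[i]
        for j in range(m):
            acc = 0.0
            p = 0
            # Main unrolled body: four multiply-adds per iteration.
            while p + 4 <= k:
                acc += (row[p] * b[p][j]
                        + row[p + 1] * b[p + 1][j]
                        + row[p + 2] * b[p + 2][j]
                        + row[p + 3] * b[p + 3][j])
                p += 4
            # Remainder loop for inner dimensions that are not multiples of four.
            while p < k:
                acc += row[p] * b[p][j]
                p += 1
            c[i][j] = acc
    return c
```

Because the unrolled version adds the products in a different order, floating-point results can differ from the rolled version in the last bits. With small integer-valued inputs the two agree exactly.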

IMPLEMENTATION OF MIDDLE PRODUCT ALGORITHM ON LINEAR PROCESSOR ARRAYS

Journal of Circuits System and Computers ◽

10.1142/s0218126614500807 ◽

2014 ◽

Vol 23 (06) ◽

pp. 1450080

Author(s):

E. I. MILOVANOVIĆ ◽

I. Ž. MILOVANOVIĆ ◽

M. K. STOJČEV

Keyword(s):

Execution Time ◽

Graphics Processing Units ◽

Performance Metrics ◽

Matrix Multiplication ◽

Software Tool ◽

Gain Factor ◽

Processor Array ◽

And Performance ◽

A Chain ◽

Graphics Processing

This paper presents the design, implementation and performance evaluation of a linear processor array accelerator for matrix multiplication, which we call the matrix multiplication processor (MMP). The MMP is composed of n processing elements (PEs) connected in a chain, distributed memory, and a dedicated address generator unit (AGU) that generates memory addresses. With this approach, address generation does not increase the processing time. The AGU is one major difference between the proposed architecture and graphics processing units (GPUs), which use ALUs to create addresses. The MMP is based on FPGA technology, since these circuits offer an extreme degree of parallelism and the ability to customize the RAM and data path architecture to the computation. We consider the performance of the proposed architecture in terms of the number of PEs, execution time, speedup, efficiency and gain factor. We have implemented the AGU and PE in Xilinx Spartan 2E FPGAs using ISE 9.01 as a software tool. We compare our design with other solutions proposed in the literature with respect to execution time, number of PEs, AT measure, speedup and efficiency.

Dense Matrix Multiplication Algorithms and Performance Evaluation of HPCC in 81 Nodes IBM Power 8 Architecture

Computation ◽

10.3390/computation9080086 ◽

2021 ◽

Vol 9 (8) ◽

pp. 86

Author(s):

Eduardo Patricio Estévez Ruiz ◽

Giovanny Eduardo Caluña Chicaiza ◽

Fabian Rodolfo Jiménez Patiño ◽

Joaquín Cayetano López Lago ◽

Saravana Prakash Thirumuruganandham

Keyword(s):

Performance Evaluation ◽

System Performance ◽

High Performance ◽

Matrix Multiplication ◽

Dense Matrix ◽

Current Configuration ◽

Performance Factors ◽

Reasonable Cost ◽

And Performance ◽

Performance Computing

Optimizing HPC systems based on performance factors and bottlenecks is essential for designing an HPC infrastructure with the best characteristics at a reasonable cost. Such insight can only be achieved through a detailed analysis of existing HPC systems and the execution of their workloads. “Quinde I” is Ecuador's only and most powerful supercomputer and is currently ranked third in South America. It was built with IBM Power 8 servers. In this work, we measured its performance using different parameters from High-Performance Computing (HPC) to compare it with theoretical values and values obtained from tests on similar models. To measure its performance, we compiled and ran different benchmarks with the specific optimization flags for Power 8 to get the maximum performance with the current configuration of the hardware installed by the vendor. The inputs of the benchmarks were varied to analyze their impact on system performance. In addition, we compiled and compared the performance of two algorithms for dense matrix multiplication, SRUMMA and DGEMM.

Transmission Scanning Electron Microscopy and Energy Analysis with the Siemens ELMISKOP 101

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100095509 ◽

1972 ◽

Vol 30 ◽

pp. 450-451

Author(s):

H. M. Thieringer

Keyword(s):

Objective Lens ◽

Transmission Microscopy ◽

Image Recording ◽

Mode Of Operation ◽

Scanning Transmission ◽

Electron Microscopes ◽

And Performance ◽

Convergent Beam ◽

Ray Path ◽

Beam Diffraction

It has repeatedly been shown that very fine electron probes can be produced with conventional electron microscopes, allowing various micro-techniques such as micro recording, X-ray microanalysis and convergent beam diffraction. In this paper the function and performance of a SIEMENS ELMISKOP 101 used as a scanning transmission microscope (STEM) are described. This mode of operation has some advantages over conventional transmission microscopy (CTEM), especially for the observation of thick specimens, in spite of somewhat longer image recording times. Fig. 1 shows schematically the ray path and the additional electronics of an ELMISKOP 101 working as a STEM. With a point cathode, and using condenser I and the objective lens as a demagnifying system, an electron probe with a half-width of about 25 Å and a typical current of 5 × 10⁻¹¹ A at 100 kV can be obtained in the back focal plane of the objective lens.

Angular Resolved Electron Spectroscopy with Parallel Recording

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100135459 ◽

1990 ◽

Vol 48 (2) ◽

pp. 370-371

Author(s):

Huang Min ◽

P.S. Flora ◽

C.J. Harland ◽

J.A. Venables

Keyword(s):

Electron Spectroscopy ◽

Low Voltage ◽

Solid Angle ◽

Detection System ◽

Voltage Electron ◽

Cylindrical Mirror ◽

Inner Radius ◽

Channel Plate ◽

And Performance ◽

Type Detector

A cylindrical mirror analyser (CMA) has been built with a parallel recording detection system. It is being used for angular resolved electron spectroscopy (ARES) within a SEM. The CMA has been optimised for imaging applications; the inner cylinder contains a magnetically focused and scanned, 30 kV, SEM electron-optical column. The CMA has a large inner radius (50.8 mm) and a large collection solid angle (Ω > 1 sterad). An energy resolution (ΔE/E) of 1-2% has been achieved. The design and performance of the combined SEM/CMA instrument have been described previously, and the CMA and detector system have been used for low voltage electron spectroscopy. Here we discuss the use of the CMA for ARES and present some preliminary results. The CMA has been designed for an axis-to-ring focus and uses an annular type detector. This detector consists of a channel-plate/YAG/mirror assembly which is optically coupled to either a photomultiplier for spectroscopy or a TV camera for parallel detection.

Novel epoxy/anhydride alternatives for biological electron microscopy: Physical and performance characteristics of embed 812 and LX 112 in combination with NSA/NMA/DMAE

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100156985 ◽

1989 ◽

Vol 47 ◽

pp. 1000-1001

Author(s):

Joe A. Mascorro ◽

Gerald S. Kirby

Keyword(s):

Electron Microscopy ◽

Flow Rate ◽

Succinic Anhydride ◽

Volume Flow Rate ◽

Mass Density ◽

Uranyl Acetate ◽

Volume Flow ◽

Base Resin ◽

And Performance ◽

Acid Anhydrides

Embedding media based upon an epoxy resin of choice and the acid anhydrides dodecenyl succinic anhydride (DDSA) and nadic methyl anhydride (NMA), catalyzed by the tertiary amine 2,4,6-Tri(dimethylaminomethyl) phenol (DMP-30), are widely used in biological electron microscopy. These media possess a viscosity that can impair tissue infiltration, particularly if original Epon 812 is used as the base resin. Other resins that are considerably less viscous than Epon 812 are now available as replacements. Likewise, nonenyl succinic anhydride (NSA) and dimethylaminoethanol (DMAE) are more fluid than their counterparts DDSA and DMP-30 commonly used in earlier formulations. This work uses novel epoxy and anhydride combinations to produce embedding media with desirable flow rate and viscosity parameters that, in turn, allow the medium to optimally infiltrate tissues. Specifically, embedding media based on EmBed 812 or LX 112 with NSA (in place of DDSA) and DMAE (replacing DMP-30), with NMA remaining constant, are formulated and offered as alternatives for routine biological work. Individual epoxy resins (Table I) or complete embedding media (Tables II-III) were tested for flow rate and viscosity. The novel media were further examined for their ability to infiltrate tissues and polymerize, for their sectioning and staining character, and for their strength and stability under the electron beam and column vacuum. For physical comparisons, a volume (9 ml) of either resin or medium was aspirated into a vertically oriented capillary viscometer. The material was then allowed to flow out freely under the influence of gravity, and the flow time necessary for the volume to exit was recorded (Col B,C; Tables). In addition, the volume flow rate (ml flowing/second; Col D, Tables) was measured. Viscosity (η) could then be determined by using the Hagen-Poiseuille relation for laminar flow, η = c·ρ/Q, where c = a geometric constant from an instrument calibration with water, ρ = mass density, and Q = volume flow rate. Mass weight and density of the materials were determined as well (Col F,G; Tables). Infiltration schedules utilized were short (1/2 hr 1:1, 3 hrs full resin), intermediate (1/2 hr 1:1, 6 hrs full resin), or long (1/2 hr 1:1, 6 hrs full resin) in total time. Polymerization schedules ranging from 15 hrs (overnight) through 24, 36, or 48 hrs were tested. Sections demonstrating gold interference colors were collected on unsupported 200-300 mesh grids and stained sequentially with uranyl acetate and lead citrate.
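The viscosity relation quoted above (viscosity equals a calibration constant times mass density divided by volume flow rate) can be worked through as a short calculation. The sketch below is an illustration under stated assumptions, not the authors' procedure. The function names are hypothetical, the water properties are approximate values at about 20 °C, and the flow times and resin density are invented purely for the example.

```python
def volume_flow_rate(volume_ml, flow_time_s):
    """Volume flow rate Q in ml/s from an aspirated volume and its exit time."""
    if flow_time_s <= 0:
        raise ValueError("flow time must be positive")
    return volume_ml / flow_time_s


def calibrate_constant(eta_ref, rho_ref, q_ref):
    """Solve the relation eta = c * rho / Q for c using a reference liquid."""
    if rho_ref <= 0:
        raise ValueError("density must be positive")
    return eta_ref * q_ref / rho_ref


def viscosity(c, rho, q):
    """Viscosity from the calibration constant c, mass density rho, flow rate Q."""
    if q <= 0:
        raise ValueError("flow rate must be positive")
    return c * rho / q


if __name__ == "__main__":
    # Approximate properties of water near 20 degrees C (assumed values).
    eta_water = 1.002   # mPa*s
    rho_water = 0.998   # g/ml

    # Invented measurements: 9 ml of water exits in 30 s, 9 ml of resin in 3000 s.
    q_water = volume_flow_rate(9.0, 30.0)
    q_resin = volume_flow_rate(9.0, 3000.0)

    c = calibrate_constant(eta_water, rho_water, q_water)
    eta_resin = viscosity(c, rho=1.2, q=q_resin)  # resin density is illustrative
    print(f"estimated resin viscosity: {eta_resin:.1f} mPa*s")
```

Calibrating with water fixes the geometric constant for one instrument, so later measurements need only the density and flow rate of each resin or medium.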