A solution for automatic parallelization of sequential assembly code

2013 ◽  
Vol 10 (1) ◽  
pp. 91-101 ◽  
Author(s):  
Djordje Kovacevic ◽  
Mladen Stanojevic ◽  
Vladimir Marinkovic ◽  
Miroslav Popovic

Since modern multicore processors can execute existing sequential programs only on a single core, there is a strong need for automatic parallelization of program code. Building on existing algorithms, this paper describes a new software tool for the parallelization of sequential assembly code. The goal is a parallelizer that reads sequential assembly code and outputs parallelized code for a multicore MIPS processor. The idea is as follows: a parser translates the input assembly file into program objects suitable for further processing, after which the code is converted to static single assignment (SSA) form. Based on the resulting data-flow graph, the parallelization algorithm distributes instructions across cores. Once the sequential code has been parallelized, registers are assigned by a linear-scan allocation algorithm, and the final output is distributed assembly code for each core. We evaluate the speed-up on a matrix multiplication example processed by the parallelizer. The result is an almost linear speed-up that increases with the number of cores: 1.99 on two cores and 13.88 on 16 cores.
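The pipeline described above (SSA form, data-flow graph, instruction separation across cores) can be illustrated with a toy greedy list-scheduler. This is a hypothetical sketch, not the paper's MIPS parallelizer: the instruction format, unit latencies, and tie-breaking rule are all assumptions made for the example.

```python
# Hypothetical sketch of data-flow-driven partitioning of straight-line,
# SSA-form instructions onto cores (not the paper's actual algorithm).

def list_schedule(instrs, num_cores):
    """instrs: list of (dest, srcs) tuples in SSA form.
    Returns a per-core list of instruction indices."""
    # Build data-flow edges: an instruction depends on the producer
    # of each of its source values.
    producer = {}
    deps = [set() for _ in instrs]
    for i, (dest, srcs) in enumerate(instrs):
        for s in srcs:
            if s in producer:
                deps[i].add(producer[s])
        producer[dest] = i

    cores = [[] for _ in range(num_cores)]
    finish = [0] * len(instrs)       # cycle in which instruction i completes
    core_free = [0] * num_cores      # next free cycle per core
    for i in range(len(instrs)):     # indices are already a topological order
        ready = max((finish[d] for d in deps[i]), default=0)
        # Pick the core that can start this instruction earliest.
        c = min(range(num_cores), key=lambda k: max(core_free[k], ready))
        start = max(core_free[c], ready)
        finish[i] = start + 1        # unit latency assumed
        core_free[c] = finish[i]
        cores[c].append(i)
    return cores

# Two independent dependency chains end up on different cores:
prog = [("t0", []), ("t1", ["t0"]), ("t2", []), ("t3", ["t2"])]
print(list_schedule(prog, 2))  # → [[0, 1], [2, 3]]
```

Independent chains run in parallel, which is why examples like matrix multiplication, with many independent dot products, scale almost linearly.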

Aviation ◽  
2014 ◽  
Vol 18 (2) ◽  
pp. 80-85 ◽  
Author(s):  
Volodymyr Kharchenko ◽  
Maryna Mukhina

The peculiarities of correlation-extreme visual navigation are considered. Feature-point descriptors with 64 elements are extracted from surface images using the speeded-up robust features (SURF) method. Possible correlation criterion functions are analysed to find the best match between the template descriptors and those of the current image. A normalized correlation function based on the matrix-multiplication properties of the descriptors is proposed; it minimizes the number of false matches compared with the Euclidean distance in descriptor space. The proposed matching strategy also substantially decreases the computation time.
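The matching step can be sketched as follows, assuming row-wise 64-element descriptors: normalizing each descriptor once lets a single matrix multiplication score every template/current pair, which is the property the abstract exploits. The exact criterion function in the paper may differ.

```python
import numpy as np

# Sketch (not the paper's exact formulation): normalized correlation
# between every template descriptor and every current-image descriptor,
# computed as one matrix product. Rows are 64-element SURF-style vectors.

def normalized_correlation(T, C):
    """T: (m, 64) template descriptors, C: (n, 64) current descriptors.
    Returns an (m, n) matrix of correlation scores in [-1, 1]."""
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    return Tn @ Cn.T   # one matrix multiplication scores all pairs

rng = np.random.default_rng(0)
T = rng.normal(size=(5, 64))
# Current image: first descriptor is a scaled copy of template 2.
C = np.vstack([T[2] * 3.0, rng.normal(size=(3, 64))])
scores = normalized_correlation(T, C)
print(scores.argmax(axis=1)[2])  # → 0 (template 2 matches current 0)
```

Because the score is scale-invariant, the scaled copy still correlates perfectly, whereas its Euclidean distance to the template would be large.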


2018 ◽  
Vol 18 (13&14) ◽  
pp. 1095-1114
Author(s):  
Zongyuan Zhang ◽  
Zhijin Guan ◽  
Hong Zhang ◽  
Haiying Ma ◽  
Weiping Ding

In order to realize the linear nearest neighbor (LNN) architecture of quantum circuits and reduce the quantum cost of linear reversible quantum circuits, a method for synthesizing and optimizing linear reversible quantum circuits based on matrix multiplication over the structure of the circuit is proposed. The method obtains the matrix representation of a linear quantum circuit by multiplying the matrices of the different parts of the whole circuit. LNN realization by inserting SWAP gates is proposed, and the equivalence of two ways of inserting the SWAP gates is proved. Elimination rules for SWAP gates between two overlapping adjacent quantum gates in different cases are given, which reduce the quantum cost of circuits after the LNN architecture is realized. To reduce the time consumption for large-scale quantum circuits, an algorithm based on parallel processing is proposed. Experiments show that the quantum cost is improved by 34.31% on average, and the GPU-based algorithm achieves a speed-up of 4 times over the CPU-based algorithm. The average time-optimization ratio for the large-scale benchmark circuits in RevLib processed by the parallel algorithm is 95.57% compared with the serial algorithm.
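As background, the basic SWAP-insertion idea can be sketched for a single CNOT acting on non-adjacent lines. The gate tuples and the move-then-restore strategy here are illustrative assumptions, not the paper's synthesis algorithm; the paper's elimination rules would then cancel redundant SWAPs between neighbouring gates.

```python
# Hedged sketch: make one CNOT nearest-neighbour by inserting a chain
# of SWAP gates before it and undoing the chain afterwards.

def make_lnn(control, target):
    """Return a gate list in which the CNOT acts on adjacent lines."""
    gates = []
    c = control
    step = 1 if target > c else -1
    # Move the control next to the target one SWAP at a time.
    while abs(target - c) > 1:
        gates.append(("SWAP", c, c + step))
        c += step
    gates.append(("CNOT", c, target))
    # Restore the original line ordering with the inverse SWAP chain.
    while c != control:
        gates.append(("SWAP", c, c - step))
        c -= step
    return gates

print(make_lnn(0, 3))
# → [('SWAP', 0, 1), ('SWAP', 1, 2), ('CNOT', 2, 3),
#    ('SWAP', 2, 1), ('SWAP', 1, 0)]
```

Each SWAP costs three CNOTs, so removing SWAPs via the elimination rules directly lowers the quantum cost, which is what the reported 34.31% average improvement measures.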


Author(s):  
James Howe ◽  
Marco Martinoli ◽  
Elisabeth Oswald ◽  
Francesco Regazzoni

Abstract FrodoKEM is a lattice-based key encapsulation mechanism, currently a semi-finalist in NIST’s post-quantum standardisation effort. A condition for these candidates is to use NIST standards for sources of randomness (i.e. seed expanding), and as such most candidates utilise SHAKE, an XOF defined in the SHA-3 standard. However, for many of the candidates, this module is a significant implementation bottleneck. Trivium is a lightweight, ISO-standard stream cipher which performs well in hardware and has been used in previous hardware designs for lattice-based cryptography. This research proposes optimised designs for FrodoKEM, concentrating on high throughput by parallelising the matrix multiplication operations within the cryptographic scheme. This process is eased by the use of Trivium, due to its higher throughput and lower area consumption. The proposed parallelisations also complement the addition of first-order masking to the decapsulation module. Overall, we significantly increase the throughput of FrodoKEM: for encapsulation we see a 16× speed-up, achieving 825 operations per second, and for decapsulation a 14× speed-up, achieving 763 operations per second, compared with the previous state of the art, whilst maintaining a similar FPGA area footprint of less than 2000 slices.


2016 ◽  
Vol 19 (3) ◽  
pp. 1037-1051 ◽  
Author(s):  
Sandra Catalán ◽  
Francisco D. Igual ◽  
Rafael Mayo ◽  
Rafael Rodríguez-Sánchez ◽  
Enrique S. Quintana-Ortí

2018 ◽  
Vol 12 (3) ◽  
pp. 143-157 ◽  
Author(s):  
Håvard Raddum ◽  
Pavol Zajac

Abstract We show how to build a binary matrix from the MRHS representation of a symmetric-key cipher. The matrix contains the cipher represented as an equation system and can be used to assess a cipher’s resistance against algebraic attacks. We give an algorithm for solving the system and compute its complexity. The complexity is normally close to exhaustive search on the variables representing the user-selected key. Finally, we show that for some variants of LowMC, the joined MRHS matrix representation can be used to speed up regular encryption in addition to exhaustive key search.


Telematika ◽  
2020 ◽  
Vol 17 (1) ◽  
pp. 26
Author(s):  
Afif Irfan Abdurrahman ◽  
Bambang Yuwono ◽  
Yuli Fauziah

A flood is a dangerous disaster in which overflowing water submerges land. Almost every year Bantul Regency is affected by floods caused by high rainfall. These floods are difficult for the Bantul Regency Disaster Management Agency (BPBD) to handle, so a mapping of flood-impact levels is needed to minimize losses and inform the public. This study builds a system to map the level of flood impact in Bantul Regency using a decision-support method, Multi-Attribute Utility Theory (MAUT). The MAUT stages determine the impact level through normalization and matrix multiplication, and the method helps identify flood-affected areas by managing the Indonesian Disaster Information Data (DIBI). The data comprise criteria for fatalities, missing victims, damage to houses, damage to public facilities, and damage to roads; each criterion has a value used to determine the flood-impact level. Determining the impact level requires a weighting calculation, whose result is a score of 1 = low, 2 = moderate, or 3 = high. The normalization and matrix-multiplication process that identifies the affected areas is the application of the MAUT method. The study produced an impact-level mapping displayed on Google Maps, showing affected-area points and flood-impact levels in Bantul Regency; mapping from the 2017 DIBI data identified Imogiri as the most affected sub-district.
Testing shows the results of this study have an accuracy of 95% compared with the mapping previously carried out by BPBD Bantul Regency. The difference arises because the criteria data used here are not identical to those used by BPBD.
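The MAUT steps described above can be illustrated with a small sketch: min-max normalization of each criterion, a weighted sum computed as a matrix multiplication, and a 1/2/3 (low/moderate/high) score. The weights and thresholds below are made up for the example; they are not BPBD's actual values.

```python
import numpy as np

# Illustrative MAUT sketch (hypothetical weights and thresholds).

def maut_scores(X, weights, thresholds=(0.2, 0.6)):
    """X: (areas, criteria) raw DIBI-style counts; weights sum to 1."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1)       # avoid division by zero
    U = (X - lo) / span                        # utility values in [0, 1]
    total = U @ np.asarray(weights)            # matrix-multiplication step
    return 1 + np.digitize(total, thresholds)  # 1=low, 2=moderate, 3=high

# Rows: areas; columns: deaths, missing, houses, facilities, roads.
X = np.array([[0, 0,  2, 0, 1],
              [1, 0, 10, 2, 4],
              [3, 2, 40, 6, 9]])
print(maut_scores(X, [0.3, 0.2, 0.2, 0.15, 0.15]))  # → [1 2 3]
```

The third area dominates every criterion, so it gets the "high" score; a real deployment would calibrate the weights and thresholds against historical BPBD mappings.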


Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1984
Author(s):  
Wei Zhang ◽  
Zihao Jiang ◽  
Zhiguang Chen ◽  
Nong Xiao ◽  
Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based systems-on-chip (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency, so it is worthwhile to design a high-performance DGEMM for ARMv8-based SoCs. However, as these SoCs integrate more and more cores, modern CPUs use non-uniform memory access (NUMA), which restricts the performance and scalability of DGEMM when many threads access remote NUMA domains. This poses a challenge for developing high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory accesses. The critical enabler for NUMA-aware DGEMM is leveraging two levels of parallelism, between and within nodes, in a purely threaded implementation, which allows task independence and data localization for the NUMA nodes. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng 920 architecture. The results show that NUMA-aware DGEMM effectively reduces cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with the most remarkable improvement being 21.9%.
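The two-level partitioning idea can be sketched as follows. The output matrix is first split across NUMA nodes (so each node's threads touch only node-local rows of A and C), then across the threads of each node. Plain sequential Python stands in for the pinned, threaded OpenBLAS kernels; the node and thread counts are illustrative.

```python
import numpy as np

# Illustrative sketch of two-level (node, then thread) partitioning of
# C = A @ B. Real NUMA-aware DGEMM binds threads and allocates each
# panel on its node; here the loops merely show the data decomposition.

def numa_aware_gemm(A, B, nodes=2, threads_per_node=2):
    m = A.shape[0]
    C = np.zeros((m, B.shape[1]))
    node_rows = np.array_split(np.arange(m), nodes)        # level 1: NUMA nodes
    for rows in node_rows:                                 # nodes are independent
        for tr in np.array_split(rows, threads_per_node):  # level 2: threads
            C[tr] = A[tr] @ B                              # node-local row panel
    return C

rng = np.random.default_rng(1)
A, B = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
assert np.allclose(numa_aware_gemm(A, B), A @ B)
```

Because each node's tasks touch disjoint rows of A and C, no write ever crosses a die or socket boundary, which is the effect the paper measures as reduced cross-die and cross-chip accesses.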


2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Jin Wang

The MM-2 semitensor product is a new and very useful mathematical tool which breaks the limitation of traditional matrix multiplication on the dimensions of the matrices and has wide application prospects. This article investigates the solutions of the matrix equation A ∘ₗ X = B with respect to the MM-2 semitensor product. The case where the solutions are vectors is discussed first: compatibility conditions on the matrices and the necessary and sufficient condition for solvability are studied in turn, and concrete methods for solving the equation are provided. The case where the solutions are matrices is then studied in a similar way. Finally, several examples are given to illustrate the efficiency of the results.
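For context, the best-known dimension-free generalization of matrix multiplication is the classical (left) semi-tensor product ⋉, sketched below; the MM-2 product studied in the article is a related but distinct construction, so this is background only.

```python
import numpy as np
from math import lcm

# Background sketch: the classical left semi-tensor product A ⋉ B,
# defined as (A ⊗ I_{t/n})(B ⊗ I_{t/p}) with t = lcm(n, p), which lifts
# the column/row dimension restriction of ordinary multiplication.

def stp(A, B):
    """Left semi-tensor product of A (m x n) and B (p x q)."""
    n, p = A.shape[1], B.shape[0]
    t = lcm(n, p)
    return np.kron(A, np.eye(t // n)) @ np.kron(B, np.eye(t // p))

A = np.arange(1, 5).reshape(2, 2)   # 2x2
B = np.arange(1, 5).reshape(4, 1)   # 4x1: ordinary product A @ B undefined
print(stp(A, B))                    # well-defined, shape (4, 1)
```

When the inner dimensions already agree (n = p), the identity factors are 1x1 and the semi-tensor product reduces to ordinary matrix multiplication, which is the sense in which such products "break the limitation" rather than replace the usual rule.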


Author(s):  
K. Waldherr ◽  
T. Huckle ◽  
T. Auckenthaler ◽  
U. Sander ◽  
T. Schulte-Herbrüggen

Electronics ◽  
2019 ◽  
Vol 8 (2) ◽  
pp. 143 ◽  
Author(s):  
Ruidong Wu ◽  
Bing Liu ◽  
Ping Fu ◽  
Junbao Li ◽  
Shou Feng

Matrix multiplication is a critical, time-consuming processing step in many machine learning applications. Because of the diversity of practical applications, matrix dimensions are generally not fixed. However, most matrix calculation methods based on field-programmable gate arrays (FPGAs) currently use fixed matrix dimensions, which limits the flexibility of machine learning algorithms on an FPGA; the bottleneck lies in the limited FPGA resources. This paper therefore proposes an accelerator architecture for matrix computation with changeable dimensions. A multi-matrix synchronous calculation concept allows matrix data to be processed continuously, which improves the parallel computing characteristics of the FPGA and optimizes computational efficiency. This paper tests matrix multiplication in a support vector machine (SVM) algorithm to verify the performance of the proposed architecture on the ZYNQ platform. The experimental results show that, compared with the software processing method, the proposed architecture increases performance by 21.18 times at 9947 dimensions. The dimension is changeable up to a maximum of 2,097,151 without changing the hardware design. The method is also applicable to matrix multiplication in other machine learning algorithms.

