An Accelerator Design Using an MTCA Decomposition Algorithm for CNNs

Sensors ◽  
2020 ◽  
Vol 20 (19) ◽  
pp. 5558
Author(s):  
Yunping Zhao ◽  
Jianzhuang Lu ◽  
Xiaowen Chen

Due to the high throughput and high computing capability of convolutional neural networks (CNNs), researchers are paying increasing attention to the design of CNN hardware accelerator architectures. Accordingly, in this paper, we propose a block parallel computing algorithm based on the matrix transformation computing algorithm (MTCA) to realize the convolution expansion and resolve the blocking problem of the intermediate matrix, enabling a highly parallel implementation in hardware. Moreover, we provide a specific calculation method for the optimal partition of the matrix multiplication to optimize performance. In our evaluation, the proposed method saves more than 60% of hardware storage space compared with the im2col (image-to-column) approach; for large-scale convolutions, it saves nearly 82%. Under the accelerator architecture framework designed in this paper, we achieve 26.7–33.4 GFLOPS (depending on convolution type) on an FPGA (Field-Programmable Gate Array) by reducing bandwidth and improving data reusability. The design is 1.2×–4.0× faster than memory-efficient convolution (MEC) and im2col, respectively, and represents an effective solution for a large-scale convolution accelerator.
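The MTCA itself is not spelled out in the abstract, but the im2col baseline it is compared against lowers a convolution to a single matrix multiplication, which is what makes the storage comparison meaningful. A minimal NumPy sketch (stride 1, no padding, single channel; all names are illustrative):

```python
import numpy as np

def im2col(x, k):
    """Unfold each k x k patch of a 2-D input into one column (stride 1, no padding)."""
    h, w = x.shape
    oh, ow = h - k + 1, w - k + 1
    cols = np.empty((k * k, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + k, j:j + k].ravel()
    return cols

def conv2d_im2col(x, kernel):
    """Convolution (as correlation) expressed as one matrix multiplication."""
    k = kernel.shape[0]
    cols = im2col(x, k)                      # the storage-hungry intermediate matrix
    out = kernel.ravel() @ cols              # single matmul replaces the sliding window
    return out.reshape(x.shape[0] - k + 1, -1)

x = np.arange(16, dtype=float).reshape(4, 4)
kern = np.ones((3, 3))                       # all-ones kernel: output = patch sums
ref = np.array([[45.0, 54.0], [81.0, 90.0]])
assert np.allclose(conv2d_im2col(x, kern), ref)
```

The `cols` matrix duplicates every input pixel up to k*k times, which is exactly the intermediate-matrix blow-up the paper's blocking scheme targets.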

Electronics ◽  
2019 ◽  
Vol 8 (2) ◽  
pp. 143 ◽  
Author(s):  
Ruidong Wu ◽  
Bing Liu ◽  
Ping Fu ◽  
Junbao Li ◽  
Shou Feng

Matrix multiplication is a critical, time-consuming processing step in many machine learning applications. Because practical applications are diverse, the matrix dimensions are generally not fixed. However, most current matrix calculation methods based on field-programmable gate arrays (FPGAs) use fixed matrix dimensions, which limits the flexibility of machine learning algorithms on an FPGA; the bottleneck lies in the limited FPGA resources. Therefore, this paper proposes an accelerator architecture for a matrix computing method with changeable dimensions. A multi-matrix synchronous calculation concept allows matrix data to be processed continuously, which improves the parallel computing characteristics of the FPGA and optimizes computational efficiency. This paper tests matrix multiplication within the support vector machine (SVM) algorithm to verify the performance of the proposed architecture on the ZYNQ platform. The experimental results show that, compared with a software implementation, the proposed architecture improves performance by a factor of 21.18 at 9947 dimensions. The dimension is changeable up to a maximum of 2,097,151 without modifying the hardware design. The method is also applicable to matrix multiplication in other machine learning algorithms.
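The changeable-dimension idea can be illustrated in software: a fixed-size compute tile (standing in for the FPGA's fixed processing-element array) sweeps over a matrix of arbitrary dimensions. A sketch with illustrative names and an arbitrarily chosen tile size:

```python
import numpy as np

def tiled_matmul(a, b, tile=4):
    """Blocked matrix multiply for arbitrary (changeable) dimensions:
    a fixed tile x tile compute unit covers matrices of any shape,
    the way a fixed hardware array covers a variable-size problem."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must agree"
    c = np.zeros((m, n))
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):   # NumPy slicing clamps ragged edges
                c[i0:i0 + tile, j0:j0 + tile] += (
                    a[i0:i0 + tile, p0:p0 + tile] @ b[p0:p0 + tile, j0:j0 + tile]
                )
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((7, 9))   # dimensions deliberately not tile multiples
b = rng.standard_normal((9, 5))
assert np.allclose(tiled_matmul(a, b), a @ b)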


2017 ◽  
Vol 2017 ◽  
pp. 1-8 ◽  
Author(s):  
Haipeng Peng ◽  
Ye Tian ◽  
Jürgen Kurths

Big data transmission in a wireless sensor network (WSN) consumes energy, yet WSN nodes are energy-limited; moreover, the transmitted data must be encrypted because WSN links are easily eavesdropped. Compressive sensing (CS) can encrypt data and reduce the data volume, addressing both problems. However, WSN nodes are not only energy-limited but also constrained in storage and computational resources. Traditional CS uses the measurement matrix as the secret key, which consumes a large amount of storage, and its computational cost is high. In this paper, semi-tensor product compressive sensing (STP-CS) is proposed, which reduces the size of the secret key to save storage space by breaking the dimension-match restriction of matrix multiplication, and decreases the amount of computation to save calculation resources. Simulation results show that STP-CS encryption saves storage and computational resources compared with traditional CS encryption.
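The "dimension-match restriction" that STP-CS breaks is the usual requirement that the inner dimensions of a matrix product agree. The standard left semi-tensor product, A ⋉ B = (A ⊗ I_{t/n})(B ⊗ I_{t/p}) with t = lcm(n, p) for A of width n and B of height p, can be sketched in a few lines of NumPy (the key and signal sizes below are toy values, not the paper's):

```python
import numpy as np
from math import lcm

def stp(a, b):
    """Left semi-tensor product: generalizes matrix multiplication to
    operands whose inner dimensions need not match, via Kronecker lifting."""
    n, p = a.shape[1], b.shape[0]
    t = lcm(n, p)
    return np.kron(a, np.eye(t // n)) @ np.kron(b, np.eye(t // p))

# A 2x2 key acting on a length-4 signal: impossible under ordinary matrix
# multiplication, allowed under STP -- so the stored key can be much smaller.
key = np.eye(2)
x = np.arange(4.0).reshape(4, 1)
y = stp(key, x)
assert y.shape == (4, 1)
assert np.allclose(y, x)   # the identity key reproduces the signal
```

The storage saving is the point: the node stores a key of size 2x2 instead of the 4x4 (or larger) matrix ordinary CS would require for the same signal length. (`math.lcm` requires Python 3.9+.)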


2020 ◽  
Vol 8 ◽  
Author(s):  
Zhaolong Luo ◽  
Xinming Qin ◽  
Lingyun Wan ◽  
Wei Hu ◽  
Jinlong Yang

Linear-scaling density functional theory (DFT) is an efficient method for describing the electronic structures of molecules, semiconductors, and insulators while avoiding the cubic-scaling cost of conventional DFT calculations. Here, we present a parallel implementation of a linear-scaling density-matrix trace-correcting (TC) purification algorithm to solve the Kohn–Sham (KS) equations with numerical atomic orbitals in the HONPAS package. Such a linear-scaling density-matrix purification algorithm is based on Kohn's nearsightedness principle, which yields a sparse Hamiltonian matrix with localized basis sets in DFT calculations. Sparse matrix multiplication is therefore the most time-consuming step of the density-matrix purification algorithm in linear-scaling DFT. We propose to use the MPI_Allgather function to parallelize the sparse matrix multiplication in the compressed sparse row (CSR) format, which scales up to hundreds of processing cores on modern heterogeneous supercomputers. We demonstrate the computational accuracy and efficiency of this parallel density-matrix purification algorithm with large-scale DFT calculations on boron nitride nanotubes containing tens of thousands of atoms.
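The paper's TC variant is not detailed in the abstract; the sketch below uses the closely related TC2 trace-correcting iteration (square the density matrix when its trace exceeds the electron count, otherwise apply 2P − P²), with serial SciPy CSR matrices standing in for the distributed sparse algebra. Dense eigenvalue bounds are used for brevity where a production code would use, e.g., Gershgorin estimates; all names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def tc2_purify(h, n_el, iters=60):
    """TC2 trace-correcting purification: drives a density-matrix guess
    toward the projector onto the n_el lowest eigenstates of h using only
    matrix multiplications, kept in CSR format as in linear-scaling DFT.
    Assumes an orthonormal basis."""
    e = np.linalg.eigvalsh(h.toarray())      # sketch only; real codes avoid this
    e_min, e_max = e[0], e[-1]
    n = h.shape[0]
    # Initial guess: spectrum of h mapped linearly (and reversed) into [0, 1].
    p = ((e_max * sp.identity(n, format="csr") - h) / (e_max - e_min)).tocsr()
    for _ in range(iters):
        p2 = p @ p                           # the dominant sparse-matmul step
        # Trace too high -> squaring lowers it; too low -> 2P - P^2 raises it.
        p = p2 if p.diagonal().sum() >= n_el else 2 * p - p2
    return p

# Toy 4-orbital Hamiltonian with 2 occupied states.
h = sp.csr_matrix(np.diag([-2.0, -1.0, 1.0, 2.0]))
p = tc2_purify(h, n_el=2)
assert abs(p.diagonal().sum() - 2.0) < 1e-6                      # Tr(P) = N_el
assert np.allclose(p.toarray(), np.diag([1.0, 1.0, 0.0, 0.0]), atol=1e-6)
```

Every iteration is one sparse-sparse multiplication, which is why the MPI_Allgather-based CSR multiply dominates the runtime in the parallel implementation.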


2019 ◽  
Vol 9 (1) ◽  
pp. 4 ◽  
Author(s):  
Jennifer Hasler

This discussion develops a theoretical analog architecture framework similar to the well-developed digital architecture theory. Analog system design, whether small or large scale, must optimize the architecture for energy consumption. As in digital systems, a strong architecture theory grounded in experimental results is essential for exploiting these opportunities. The recent availability of programmable and configurable analog technologies, together with the beginnings of analog numerical analysis, makes the scaling of analog computation more than a purely theoretical interest. Although some aspects nicely parallel digital architecture concepts, analog architecture theory requires revisiting some of the foundations of parallel digital architectures, particularly structures where communication and memory access, rather than processor operations, dominate complexity. The discussion presents multiple system examples, from Analog-to-Digital Converters (ADCs) to Vector-Matrix Multiplication (VMM), adaptive filters, image processing, sorting, and other computing directions.


2018 ◽  
Vol 18 (13&14) ◽  
pp. 1095-1114
Author(s):  
Zongyuan Zhang ◽  
Zhijin Guan ◽  
Hong Zhang ◽  
Haiying Ma ◽  
Weiping Ding

In order to realize the linear nearest neighbor (LNN) architecture of quantum circuits and reduce the quantum cost of linear reversible quantum circuits, a method for synthesizing and optimizing linear reversible quantum circuits based on matrix multiplication over the circuit structure is proposed. The method obtains the matrix representation of a linear quantum circuit by multiplying the matrices of the different parts of the whole circuit. LNN realization by inserting SWAP gates is proposed, and the equivalence of two ways of inserting the SWAP gates is proved. Elimination rules for SWAP gates between two overlapped adjacent quantum gates are given for different cases, which reduce the quantum cost of the circuits after the LNN architecture is realized. We also propose a parallel algorithm to reduce the processing time for large-scale quantum circuits. Experiments show that the quantum cost is reduced by 34.31% on average, and the GPU-based algorithm achieves a speed-up of 4× over the CPU-based algorithm. The average time optimization ratio for the large-scale benchmark circuits in RevLib processed by the parallel algorithm is 95.57% compared with the serial algorithm.
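The SWAP-insertion step can be illustrated directly: walk one operand of each two-qubit gate along the qubit line until it is adjacent to the other, apply the gate, then undo the moves. A toy sketch on gate tuples (a hypothetical encoding; the paper's elimination rules, which prune the redundant SWAPs this naive scheme produces, are omitted):

```python
def make_lnn(gates):
    """Insert SWAP gates so that every two-qubit gate acts on adjacent
    qubits of a line topology. Each gate is a (name, control, target)
    tuple; the control is walked next to the target and walked back."""
    out = []
    for name, c, t in gates:
        step = 1 if t > c else -1
        swaps, q = [], c
        while abs(t - q) > 1:            # move control one position at a time
            swaps.append(('SWAP', q, q + step))
            q += step
        out += swaps + [(name, q, t)] + swaps[::-1]   # undo the moves
    return out

circ = make_lnn([('CNOT', 0, 3)])
assert circ == [('SWAP', 0, 1), ('SWAP', 1, 2), ('CNOT', 2, 3),
                ('SWAP', 1, 2), ('SWAP', 0, 1)]
assert make_lnn([('CNOT', 1, 0)]) == [('CNOT', 1, 0)]   # already adjacent
```

Each inserted SWAP adds quantum cost, which is why cancelling SWAPs between overlapping adjacent gates, as the paper's elimination rules do, yields the reported 34.31% average improvement.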


2021 ◽  
Author(s):  
Fabio F. de Oliveira ◽  
Leonardo A. Dias ◽  
Marcelo Fernandes

In bioinformatics, alignment is an essential technique for finding similarities between biological sequences. The alignment is usually performed with the Smith-Waterman (SW) algorithm, a well-known, high-precision sequence alignment technique based on dynamic programming. However, given the massive data volume in biological databases and its continuous exponential growth, high-speed data processing is necessary. Therefore, this work proposes a parallel hardware design for the SW algorithm with a systolic array structure to accelerate the Forward and Backtracking steps. The architecture calculates and stores the paths during the Forward stage to pre-organize the alignment, which reduces the complexity of the Backtracking stage. Backtracking starts from the maximum-score position in the matrix and generates the optimal SW alignment path. The architecture was validated on a Field-Programmable Gate Array (FPGA), and synthesis analyses show that the proposed design reaches up to 79.5 Giga Cell Updates per Second (GCUPS).
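The two stages the systolic design accelerates, Forward fill and Backtracking, can be sketched in plain Python (linear gap penalty; the scoring parameters are illustrative defaults, not the paper's):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment: Forward fill of the SW score matrix, then
    Backtracking from the maximum-score cell down to the first zero."""
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, pos = 0, (0, 0)
    for i in range(1, len(a) + 1):            # Forward stage
        for j in range(1, len(b) + 1):
            diag = h[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            h[i][j] = max(0, diag, h[i-1][j] + gap, h[i][j-1] + gap)
            if h[i][j] > best:
                best, pos = h[i][j], (i, j)
    i, j = pos                                # Backtracking stage
    aln_a, aln_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        if h[i][j] == h[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch):
            aln_a.append(a[i-1]); aln_b.append(b[j-1]); i -= 1; j -= 1
        elif h[i][j] == h[i-1][j] + gap:
            aln_a.append(a[i-1]); aln_b.append('-'); i -= 1
        else:
            aln_a.append('-'); aln_b.append(b[j-1]); j -= 1
    return best, ''.join(reversed(aln_a)), ''.join(reversed(aln_b))

assert smith_waterman("ACGT", "CG") == (4, "CG", "CG")
```

In the hardware design, the Forward stage additionally records which of the three moves produced each cell, so Backtracking becomes a simple pointer walk instead of the score recomputation done here.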


Author(s):  
Alice Cortinovis ◽  
Daniel Kressner

Abstract: Randomized trace estimation is a popular and well-studied technique that approximates the trace of a large-scale matrix B by averaging $$x^T B x$$ over many samples of a random vector x. Often, B is symmetric positive definite (SPD), but a number of applications give rise to indefinite B. Most notably, this is the case for log-determinant estimation, a task that features prominently in statistical learning, for instance in maximum likelihood estimation for Gaussian process regression. The analysis of randomized trace estimates, including tail bounds, has mostly focused on the SPD case. In this work, we derive new tail bounds for randomized trace estimates applied to indefinite B with Rademacher or Gaussian random vectors. These bounds significantly improve existing results for indefinite B, reducing the number of required samples by a factor n or even more, where n is the size of B. Even for an SPD matrix, our work improves an existing result by Roosta-Khorasani and Ascher (Found Comput Math, 15(5):1187–1212, 2015) for Rademacher vectors. This work also analyzes the combination of randomized trace estimation with the Lanczos method for approximating the trace of f(B). Particular attention is paid to the matrix logarithm, which is needed for log-determinant estimation. We improve and extend an existing result to cover not only Rademacher but also Gaussian random vectors.
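The estimator under analysis is the classical Hutchinson scheme: average x^T B x over independent random sign vectors. A minimal NumPy sketch on a small symmetric indefinite matrix (matrix size and sample count are illustrative):

```python
import numpy as np

def hutchinson_trace(b, num_samples, rng):
    """Randomized trace estimator: average of x^T B x over independent
    Rademacher vectors x (i.i.d. entries +1 or -1)."""
    n = b.shape[0]
    total = 0.0
    for _ in range(num_samples):
        x = rng.choice([-1.0, 1.0], size=n)
        total += x @ b @ x
    return total / num_samples

rng = np.random.default_rng(42)
a = rng.standard_normal((50, 50))
b = (a + a.T) / 2 - np.eye(50)           # symmetric and generally indefinite
est = hutchinson_trace(b, 2000, rng)
assert abs(est - np.trace(b)) < 10.0     # unbiased; error shrinks with samples
```

For Rademacher vectors the diagonal of B contributes exactly (since each x_i² = 1), so the estimator's variance depends only on the off-diagonal entries; the tail bounds in the paper quantify how many samples this variance demands in the indefinite case.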


Materials ◽  
2021 ◽  
Vol 14 (9) ◽  
pp. 2431
Author(s):  
Wen Zhang ◽  
Juanjuan Wang ◽  
Xue Han ◽  
Lele Li ◽  
Enping Liu ◽  
...  

In this paper, effective separation of oil from both immiscible oil–water mixtures and oil-in-water (O/W) emulsions is achieved using poly(dimethylsiloxane)-based (PDMS-based) composite sponges. A modified hard-template method, using citric acid monohydrate as the hard template and dissolving it in ethanol, is proposed to prepare a PDMS sponge composited with carbon nanotubes (CNTs) both in the matrix and on the surface. The introduction of CNTs endows the composite sponge with comprehensive properties, including hydrophobicity, absorption capacity, and mechanical strength, that surpass those of pure PDMS. We demonstrate the successful application of the CNT-PDMS composite to the efficient removal of oil from immiscible oil–water mixtures, not only in batch absorption but also in continuous separation under both static and turbulent flow conditions. This notable characteristic makes the CNT-PDMS sponge a potential candidate for large-scale industrial oil–water separation. Furthermore, a polydopamine-modified (PDA-modified) CNT-PDMS is developed here, which for the first time realizes the separation of O/W emulsions without continuous squeezing of the sponge. The combined superhydrophilic and superoleophilic properties of PDA/CNT-PDMS are assumed to be critical to the spontaneous demulsification process.


Author(s):  
Fayu Wang ◽  
Nicholas Kyriakides ◽  
Christis Chrysostomou ◽  
Eleftherios Eleftheriou ◽  
Renos Votsis ◽  
...  

Abstract: Fabric-reinforced cementitious matrix (FRCM) composites, also known as textile-reinforced mortars (TRM), inorganic-matrix composites consisting of fibre fabrics and cement-based mortar, are becoming a widely used composite material in Europe for upgrading the seismic resistance of existing reinforced concrete (RC) frame buildings. One way of providing seismic upgrading is to apply the proposed FRCM system to existing masonry infill walls to increase their stiffness and integrity. To examine the effectiveness of this application, the bond characteristics achieved between (a) the matrix and the masonry substrate and (b) the fabric and the matrix need to be determined. A series of experiments, including 23 material performance tests, 15 direct tensile tests of dry fabric and composites, and 30 shear bond tests between the matrix and brick masonry, was carried out to investigate the fabric-to-matrix and matrix-to-substrate bond behaviour. In addition, different arrangements of extruded polystyrene (XPS) plates were applied to the FRCM to test the shear bond capacity of this insulation system when used on a large-scale wall.


Electronics ◽  
2021 ◽  
Vol 10 (3) ◽  
pp. 253
Author(s):  
Yosang Jeong ◽  
Hoon Ryu

The non-equilibrium Green's function (NEGF) method is utilized in nanoscience to predict the transport behavior of electronic devices. This work explores how much performance improvement can be achieved for quantum transport simulations with the aid of manycore computing, where the core numerical operation involves a recursive process of matrix multiplication. The major techniques adopted for performance enhancement are data restructuring, matrix tiling, thread scheduling, and offload computing, and we present technical details on how they are applied to optimize simulation performance on computing hardware including Intel Xeon Phi Knights Landing (KNL) systems and NVIDIA general-purpose graphics processing unit (GPU) devices. For a target structure of a silicon nanowire that consists of 100,000 atoms and is described with an atomistic tight-binding model, the effects of the optimization techniques on simulation performance are rigorously tested in a KNL node equipped with two Quadro GV100 GPU devices, and we observe that computation is accelerated by a factor of up to ∼20 over the unoptimized case. The feasibility of handling large-scale workloads in a huge computing environment is also examined with nanowire simulations over a wide energy range, where good scalability is obtained up to 2048 KNL nodes.
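The "recursive process of matrix multiplication" in NEGF transport is exemplified by the recursive Green's function sweep over a block-tridiagonal Hamiltonian: each step pairs a small inversion with the matrix multiplications that tiling and offloading accelerate. A dense NumPy sketch (block sizes are toy values; real simulations add contact self-energies, omitted here):

```python
import numpy as np

def rgf_last_block(e, h_diag, h_off):
    """Left-to-right recursive Green's function sweep for a block-tridiagonal
    Hamiltonian. h_diag[i] are diagonal blocks, h_off[i] the couplings
    H_{i,i+1}. Returns the last diagonal block of (eI - H)^{-1}."""
    nb = h_diag[0].shape[0]
    g = np.linalg.inv(e * np.eye(nb) - h_diag[0])
    for i in range(1, len(h_diag)):
        # Self-energy of the already-eliminated part: the recursion's matmul core.
        sigma = h_off[i - 1].conj().T @ g @ h_off[i - 1]
        g = np.linalg.inv(e * np.eye(nb) - h_diag[i] - sigma)
    return g

# Two 2x2 blocks; check the sweep against a dense inverse of the full matrix.
h11 = np.array([[0.0, 1.0], [1.0, 0.0]])
h22 = np.array([[0.5, 0.0], [0.0, -0.5]])
v = 0.3 * np.eye(2)                          # coupling block H_{12}
h = np.block([[h11, v], [v.conj().T, h22]])
g_full = np.linalg.inv(2.0 * np.eye(4) - h)
g_rgf = rgf_last_block(2.0, [h11, h22], [v])
assert np.allclose(g_rgf, g_full[2:, 2:])
```

The inversion cost stays fixed at the block size while the matmul count grows with device length, so for long nanowires the multiplications dominate, which is what makes tiling and GPU offload of this kernel worthwhile.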

