CENNA: Cost-Effective Neural Network Accelerator

Electronics ◽  
2020 ◽  
Vol 9 (1) ◽  
pp. 134
Author(s):  
Sang-Soo Park ◽  
Ki-Seok Chung

Convolutional neural networks (CNNs) are widely adopted in various applications. State-of-the-art CNN models deliver excellent classification performance, but they require a large amount of computation and data exchange because they typically employ many processing layers. Among these processing layers, convolution layers, which carry out many multiplications and additions, account for a major portion of computation and memory access. Therefore, reducing the amount of computation and memory access is key to high-performance CNNs. In this study, we propose a cost-effective neural network accelerator, named CENNA, whose hardware cost is reduced by employing a cost-centric matrix multiplication that combines Strassen’s multiplication with naïve multiplication. Furthermore, the convolution method using the proposed matrix multiplication can minimize data movement by reusing both the feature map and the convolution kernel without any additional control logic. In terms of throughput, power consumption, and silicon area, the efficiency of CENNA is up to 88 times higher than that of conventional designs for CNN inference.
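
As a rough illustration of the hybrid scheme the abstract describes, the sketch below switches between Strassen’s seven-multiplication recursion and naïve multiplication depending on a size cutoff. This is an editorial sketch, not code from the paper; the cutoff value and the power-of-two blocking are assumptions.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply square matrices, using Strassen's 7-multiplication
    recursion above `cutoff` and naive multiplication below it.
    The cutoff is an illustrative assumption, not from the paper."""
    n = A.shape[0]
    if n <= cutoff or n % 2:
        return A @ B  # naive multiplication for small or odd sizes
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Strassen's seven products (vs. eight for the naive block method)
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

Strassen trades one multiplication for extra additions, so falling back to the naïve method below a cutoff is what makes the hybrid cost-effective.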

Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1984
Author(s):  
Wei Zhang ◽  
Zihao Jiang ◽  
Zhiguang Chen ◽  
Nong Xiao ◽  
Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based system-on-chips (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency. Therefore, it is meaningful to design a high-performance DGEMM for ARMv8-based SoCs. However, as ARMv8-based SoCs integrate more and more cores, modern CPUs adopt non-uniform memory access (NUMA) architectures. NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains, which poses a challenge for developing high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method to reduce the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism, between and within nodes, in a purely threaded implementation, which preserves the task independence and data locality of NUMA nodes. We implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng920 architecture. The results show that NUMA-aware DGEMM effectively reduces the number of cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with the most remarkable improvement being 21.9%.
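
The two-level parallelism idea can be sketched in plain Python. This is illustrative only: the real implementation lives inside OpenBLAS’s threaded DGEMM with actual thread pinning, and the node/thread counts and row partitioning here are assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def numa_aware_dgemm(A, B, nodes=2, threads_per_node=2):
    """Toy sketch of two-level (inter-/intra-node) parallel DGEMM.
    Row panels of A are 'owned' by a node; each node's threads then
    split that panel further, so a thread only reads its own node's
    panel rather than remote NUMA domains."""
    m = A.shape[0]
    C = np.zeros((m, B.shape[1]))
    node_rows = np.array_split(np.arange(m), nodes)  # node-local panels

    def node_task(rows):
        # Intra-node level: threads share only this node's panel.
        chunks = np.array_split(rows, threads_per_node)
        def thread_task(r):
            C[r] = A[r] @ B  # all reads hit the node-local panel
        with ThreadPoolExecutor(threads_per_node) as pool:
            list(pool.map(thread_task, [c for c in chunks if len(c)]))

    # Inter-node level: one independent worker per NUMA node.
    with ThreadPoolExecutor(nodes) as pool:
        list(pool.map(node_task, node_rows))
    return C
```

Because each node’s task touches a disjoint row panel, the tasks are independent and the data each thread needs stays local, which is the property the abstract attributes to NUMA-aware DGEMM.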


2017 ◽  
Vol 1 (4) ◽  
pp. 271-277 ◽  
Author(s):  
Abdullah Caliskan ◽  
Mehmet Emin Yuksel

Abstract In this study, a deep neural network classifier is proposed for the classification of coronary artery disease (CAD) medical data sets. The proposed classifier is tested on reference CAD data sets from the literature and compared with popular representative classification methods regarding classification performance. Experimental results show that the deep neural network classifier offers much better accuracy, sensitivity, and specificity rates than the other methods. The proposed method presents itself as an easily accessible and cost-effective alternative to currently existing methods for the diagnosis of CAD, and it can be applied to easily check whether a given subject under examination has at least one occluded coronary artery.


2022 ◽  
Vol 15 (3) ◽  
pp. 1-31
Author(s):  
Shulin Zeng ◽  
Guohao Dai ◽  
Hanbo Sun ◽  
Jun Liu ◽  
Shiyao Li ◽  
...  

INFerence-as-a-Service (INFaaS) has become a primary workload in the cloud. However, existing FPGA-based Deep Neural Network (DNN) accelerators are mainly optimized for the fastest speed of a single task, while the multi-tenancy of INFaaS has not been explored yet. As the demand for INFaaS keeps growing, simply increasing the number of FPGA-based DNN accelerators is not cost-effective, while merely sharing these single-task-optimized DNN accelerators in a time-division multiplexing way could lead to poor isolation and high performance loss for INFaaS. On the other hand, current cloud-based DNN accelerators have excessive compilation overhead, especially when scaling out to multi-FPGA systems for multi-tenant sharing, leading to unacceptable compilation costs for both offline deployment and online reconfiguration. Existing approaches are therefore far from providing efficient and flexible FPGA virtualization for public and private cloud scenarios. Aiming to solve these problems, we propose a unified virtualization framework for general-purpose deep neural networks in the cloud, enabling multi-tenant sharing of both Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) accelerators on a single FPGA. Isolation is enabled by introducing a two-level instruction dispatch module and a multi-core-based hardware resource pool. These designs provide isolated and runtime-programmable hardware resources, which in turn yield performance isolation for multi-tenant sharing. To overcome the heavy re-compilation overhead, a tiling-based instruction frame package design and a two-stage static-dynamic compilation flow are proposed: only the lightweight runtime information is re-compiled, with an overhead of about 1 ms, thus guaranteeing the private cloud’s performance. Finally, extensive experimental results show that the proposed virtualized solutions achieve up to 3.12× and 6.18× higher throughput in the private cloud compared with the static CNN and RNN baseline designs, respectively.
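
To make the isolation idea concrete, here is a toy two-level dispatch loop: level one picks a tenant (round-robin here, an assumption), and level two hands that tenant’s next instruction frame to a core from the tenant’s private pool, so tenants never share a core. The data structures and scheduling policy are editorial guesses, not the paper’s design.

```python
from collections import deque

def dispatch(tenant_queues, cores):
    """Toy two-level dispatcher. `tenant_queues` maps a tenant to its
    pending instruction frames; `cores` maps a tenant to its reserved
    core pool. Frames from one tenant never land on another tenant's
    cores, illustrating the isolation property the framework targets."""
    schedule = {c: [] for pool in cores.values() for c in pool}
    tenants = deque(tenant_queues)
    while any(tenant_queues[t] for t in tenant_queues):
        t = tenants[0]
        tenants.rotate(-1)          # level 1: round-robin over tenants
        if not tenant_queues[t]:
            continue
        frame = tenant_queues[t].pop(0)
        pool = cores[t]             # level 2: tenant-private core pool
        core = min(pool, key=lambda c: len(schedule[c]))  # least loaded
        schedule[core].append(frame)
    return schedule
```

The private pools are what make hardware resources runtime-programmable per tenant without cross-tenant interference.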


2021 ◽  
Vol 15 ◽  
Author(s):  
Hengjin Ke ◽  
Cang Cai ◽  
Fengqin Wang ◽  
Fang Hu ◽  
Jiawei Tang ◽  
...  

Online end-to-end electroencephalogram (EEG) classification with high performance can assess the brain status of patients with Major Depressive Disorder (MDD) and track their condition over time, minimizing the risk of danger and suicide. However, it remains a grand research challenge due to (1) the intense embedded noise and the intrinsic non-stationarity determined by the evolution of brain states, and (2) the lack of effective decoupling of the complex relationship between the neural network and the brain state during the onset of brain diseases. This study designs a Frequency Channel-based convolutional neural network (CNN), namely FCCNN, to accurately and quickly identify depression; it fuses brain rhythms into the classifier’s attention mechanism, aiming to focus on the most important parts of the data and improve classification performance. Furthermore, to understand the complexity of the classifier, this study proposes a method for calculating information entropy based on the affinity propagation (AP) clustering partition, measuring the complexity of the classifier acting on each channel or brain region. We perform experiments on depression evaluation to distinguish healthy subjects from MDD patients. Results report that the proposed solution can identify MDD with an accuracy of 99±0.08%, a sensitivity of 99.07±0.05%, and a specificity of 98.90±0.14%. Furthermore, experiments on the quantitative interpretation of FCCNN illustrate significant differences between the frontal, left, and right temporal lobes of depression patients and the healthy control group.
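
One plausible reading of the entropy-over-a-clustering-partition measure is the Shannon entropy of the cluster-size distribution, where the labels would come from affinity propagation run on a channel’s activations. The sketch below is that guess only; the paper’s exact formulation may differ.

```python
import numpy as np

def partition_entropy(labels):
    """Shannon entropy (in bits) of a clustering partition. `labels`
    is a flat array of cluster assignments, e.g. from affinity
    propagation; more, evenly sized clusters mean higher entropy,
    i.e. a 'more complex' response on that channel."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

A single cluster yields entropy 0, and k equally sized clusters yield log2(k), so the measure grows with the diversity of the classifier’s behavior on a channel.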


2020 ◽  
Vol 15 (1) ◽  
pp. 15
Author(s):  
Felix Bach ◽  
Björn Schembera ◽  
Jos Van Wezel

Research data, as a truly valuable good in science, must be preserved and subsequently kept findable, accessible, and reusable for several years for reasons of proper scientific conduct. However, managing long-term storage of research data is a burden for institutes and researchers. Because of the sheer size and the required retention time, apt storage providers are hard to find. Aiming to solve this puzzle, the bwDataArchive project started development of a long-term research data archive that is reliable, cost-effective, and able to store multiple petabytes of data. The hardware consists of data storage on magnetic tape, interfaced with disk caches and nodes for data movement and access. On the software side, the High Performance Storage System (HPSS) was chosen for its proven ability to reliably store huge amounts of data; however, the implementation of bwDataArchive is not dependent on HPSS. For authentication, bwDataArchive is integrated into the federated identity management for educational institutions in the State of Baden-Württemberg, Germany. The archive features data protection by means of a dual copy at two distinct locations on different tape technologies, data accessibility by common storage protocols, data retention assurance for more than ten years, data preservation with checksums, and data management capabilities supported by a flexible directory structure allowing sharing and publication. As of September 2019, the bwDataArchive holds over 9 PB and 90 million files and sees a constant increase in usage and users from many communities.
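
The dual-copy-plus-checksum policy can be sketched in a few lines. The dict-backed stores and names below stand in for the two tape locations and are purely illustrative; the real archive records checksums alongside HPSS-managed tape copies.

```python
import hashlib

def archive_with_dual_copy(data: bytes, stores: dict):
    """Write the payload to two independent stores (standing in for
    two tape locations) and return a SHA-256 digest recorded for
    later integrity checks. Store names are illustrative."""
    digest = hashlib.sha256(data).hexdigest()
    for name in ("site_a", "site_b"):   # two distinct locations
        stores[name] = data
    return digest

def verify(stores, digest):
    """True only if both copies still match the recorded digest, so
    silent corruption of either copy is detected."""
    return all(hashlib.sha256(stores[k]).hexdigest() == digest
               for k in ("site_a", "site_b"))
```

With two copies on different media, a failed verification on one site can be repaired from the other, which is the point of keeping the copies on distinct tape technologies.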


2021 ◽  
Author(s):  
Manomita Chakraborty ◽  
Saroj Kumar Biswas ◽  
Biswajit Purkayastha

Abstract Neural networks are known for providing impressive classification performance, and the ensemble learning technique further acts as a catalyst to enhance this performance by integrating multiple networks. But like neural networks, neural network ensembles are also considered a black box because they cannot explain their decision-making process. So, despite their high classification performance, neural networks and their ensembles are not suited for applications which require explainable decisions. However, the rule extraction technique can overcome this drawback by representing the knowledge learned by a neural network in the form of interpretable decision rules. A rule extraction algorithm provides neural networks with the power to justify their classification responses through explainable classification rules. Several rule extraction algorithms exist to extract classification rules from neural networks, but only a few of them generate rules using neural network ensembles. So this paper proposes an algorithm named Rule Extraction using Ensemble of Neural Network Ensembles (RE-E-NNES) to demonstrate the high performance of neural network ensembles through rule extraction. RE-E-NNES extracts classification rules by ensembling several neural network ensembles. Results show the efficacy of the proposed RE-E-NNES algorithm compared to different existing rule extraction algorithms.


2020 ◽  
Vol 12 (4) ◽  
Author(s):  
Myo Taeg Lim ◽  
Dong Won Kim ◽  
Jun Ho Chung ◽  
Woo Jin Ahn ◽  
Sang Kyoo Park ◽  
...  

The convolutional neural network is one of the state-of-the-art techniques and has recently demonstrated superior performance in various computer vision systems. The conventional convolutional neural network has a one-way structure that generally trains on image information with a fixed filter size. However, this structure learns image information through only one fixed filter size, which is not optimal for achieving high network performance. In order to achieve high performance, this paper suggests a novel convolutional neural network consisting of a spatial transformer network and a multi-structure convolutional neural network. The spatial transformer network is robust against distorted images. The multi-structure convolutional neural network uses different filter sizes to capture global and local information from the given images. The proposed algorithm, the spatial transformer with multi-structure convolutional neural network (SPMCNN), demonstrates its classification performance on the German Traffic Sign Recognition Benchmark.
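
A minimal sketch of the multi-filter-size idea: apply kernels of different sizes to the same image so that small kernels pick up local detail and large kernels see more global structure, then concatenate the pooled responses. Kernel sizes, mean pooling, and the helper names are illustrative assumptions, not the SPMCNN architecture itself.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Plain 'valid' 2D convolution (no padding, stride 1)."""
    kh, kw = kernel.shape
    H = img.shape[0] - kh + 1
    W = img.shape[1] - kw + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = (img[i:i+kh, j:j+kw] * kernel).sum()
    return out

def multi_structure_features(img, kernels):
    """Toy multi-branch feature extractor: one branch per kernel size
    (e.g. 3x3 for local, 5x5 for more global detail), each branch mean-
    pooled to a scalar and concatenated into one feature vector."""
    return np.array([conv2d_valid(img, k).mean() for k in kernels])
```

In a real network, the branches would be learned convolution layers whose feature maps are concatenated before the classifier; the sketch only shows why differently sized receptive fields yield complementary features.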


TAPPI Journal ◽  
2018 ◽  
Vol 17 (09) ◽  
pp. 507-515 ◽  
Author(s):  
David Skuse ◽  
Mark Windebank ◽  
Tafadzwa Motsi ◽  
Guillaume Tellier

When pulp and minerals are co-processed in aqueous suspension, the mineral acts as a grinding aid, facilitating the cost-effective production of fibrils. Furthermore, this processing allows the utilization of robust industrial milling equipment. There are 40,000 dry metric tons of mineral/microfibrillated cellulose (MFC) composite production capacity in operation across three continents. These mineral/MFC products have been cleared by the FDA for use as a dry and wet strength agent in coated and uncoated food-contact paper and paperboard applications. We have previously reported that use of these mineral/MFC composite materials in fiber-based applications allows generally improved wet and dry mechanical properties with concomitant opportunities for cost savings, property improvements, or grade developments, and that the materials can be prepared using a range of fibers and minerals. Here, we: (1) report the development of new products that offer improved performance, (2) compare the performance of these new materials with that of a range of other nanocellulosic material types, (3) illustrate the performance of these new materials in reinforcement (paper and board) and viscosification applications, and (4) discuss product form requirements for different applications.


2011 ◽  
Vol 39 (3) ◽  
pp. 193-209 ◽  
Author(s):  
H. Surendranath ◽  
M. Dunbar

Abstract Over the last few decades, finite element analysis has become an integral part of the overall tire design process. Engineers need to perform a number of different simulations to evaluate new designs and study the effect of proposed design changes. However, tires pose formidable simulation challenges due to the presence of highly nonlinear rubber compounds, embedded reinforcements, complex tread geometries, rolling contact, and large deformations. Accurate simulation requires careful consideration of these factors, resulting in extensive turnaround times that often prolong the design cycle. Therefore, it is extremely critical to explore means of reducing the turnaround time while producing reliable results. Compute clusters have recently become a cost-effective means to perform high-performance computing (HPC). Distributed-memory parallel solvers designed to take advantage of compute clusters have become increasingly popular. In this paper, we examine the use of HPC for various tire simulations and demonstrate how it can significantly reduce simulation turnaround time. Abaqus/Standard is used for routine tire simulations like footprint and steady-state rolling. Abaqus/Explicit is used for transient rolling and hydroplaning simulations. The run times and scaling data corresponding to models of various sizes and complexity are presented.

