Exploiting crosstalk to speed up on-chip buses

Author(s):  
C. Duan ◽  
S.P. Khatri
Keyword(s):  
Speed Up ◽  
Author(s):  
Ш.С. Фахми ◽  
Н.В. Шаталова ◽  
В.В. Вислогузов ◽  
Е.В. Костикова

В данной работе предлагаются математический аппарат и архитектура многопроцессорной транспортной системы на кристалле (МПТСнК). Выполнена программно-аппаратная реализация интеллектуальной системы видеонаблюдения на базе технологии «система на кристалле» и с использованием аппаратного ускорителя известного метода формирования опорных векторов. Архитектура включает в себя сложно-функциональные блоки анализа видеоинформации на базе параллельных алгоритмов нахождения опорных точек изображений и множества элементарных процессоров для выполнения сложных вычислительных процедур алгоритмов анализа с использованием средств проектирования на базе реконфигурируемой системы на кристалле, позволяющей оценить количество аппаратных ресурсов. Предлагаемая архитектура МПТСнК позволяет ускорить обработку и анализ видеоинформации при решении задач обнаружения и распознавания чрезвычайных ситуаций и подозрительных поведений. In this paper, we propose the mathematical apparatus and architecture of a multiprocessor transport system on a chip (MPTSoC). Software and hardware implementation of an intelligent video surveillance system based on the "system on chip" technology and using a hardware accelerator of the well-known method of forming reference vectors. The architecture includes complex functional blocks for analyzing video information based on parallel algorithms for finding image reference points and a set of elementary processors for performing complex computational procedures for algorithmic analysis. using design tools based on a reconfigurable system on chip that allows you to estimate the amount of hardware resources. The proposed MPTSoC architecture makes it possible to speed up the processing and analysis of video information when solving problems of detecting and recognizing emergencies and suspicious behaviors


2022 ◽  
Vol 15 (2) ◽  
pp. 1-33
Author(s):  
Mikhail Asiatici ◽  
Paolo Ienne

Applications such as large-scale sparse linear algebra and graph analytics are challenging to accelerate on FPGAs due to the short irregular memory accesses, resulting in low cache hit rates. Nonblocking caches reduce the bandwidth required by misses by requesting each cache line only once, even when there are multiple misses corresponding to it. However, such reuse mechanism is traditionally implemented using an associative lookup. This limits the number of misses that are considered for reuse to a few tens, at most. In this article, we present an efficient pipeline that can process and store thousands of outstanding misses in cuckoo hash tables in on-chip SRAM with minimal stalls. This brings the same bandwidth advantage as a larger cache for a fraction of the area budget, because outstanding misses do not need a data array, which can significantly speed up irregular memory-bound latency-insensitive applications. In addition, we extend nonblocking caches to generate variable-length bursts to memory, which increases the bandwidth delivered by DRAMs and their controllers. The resulting miss-optimized memory system provides up to 25% speedup with 24× area reduction on 15 large sparse matrix-vector multiplication benchmarks evaluated on an embedded and a datacenter FPGA system.


2012 ◽  
pp. 658-676
Author(s):  
Fabio Garzia ◽  
Roberto Airoldi ◽  
Jari Nurmi

This paper describes two general-purpose architectures targeted to Field Programmable Gate Array (FPGA) implementation. The first architecture is based on the coupling of a coarse-grain reconfigurable array with a general-purpose processor core. The second architecture is a homogeneous multi-processor system-on-chip (MP-SoC). Both architectures have been mapped onto two different Altera FPGA devices, a StratixII and a StratixIV. Although mapping onto the StratixIV results in higher operating frequencies, the capabilities of the device are not fully exploited. The implementation of a FFT on the two platforms shows a considerable speed-up in comparison with a single-processor reference architecture. The speed-up is higher in the reconfigurable solution but the MP-SoC provides an easier programming interface that is completely based on C language. The authors’ approach proves that implementing a programmable architecture on FPGA and then programming it using a high-level software language is a viable alternative to designing a dedicated hardware block with a hardware description language (HDL) and mapping it on FPGA.


2019 ◽  
Vol 5 (1) ◽  
pp. 16 ◽  
Author(s):  
Fahad Siddiqui ◽  
Sam Amiri ◽  
Umar Minhas ◽  
Tiantai Deng ◽  
Roger Woods ◽  
...  

FPGA-based embedded image processing systems offer considerable computing resources but present programming challenges when compared to software systems. The paper describes an approach based on an FPGA-based soft processor called Image Processing Processor (IPPro) which can operate up to 337 MHz on a high-end Xilinx FPGA family and gives details of the dataflow-based programming environment. The approach is demonstrated for a k-means clustering operation and a traffic sign recognition application, both of which have been prototyped on an Avnet Zedboard that has Xilinx Zynq-7000 system-on-chip (SoC). A number of parallel dataflow mapping options were explored giving a speed-up of 8 times for the k-means clustering using 16 IPPro cores, and a speed-up of 9.6 times for the morphology filter operation of the traffic sign recognition using 16 IPPro cores compared to their equivalent ARM-based software implementations. We show that for k-means clustering, the 16 IPPro cores implementation is 57, 28 and 1.7 times more power efficient (fps/W) than ARM Cortex-A7 CPU, nVIDIA GeForce GTX980 GPU and ARM Mali-T628 embedded GPU respectively.


Electronics ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 504
Author(s):  
Karine Avagian ◽  
Milica Orlandić

This paper proposes an implementation of a Richardson-Lucy (RL) deconvolution method to reduce the spatial degradation in hyperspectral images during the image acquisition process. The degradation, modeled by convolution with a point spread function (PSF), is reduced by applying both standard and accelerated RLdeconvolution algorithms on the individual images in spectral bands. Boundary conditions are introduced to maintain a constant image size without distorting the estimated image boundaries. The RL deconvolution algorithm is implemented on a field-programmable gate array (FPGA)-based Xilinx Zynq-7020 System-on-Chip (SoC). The proposed architecture is parameterized with respect to the image size and configurable with respect to the algorithm variant, the number of iterations, and the kernel size by setting the dedicated configuration registers. A speed-up by factors of 61 and 21 are reported compared to software-only and FPGA-based state-of-the-art implementations, respectively.


2018 ◽  
Vol 4 (2) ◽  
pp. 100042
Author(s):  
Robert Deelen ◽  
Martin Wieland ◽  
Susanne Gerber ◽  
David Fournier

Epigenetic features such as histone and DNA modifications are important mechanisms for the regulation of gene expression and for cell and tissue development. As a result, extensive efforts are currently undertaken using next-generation sequencing (NGS) to generate vast amounts of data regarding the epigenetic regulation of genomes. Several tools and frameworks for the processing of these NGS data have been developed in the last decade. Nevertheless, each user still bares the challenge to integrate all these tasks to perform the analysis. This procedure is not only tedious but also resource-intensive due to the putative large processing power involved. To automate, standardize and speed up the handling of NGS data, with focus on ChIP-seq data, we present a user-friendly pipeline that automatically processes a list of sequencing data files and returns a ready-to-use purified table for subsequent modelling or analysis attempts.


Author(s):  
Fabio Garzia ◽  
Roberto Airoldi ◽  
Jari Nurmi

This paper describes two general-purpose architectures targeted to Field Programmable Gate Array (FPGA) implementation. The first architecture is based on the coupling of a coarse-grain reconfigurable array with a general-purpose processor core. The second architecture is a homogeneous multi-processor system-on-chip (MP-SoC). Both architectures have been mapped onto two different Altera FPGA devices, a StratixII and a StratixIV. Although mapping onto the StratixIV results in higher operating frequencies, the capabilities of the device are not fully exploited. The implementation of a FFT on the two platforms shows a considerable speed-up in comparison with a single-processor reference architecture. The speed-up is higher in the reconfigurable solution but the MP-SoC provides an easier programming interface that is completely based on C language. The authors’ approach proves that implementing a programmable architecture on FPGA and then programming it using a high-level software language is a viable alternative to designing a dedicated hardware block with a hardware description language (HDL) and mapping it on FPGA.


Sensors ◽  
2020 ◽  
Vol 20 (8) ◽  
pp. 2322
Author(s):  
Ahsen Tahir ◽  
Gordon Morison ◽  
Dawn A. Skelton ◽  
Ryan M. Gibson

Falls are a leading cause of death in older adults and result in high levels of mortality, morbidity and immobility. Fall Detection Systems (FDS) are imperative for timely medical aid and have been known to reduce death rate by 80%. We propose a novel wearable sensor FDS which exploits fractal dynamics of fall accelerometer signals. Fractal dynamics can be used as an irregularity measure of signals and our work shows that it is a key discriminant for classification of falls from other activities of life. We design, implement and evaluate a hardware feature accelerator for computation of fractal features through multi-level wavelet transform on a reconfigurable embedded System on Chip, Zynq device for evaluating wearable accelerometer sensors. The proposed FDS utilises a hardware/software co-design approach with hardware accelerator for fractal features and software implementation of Linear Discriminant Analysis on an embedded ARM core for high accuracy and energy efficiency. The proposed system achieves 99.38% fall detection accuracy, 7.3× speed-up and 6.53× improvements in power consumption, compared to the software only execution with an overall performance per Watt advantage of 47.6×, while consuming low reconfigurable resources at 28.67%.


Sign in / Sign up

Export Citation Format

Share Document