Resource Efficient Hardware Architecture for Fast Computation of Running Max/Min Filters

The Scientific World JOURNAL ◽

10.1155/2013/108103 ◽

2013 ◽

Vol 2013 ◽

pp. 1-10 ◽

Cited By ~ 2

Author(s):

Cesar Torres-Huitzil

Keyword(s):

Digital Signal ◽

Hardware Architecture ◽

Clock Frequency ◽

Kernel Size ◽

One Dimensional ◽

Field Programmable ◽

Custom Hardware ◽

Performance Area ◽

Memory Resources ◽

Embedded Applications

Running max/min filters on rectangular kernels are widely used in many digital signal and image processing applications. Filtering with a k × k kernel requires k² − 1 comparisons per sample in a direct implementation; thus, the cost grows quadratically with the kernel size k. Faster computation can be achieved by kernel decomposition and by using constant-time one-dimensional algorithms on custom hardware. This paper presents a hardware architecture for real-time computation of running max/min filters based on the van Herk/Gil-Werman (HGW) algorithm. The proposed architecture uses fewer computation and memory resources than previously reported architectures when targeted to Field Programmable Gate Array (FPGA) devices. Implementation results show that the architecture can compute max/min filters on 1024 × 1024 images with kernels of up to 255 × 255 in around 8.4 milliseconds (about 120 frames per second) at a clock frequency of 250 MHz. The implementation is highly scalable with the kernel size and offers a good performance/area tradeoff suitable for embedded applications. The applicability of the architecture is shown for local adaptive image thresholding.
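The idea behind the HGW algorithm named in the abstract can be illustrated in software. The following is a minimal Python sketch, not the paper's hardware architecture: the function names `running_max_hgw` and `running_max_2d` are hypothetical, the output uses "valid" windows only, and padding with negative infinity is an assumption made here to keep blocks full. It splits the signal into blocks of length k, builds per-block prefix and suffix maxima, and merges them, so each sample costs roughly three comparisons regardless of k. The k × k case is handled by kernel decomposition: a row pass followed by a column pass.

```python
def running_max_hgw(x, k):
    """Running max over windows x[i:i+k] ('valid' mode), HGW-style."""
    n = len(x)
    if k < 1 or k > n:
        raise ValueError("kernel size k must satisfy 1 <= k <= len(x)")
    # Assumption of this sketch: pad with -inf so every block has k samples.
    pad = (-n) % k
    xs = list(x) + [float("-inf")] * pad
    m = len(xs)
    g = [0] * m  # prefix maxima, restarted at the start of each block
    h = [0] * m  # suffix maxima, restarted at the end of each block
    for i in range(m):
        g[i] = xs[i] if i % k == 0 else max(g[i - 1], xs[i])
    for i in range(m - 1, -1, -1):
        h[i] = xs[i] if i % k == k - 1 else max(h[i + 1], xs[i])
    # A window [i, i+k-1] spans at most two blocks: h covers the left part,
    # g covers the right part.
    return [max(h[i], g[i + k - 1]) for i in range(n - k + 1)]


def running_max_2d(img, k):
    """k x k running max via separability: rows first, then columns."""
    rows = [running_max_hgw(r, k) for r in img]
    cols = [running_max_hgw(list(c), k) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

For example, `running_max_hgw([3, 1, 4, 1, 5, 9, 2, 6], 3)` yields `[4, 4, 5, 9, 9, 9]`. A running min follows by swapping `max` for `min` and padding with positive infinity.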

Comparison of Power Converter Models with Losses for Hardware-in-the-Loop Using Different Numerical Formats

Electronics ◽

10.3390/electronics8111255 ◽

2019 ◽

Vol 8 (11) ◽

pp. 1255

Author(s):

Elyas Zamiri ◽

Alberto Sanchez ◽

Angel de Castro ◽

Maria Sofia Martínez-García

Keyword(s):

Fixed Point ◽

High Speed ◽

Large Data ◽

Digital Signal ◽

Power Converter ◽

Hardware In The Loop ◽

Clock Frequency ◽

Point Model ◽

Embedded Dsp ◽

Field Programmable

The Hardware-in-the-Loop (HIL) technique is now widely used to test power electronic converters. These real-time simulations require processing large amounts of data at high speed, which makes the application well suited to Field Programmable Gate Arrays (FPGAs), as they are capable of parallel processing. This paper provides an analytical discussion of three HIL models for a full-bridge converter. The three models use different numerical formats, namely floating point and fixed point, the latter with and without optimizing the width of signals to the embedded Digital Signal Processor (DSP) blocks of the FPGA. The optimized fixed-point model (OFPM) uses three and two times fewer DSP blocks or Look-Up Tables (LUTs), and its maximum achievable clock frequency is up to 35% and 25% higher, than the float model and the non-optimized fixed-point model (nOFPM), respectively. Furthermore, the models' accuracy is proportional to the clock frequency, so the OFPM is also the most accurate model. Finally, the paper shows how the simulation differs when the models do or do not include losses, showing that omitting losses leads to high errors, especially during transients.
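The trade-off between floating-point and fixed-point formats can be made concrete with a small example. The sketch below is a generic illustration of fixed-point arithmetic in Python and is not the authors' converter model: the helper names `to_fixed`, `to_float`, and `fixed_mul`, and the choice of 8 fractional bits, are assumptions made here. It shows that a value is stored as a scaled integer, that a product must be shifted back to the working scale, and that the number of fractional bits bounds the quantization error.

```python
def to_fixed(value: float, frac_bits: int) -> int:
    """Quantize a real value to an integer with frac_bits fractional bits."""
    return round(value * (1 << frac_bits))


def to_float(raw: int, frac_bits: int) -> float:
    """Convert a fixed-point integer back to a float for inspection."""
    return raw / (1 << frac_bits)


def fixed_mul(a: int, b: int, frac_bits: int) -> int:
    """Multiply two fixed-point values that share the same scale.

    The raw product carries 2 * frac_bits fractional bits, so it is
    shifted right to return to the working scale (truncating).
    """
    return (a * b) >> frac_bits


FRAC_BITS = 8  # hypothetical width chosen only for this illustration
a = to_fixed(0.5, FRAC_BITS)
b = to_fixed(0.25, FRAC_BITS)
product = to_float(fixed_mul(a, b, FRAC_BITS), FRAC_BITS)
error = abs(to_float(to_fixed(0.1, FRAC_BITS), FRAC_BITS) - 0.1)
```

Here `product` is 0.125, and `error` stays below half of one least significant bit (2⁻⁹ for 8 fractional bits). Narrower signals need fewer hardware multiplier resources but raise this error bound.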

Embedded GPU Implementation for High-Performance Ultrasound Imaging

Electronics ◽

10.3390/electronics10080884 ◽

2021 ◽

Vol 10 (8) ◽

pp. 884

Author(s):

Stefano Rossi ◽

Enrico Boni

Keyword(s):

High Performance ◽

Graphics Processing Unit ◽

Digital Signal ◽

Processing Unit ◽

Embedded Computing ◽

Field Programmable ◽

Peripheral Component Interconnect ◽

Programmable Gate Arrays ◽

Graphics Processing ◽

Signal Processors

Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources allowing massive exploitation of parallel computing are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the implementation of the embedded NVIDIA Jetson Xavier AGX module on board ULA-OP 256. The system architecture was revised to allow the introduction of a new Peripheral Component Interconnect Express (PCIe) communication channel, while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system independent, freeing the user from the need to use an external controlling PC.

High Speed FPGA Based 128-bit Advance Encryption Standard (AES)

International Journal of Sensors Wireless Communications and Control ◽

10.2174/2210327911666210201104151 ◽

2021 ◽

Vol 11 ◽

Author(s):

Ibrahem M. T. Hamidi ◽

Farah S. H. Al-aassi

Keyword(s):

High Throughput ◽

Field Programmable Gate Array ◽

High Speed ◽

Design Tool ◽

Advanced Encryption Standard ◽

Clock Frequency ◽

Advance Encryption Standard ◽

Field Programmable ◽

Internal Component ◽

Gate Array

Aim: Achieve a high-throughput, 128-bit, FPGA-based Advanced Encryption Standard (AES) implementation. Background: A Field Programmable Gate Array (FPGA) provides an efficient platform for designing an AES cryptography system. It allows control over each bit using a hardware description language such as VHDL or Verilog, which results in an output speed in the Gbps range. Objective: Use an FPGA to design a high-throughput 128-bit AES implementation. Method: Pipelining was used to achieve the maximum possible speed. The pipelining includes round pipelining and internal component pipelining, in which a number of registers are inserted at particular places to increase the output speed. The proposed design uses combinatorial logic to implement the byte substitution. The S-box is implemented using composite field arithmetic with 7 stages of pipelining to reduce the combinatorial logic depth. The presented model was implemented in VHDL using the Xilinx ISE 14.4 design tool. Result: The design achieved 18.55 Gbps at a clock frequency of 144.96 MHz with an area of 1568 slices on Spartan-3 xc3s1000 hardware. Conclusion: The results show that the proposed design reaches a high throughput with acceptable area usage compared with other designs in the literature.
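The reported throughput can be related to the clock frequency with a short calculation. The sketch below assumes that a fully pipelined design emits one 128-bit block per clock cycle once the pipeline is full; the abstract does not state this explicitly, but the reported figures are consistent with it. The function name `pipelined_throughput_gbps` and the `blocks_per_cycle` parameter are hypothetical.

```python
def pipelined_throughput_gbps(block_bits: int, clock_mhz: float,
                              blocks_per_cycle: float = 1.0) -> float:
    """Steady-state throughput of a pipeline, in Gbps.

    Assumes the pipeline is full, so latency does not affect the rate.
    bits/cycle * (cycles/s in MHz) gives Mbps; divide by 1000 for Gbps.
    """
    return block_bits * clock_mhz * blocks_per_cycle / 1000.0


# Figures taken from the abstract: 128-bit blocks at 144.96 MHz.
throughput = pipelined_throughput_gbps(128, 144.96)
```

Under that assumption, `round(throughput, 2)` gives 18.55, matching the result stated in the abstract.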

Study on Enhancing Terminal Identification in Track and Field Using Digital X-Ray Photography Image

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.989-994.3851 ◽

2014 ◽

Vol 989-994 ◽

pp. 3851-3855

Author(s):

Guang Jin Lai

Keyword(s):

Motion Capture ◽

Clustering Algorithm ◽

Digital Signal ◽

Human Head ◽

Track And Field ◽

Image Clustering ◽

Digital Signals ◽

Recognition Method ◽

One Dimensional ◽

X Ray

Digital X-ray photography, under computer control, uses a one-dimensional or 2D X-ray detector to convert the captured image directly into digital signals, which are then handled with image processing technology to perform image analysis. We introduce X-ray photography technology into terminal identification in track and field, and use a clustering algorithm to improve computer image clustering. By capturing the digital signal of the human head, arms, and legs, the approach enhances the terminal recognition method in track and field. Finally, we use MATLAB to calculate the captured image values of the X-ray photography. The calculation shows that motion capture and recognition of X-ray images are clearly enhanced. This provides a theoretical basis for research on motion capture technology in track and field.

One-dimensional digital signal processing

Image and Vision Computing ◽

10.1016/0262-8856(83)90010-0 ◽

1983 ◽

Vol 1 (1) ◽

pp. 58

Author(s):

KG Beauchamp

Keyword(s):

Signal Processing ◽

Digital Signal Processing ◽

Digital Signal ◽

One Dimensional

Argus CNN Accelerator Based on Kernel Clustering and Resource-Aware Pruning

Elektronika ir Elektrotechnika ◽

10.5755/j02.eie.28922 ◽

2021 ◽

Vol 27 (3) ◽

pp. 57-70

Author(s):

Damjan M. Rakanovic ◽

Vuk Vranjkovic ◽

Rastislav J. R. Struharik

Keyword(s):

Digital Signal Processor ◽

State Of The Art ◽

Digital Signal ◽

Pruning Algorithm ◽

Kernel Clustering ◽

Field Programmable ◽

Comparable Performance ◽

On Chip ◽

Resource Characteristics ◽

Resource Aware

This paper proposes a two-step Convolutional Neural Network (CNN) pruning algorithm and a resource-efficient Field-Programmable Gate Array (FPGA) CNN accelerator named "Argus". The proposed CNN pruning algorithm first combines similar kernels into clusters, which are then pruned using the same regular pruning pattern. The pruning algorithm is carefully tailored for FPGAs, considering their resource characteristics. Regular sparsity results in high Multiply-Accumulate (MAC) efficiency, reducing the amount of logic required to balance workloads among different MAC units. As a result, the Argus accelerator requires about 170 Look-Up Tables (LUTs) per Digital Signal Processor (DSP) block. This number is close to the average LUT/DSP ratio for various FPGA families, enabling balanced resource utilization when implementing Argus. Benchmarks conducted using a Xilinx Zynq UltraScale+ Multi-Processor System-on-Chip (MPSoC) indicate that Argus achieves up to 25 times higher frames per second than NullHop, 2 and 2.5 times higher than NEURAghe and Snowflake, respectively, and 2 times higher than NVDLA. Argus shows comparable performance to MIT's Eyeriss v2 and Caffeine, requiring up to 3 times less memory bandwidth and utilizing 4 times fewer DSP blocks, respectively. Besides the absolute performance, Argus has at least 1.3 and 2 times better GOP/s/DSP and GOP/s/Block-RAM (BRAM) ratios, while being competitive in terms of GOP/s/LUT, compared to some of the state-of-the-art solutions.

An ILP solution to address code generation for embedded applications on digital signal processors

ACM Transactions on Design Automation of Electronic Systems ◽

10.1145/2209291.2209301 ◽

2012 ◽

Vol 17 (3) ◽

pp. 1-23

Author(s):

Hassan Salamy ◽

J. Ramanujam

Keyword(s):

Code Generation ◽

Digital Signal ◽

Digital Signal Processors ◽

Signal Processors ◽

Embedded Applications

Implementation of Biologically Inspired Components in Embedded Vision Systems

Developing and Applying Biologically-Inspired Vision Systems ◽

10.4018/978-1-4666-2539-6.ch013 ◽

2012 ◽

pp. 307-345 ◽

Cited By ~ 1

Author(s):

Christopher Wing Hong Ngau ◽

Li-Minn Ang ◽

Kah Phooi Seng

Keyword(s):

Field Programmable Gate Array ◽

Resource Constraints ◽

Low Complexity ◽

Primary Concern ◽

Complex Data ◽

Biologically Inspired ◽

Computational Speed ◽

Visual Tasks ◽

Field Programmable ◽

Memory Resources

Studies in the area of computational vision have shown the capability of visual attention (VA) processing in aiding various visual tasks by providing a means for simplifying complex data handling and supporting action decisions using readily available low-level features. Because it includes computational biological vision components that mimic the mechanism of the human visual system, VA processing is computationally complex, has heavy memory requirements, and is often implemented on workstations where resource constraints do not apply. In embedded systems, computational capacity and memory resources are a primary concern. To allow VA processing in such systems, the chapter presents a low-complexity, low-memory VA model based on an established mainstream VA model. The model addresses critical factors in terms of algorithm complexity, memory requirements, computational speed, and salience prediction performance to ensure the reliability of VA processing in an environment with limited resources. Lastly, a custom softcore microprocessor-based hardware implementation on a Field-Programmable Gate Array (FPGA) is used to verify the implementation feasibility of the presented low-complexity, low-memory VA model.

Automatic Tool for Fast Generation of Custom Convolutional Neural Networks Accelerators for FPGA

Electronics ◽

10.3390/electronics8060641 ◽

2019 ◽

Vol 8 (6) ◽

pp. 641 ◽

Cited By ~ 7

Author(s):

Miguel Rivera-Acosta ◽

Susana Ortega-Cisneros ◽

Jorge Rivera

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

State Of The Art ◽

Third Party ◽

Hardware Accelerators ◽

Field Programmable ◽

Custom Hardware ◽

Architecture Description ◽

Automatic Tool

This paper presents a platform that automatically generates custom hardware accelerators for convolutional neural networks (CNNs) implemented in field-programmable gate array (FPGA) devices. It includes a user interface for configuring and managing these accelerators. The platform can perform all the processes necessary to design and test CNN accelerators, starting from the CNN architecture description at both layer and internal parameter levels: training the desired architecture with any dataset and generating the configuration files required by the platform. With these files, it can synthesize the register-transfer level (RTL) description and program the customized CNN accelerator into the FPGA device for testing, making it possible to generate custom CNN accelerators quickly and easily. All processes except the CNN architecture description are fully automated and carried out by the platform, which manages third-party software to train the CNN and to synthesize and program the generated RTL. The platform has been tested by implementing some state-of-the-art CNN architectures for freely available datasets such as MNIST, CIFAR-10, and STL-10.

Heterogenous Computing on Iris Matching with OpenCL

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.850.129 ◽

2016 ◽

Vol 850 ◽

pp. 129-135

Author(s):

Buğra Şimşek ◽

Nursel Akçam

Keyword(s):

Graphics Processing Units ◽

Iris Recognition ◽

Heterogeneous Computing ◽

Hamming Distance ◽

Heterogeneous Systems ◽

Digital Signal ◽

Mobile Platforms ◽

Central Processing ◽

Field Programmable ◽

Graphics Processing

This study presents a parallelization of the Hamming distance algorithm, which is used for iris comparison in iris recognition systems, using OpenCL. It targets heterogeneous systems that can include Central Processing Units (CPUs), Graphics Processing Units (GPUs), Digital Signal Processing (DSP) boards, Field Programmable Gate Arrays (FPGAs), and some mobile platforms. OpenCL allows the same code to run on CPUs, GPUs, FPGAs, and DSP boards. Heterogeneous computing refers to systems that include different kinds of devices (CPUs, GPUs, FPGAs, and other accelerators). It improves performance or reduces power consumption for suitable algorithms on these OpenCL-supported devices. In this study, the Hamming distance algorithm was coded in C++ as sequential code and then parallelized with OpenCL using a method we designed. Our OpenCL code was executed on an Nvidia GT430 GPU and an Intel Xeon 5650 processor. The OpenCL implementation demonstrates a speedup of up to 87 times with parallelization. Our study also differs from other studies that accelerate iris matching in that it enables heterogeneous computing by using OpenCL.
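The comparison step described in the abstract can be sketched as follows. This is a sequential Python illustration, not the authors' C++ or OpenCL code: the function names `hamming_distance` and `best_match`, the representation of an iris code as a Python integer, and the omission of any occlusion masks are assumptions made here. It computes the fraction of differing bits with an XOR followed by a bit count. Each probe-to-gallery comparison is independent of the others, which is the property that makes the workload suitable for parallel devices.

```python
def hamming_distance(a: int, b: int, nbits: int) -> float:
    """Fraction of the nbits low-order bit positions in which a and b differ."""
    mask = (1 << nbits) - 1
    diff = (a ^ b) & mask           # 1 wherever the two codes disagree
    return bin(diff).count("1") / nbits


def best_match(probe: int, gallery: list, nbits: int) -> tuple:
    """Index and distance of the gallery code closest to the probe.

    Every comparison is independent, so this loop is the part that a
    parallel implementation would distribute across work items.
    """
    distances = [hamming_distance(probe, g, nbits) for g in gallery]
    idx = min(range(len(distances)), key=distances.__getitem__)
    return idx, distances[idx]
```

For example, `hamming_distance(0b1011, 0b0010, 4)` is 0.5, and `best_match(0b1011, [0b0000, 0b1010, 0b0100], 4)` returns `(1, 0.25)`.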
