Resource Efficient Hardware Architecture for Fast Computation of Running Max/Min Filters

2013 ◽  
Vol 2013 ◽  
pp. 1-10 ◽  
Author(s):  
Cesar Torres-Huitzil

Running max/min filters on rectangular kernels are widely used in many digital signal and image processing applications. Filtering with a k×k kernel requires k²−1 comparisons per sample for a direct implementation; thus, the cost grows quadratically with the kernel size k. Faster computations can be achieved by kernel decomposition and by using constant-time one-dimensional algorithms on custom hardware. This paper presents a hardware architecture for real-time computation of running max/min filters based on the van Herk/Gil-Werman (HGW) algorithm. The proposed architecture uses fewer computation and memory resources than previously reported architectures when targeted to Field Programmable Gate Array (FPGA) devices. Implementation results show that the architecture is able to compute max/min filters on 1024×1024 images with up to 255×255 kernels in around 8.4 milliseconds, or about 120 frames per second, at a clock frequency of 250 MHz. The implementation is highly scalable with respect to the kernel size and offers a good performance/area tradeoff suitable for embedded applications. The applicability of the architecture is shown for local adaptive image thresholding.
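
The constant-time one-dimensional idea behind the HGW algorithm named in the abstract above can be illustrated in software. The following is a minimal Python sketch, not the paper's hardware architecture: the function name `running_max_hgw`, the list-based buffers, and the padding with negative infinity are assumptions made for this example. The signal is split into blocks of length k, a running maximum is computed forward and backward inside each block, and each window maximum is then one comparison between the two buffers, so the work per sample does not depend on k.

```python
def running_max_hgw(x, k):
    """Return [max(x[i:i+k]) for each valid i] using the HGW block scheme.

    Illustrative sketch only; names and padding strategy are assumptions.
    """
    n = len(x)
    if k < 1 or k > n:
        raise ValueError("k must satisfy 1 <= k <= len(x)")

    # Pad to a multiple of k so every block is complete.
    neg_inf = float("-inf")
    xs = list(x) + [neg_inf] * ((-n) % k)
    m = len(xs)

    # g[i]: max from the start of i's block up to i (forward pass).
    g = [neg_inf] * m
    for i in range(m):
        g[i] = xs[i] if i % k == 0 else max(g[i - 1], xs[i])

    # h[i]: max from i to the end of i's block (backward pass).
    h = [neg_inf] * m
    for i in range(m - 1, -1, -1):
        h[i] = xs[i] if i % k == k - 1 else max(h[i + 1], xs[i])

    # A window [i, i+k-1] covers the tail of one block and the head of the
    # next, so its maximum is a single comparison of h[i] and g[i+k-1].
    return [max(h[i], g[i + k - 1]) for i in range(n - k + 1)]


signal = [3, 1, 4, 1, 5, 9, 2, 6]
print(running_max_hgw(signal, 3))
```

A running min follows by swapping `max` for `min` and padding with positive infinity. For a k×k kernel, the same one-dimensional pass can be applied along rows and then along columns, which is the kernel decomposition the abstract refers to.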

Electronics ◽  
2019 ◽  
Vol 8 (11) ◽  
pp. 1255
Author(s):  
Elyas Zamiri ◽  
Alberto Sanchez ◽  
Angel de Castro ◽  
Maria Sofia Martínez-García

Nowadays, the Hardware-In-the-Loop (HIL) technique is widely used to test different power electronic converters. These real-time simulations require processing large amounts of data at high speed, which makes this application very suitable for Field Programmable Gate Arrays (FPGAs), as they are capable of parallel processing. This paper provides an analytical discussion of three HIL models for a full-bridge converter. The three models use different numerical formats, namely float and fixed-point, the latter with and without optimizing the width of signals to the embedded Digital Signal Processor (DSP) blocks of the FPGA. The optimized fixed-point model (OFPM) uses three and two times fewer DSP blocks or Look-Up Tables (LUTs), and its maximum achievable clock frequency is up to 35 % and 25 % higher, than the float model and the non-optimized fixed-point model (nOFPM), respectively. Furthermore, the models' accuracy is proportional to the clock frequency; thus, the OFPM is also the most accurate model. Finally, the paper shows the differences in the simulation when the models do or do not include losses, proving that omitting losses leads to high errors, especially during transients.


Electronics ◽  
2021 ◽  
Vol 10 (8) ◽  
pp. 884
Author(s):  
Stefano Rossi ◽  
Enrico Boni

Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources allowing massive exploitation of parallel computing are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the implementation of the embedded NVIDIA Jetson Xavier AGX module on board ULA-OP 256. The system architecture was revised to allow the introduction of a new Peripheral Component Interconnect Express (PCIe) communication channel, while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system independent, freeing the user from the need to use an external controlling PC.


Author(s):  
Ibrahem M. T. Hamidi ◽  
Farah S. H. Al-aassi

Aim: Achieve a high-throughput 128-bit FPGA-based Advanced Encryption Standard (AES) implementation. Background: Field Programmable Gate Arrays (FPGAs) provide an efficient platform for designing AES cryptography systems. They allow control over each bit through hardware description languages such as VHDL and Verilog, which results in output speeds in the Gbps range. Objective: Use an FPGA to design a high-throughput 128-bit AES implementation. Method: Pipelining was used to achieve the maximum possible speed. The pipelining includes round pipelining and internal component pipelining, in which a number of registers are inserted at particular places to increase the output speed. The proposed design uses combinational logic to implement the byte substitution. The S-box is implemented using composite field arithmetic with 7 stages of pipelining to reduce the combinational logic depth. The presented model was implemented in VHDL using the Xilinx ISE™ 14.4 design tool. Result: The achieved results were 18.55 Gbps at a clock frequency of 144.96 MHz and an area of 1568 slices on a Spartan-3 xc3s1000 device. Conclusion: The results show that the proposed design reaches a high throughput with acceptable area usage compared with other designs in the literature.
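
The throughput and clock figures reported in the abstract above are consistent with a pipeline that emits one 128-bit block per clock cycle. That one-block-per-cycle rate is an assumption inferred from the numbers, not a statement from the abstract, and the helper name `pipelined_throughput_gbps` is hypothetical. A short Python check of the arithmetic:

```python
def pipelined_throughput_gbps(block_bits, clock_mhz, blocks_per_cycle=1.0):
    """Steady-state throughput of a pipelined block cipher, in Gbps.

    Assumes the pipeline is full and emits `blocks_per_cycle` blocks per
    clock; the default of 1.0 is an assumption for illustration.
    """
    bits_per_second = block_bits * clock_mhz * 1e6 * blocks_per_cycle
    return bits_per_second / 1e9


# 128-bit AES block at the reported 144.96 MHz clock.
print(round(pipelined_throughput_gbps(128, 144.96), 2))
```

Under this assumption, 128 bits × 144.96 MHz gives about 18.55 Gbps, matching the reported result.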


2014 ◽  
Vol 989-994 ◽  
pp. 3851-3855
Author(s):  
Guang Jin Lai

Digital X-ray photography operates under computer control: a one-dimensional or 2D X-ray detector converts the captured image directly into digital signals, which are then processed with image processing techniques to enable image analysis. We introduce X-ray photography into terminal identification in track and field and use a clustering algorithm to improve computer image clustering. By capturing the digital signals of the human head, arms, and legs, the approach enhances the terminal recognition method in track and field. Finally, we use MATLAB to calculate the captured image values of X-ray photography. The calculations show that motion capture and recognition of X-ray images are clearly enhanced. This provides a theoretical basis for research on motion capture technology in track and field.


2021 ◽  
Vol 27 (3) ◽  
pp. 57-70
Author(s):  
Damjan M. Rakanovic ◽  
Vuk Vranjkovic ◽  
Rastislav J. R. Struharik

This paper proposes a two-step Convolutional Neural Network (CNN) pruning algorithm and a resource-efficient Field-Programmable Gate Array (FPGA) CNN accelerator named “Argus”. The proposed CNN pruning algorithm first combines similar kernels into clusters, which are then pruned using the same regular pruning pattern. The pruning algorithm is carefully tailored for FPGAs, considering their resource characteristics. Regular sparsity results in high Multiply-Accumulate (MAC) efficiency, reducing the amount of logic required to balance workloads among different MAC units. As a result, the Argus accelerator requires about 170 Look-Up Tables (LUTs) per Digital Signal Processor (DSP) block. This number is close to the average LUT/DSP ratio for various FPGA families, enabling balanced resource utilization when implementing Argus. Benchmarks conducted using a Xilinx Zynq UltraScale+ Multi-Processor System-on-Chip (MPSoC) indicate that Argus achieves up to 25 times higher frames per second than NullHop, 2 and 2.5 times higher than NEURAghe and Snowflake, respectively, and 2 times higher than NVDLA. Argus shows comparable performance to MIT’s Eyeriss v2 and Caffeine, requiring up to 3 times less memory bandwidth and utilizing 4 times fewer DSP blocks, respectively. Besides the absolute performance, Argus has at least 1.3 and 2 times better GOP/s/DSP and GOP/s/Block-RAM (BRAM) ratios, while being competitive in terms of GOP/s/LUT, compared to some of the state-of-the-art solutions.


Author(s):  
Christopher Wing Hong Ngau ◽  
Li-Minn Ang ◽  
Kah Phooi Seng

Studies in the area of computational vision have shown the capability of visual attention (VA) processing in aiding various visual tasks by providing a means for simplifying complex data handling and supporting action decisions using readily available low-level features. Because it includes computational biological vision components that mimic the mechanism of the human visual system, VA processing is computationally complex, has heavy memory requirements, and is usually implemented on workstations where resource constraints do not apply. In embedded systems, computational capacity and memory resources are a primary concern. To allow VA processing in such systems, the chapter presents a low-complexity, low-memory VA model based on an established mainstream VA model. It addresses critical factors in terms of algorithm complexity, memory requirements, computational speed, and salience prediction performance to ensure the reliability of VA processing in an environment with limited resources. Lastly, a custom softcore microprocessor-based hardware implementation on a Field-Programmable Gate Array (FPGA) is used to verify the implementation feasibility of the presented low-complexity, low-memory VA model.


Electronics ◽  
2019 ◽  
Vol 8 (6) ◽  
pp. 641 ◽  
Author(s):  
Miguel Rivera-Acosta ◽  
Susana Ortega-Cisneros ◽  
Jorge Rivera

This paper presents a platform that automatically generates custom hardware accelerators for convolutional neural networks (CNNs) implemented in field-programmable gate array (FPGA) devices. It includes a user interface for configuring and managing these accelerators. The platform can perform all the processes necessary to design and test CNN accelerators, starting from the CNN architecture description at both the layer and internal parameter levels, training the desired architecture with any dataset, and generating the configuration files required by the platform. With these files, it can synthesize the register-transfer level (RTL) description and program the customized CNN accelerator into the FPGA device for testing, making it possible to generate custom CNN accelerators quickly and easily. All processes except the CNN architecture description are fully automated and carried out by the platform, which manages third-party software to train the CNN and to synthesize and program the generated RTL. The platform has been tested with implementations of some of the state-of-the-art CNN architectures for freely available datasets such as MNIST, CIFAR-10, and STL-10.


2016 ◽  
Vol 850 ◽  
pp. 129-135
Author(s):  
Buğra Şimşek ◽  
Nursel Akçam

This study presents a parallelization of the Hamming Distance algorithm, which is used for iris comparison in iris recognition systems, using OpenCL for heterogeneous systems that can include Central Processing Units (CPUs), Graphics Processing Units (GPUs), Digital Signal Processing (DSP) boards, Field Programmable Gate Arrays (FPGAs), and some other mobile platforms. OpenCL allows the same code to run on CPUs, GPUs, FPGAs, and DSP boards. Heterogeneous computing refers to systems that include different kinds of devices (CPUs, GPUs, FPGAs, and other accelerators). It improves performance or reduces power consumption for suitable algorithms on these OpenCL-supported devices. In this study, the Hamming Distance algorithm was coded in C++ as sequential code and then parallelized with OpenCL using a method we designed. Our OpenCL code was executed on an Nvidia GT430 GPU and an Intel Xeon 5650 processor. The OpenCL implementation demonstrates a speedup of up to 87 times with parallelization. Our study also differs from other studies that accelerate iris matching in that it ensures heterogeneous computing by using OpenCL.
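
The Hamming Distance comparison named in the abstract above can be sketched in a few lines. This is a sequential Python illustration, not the authors' C++ or OpenCL code: the function names, the representation of iris codes as Python integers, and the optional validity masks (a common practice in iris matching for excluding occluded bits, not mentioned in the abstract) are all assumptions made for this example.

```python
def popcount(value):
    """Count the set bits of a non-negative integer."""
    return bin(value).count("1")


def fractional_hamming_distance(code_a, code_b, n_bits,
                                mask_a=None, mask_b=None):
    """Fraction of compared bits on which two binary codes disagree.

    Masks (hypothetical, optional) mark which bits are valid; only bits
    valid in both codes are compared.
    """
    valid = (1 << n_bits) - 1
    if mask_a is not None:
        valid &= mask_a
    if mask_b is not None:
        valid &= mask_b

    n_valid = popcount(valid)
    if n_valid == 0:
        raise ValueError("no valid bits to compare")

    disagreeing = (code_a ^ code_b) & valid
    return popcount(disagreeing) / n_valid


a = 0b10110010
b = 0b10011010
print(fractional_hamming_distance(a, b, 8))
```

Each pair of codes is compared independently of every other pair, so a one-to-many search over a database is a natural candidate for the kind of data-parallel execution the study describes.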

