A Hybrid Scheme Based on Pipelining and Multitasking in Mobile Application Processors for Advanced Video Coding

2015 ◽  
Vol 2015 ◽  
pp. 1-16
Author(s):  
Muhammad Asif ◽  
Imtiaz A. Taj ◽  
S. M. Ziauddin ◽  
Maaz Bin Ahmad ◽  
M. Tahir

One of the key requirements for mobile devices is to provide high-performance computing at low power consumption. The processors used in these devices provide dedicated hardware resources to handle computationally intensive video processing and interactive graphical applications. However, processors designed for low-power operation may impose limitations on the availability and usage of these resources, which presents additional challenges to system designers. Owing to the specific design of the JZ47x series of mobile application processors, a hybrid software-hardware implementation scheme for the H.264/AVC encoder is proposed in this work. The proposed scheme distributes the encoding tasks between hardware and software modules. A series of optimization techniques is developed to speed up memory access and data transfer between memories. Moreover, an efficient data-reuse design is proposed for the deblocking-filter video processing unit to reduce memory accesses. Furthermore, fine-grained macroblock (MB)-level parallelism is effectively exploited, and a pipelined approach is proposed for efficient utilization of the hardware processing cores. Finally, based on the parallelism in the proposed design, encoding tasks are distributed between two processing cores. Experiments show that, owing to the proposed techniques, the hybrid encoder is 12 times faster than a highly optimized sequential encoder.
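As a rough illustration of the MB-level pipelining idea, the sketch below runs a two-stage encoding pipeline on two threads, with a bounded queue standing in for the shared buffer between cores. The stage bodies and the two-way task split are hypothetical placeholders, not the paper's actual hardware/software partition:

```python
import queue
import threading

def stage1_estimate(mb):
    # Placeholder for the compute-heavy front-end work (e.g. motion
    # estimation / transform) assigned to the first core.
    return mb * 2

def stage2_encode(mb):
    # Placeholder for the back-end work (e.g. entropy coding /
    # deblocking) assigned to the second core.
    return mb + 1

def pipeline(macroblocks):
    q = queue.Queue(maxsize=4)   # bounded queue models the shared buffer
    results = []

    def core0():
        for mb in macroblocks:
            q.put(stage1_estimate(mb))   # core 0 works on MB i...
        q.put(None)                      # sentinel: no more macroblocks

    def core1():
        while True:
            item = q.get()
            if item is None:
                break
            results.append(stage2_encode(item))  # ...while core 1 finishes MB i-1

    t0 = threading.Thread(target=core0)
    t1 = threading.Thread(target=core1)
    t0.start(); t1.start()
    t0.join(); t1.join()
    return results
```

With this structure, the two stages of consecutive macroblocks overlap in time, which is the essence of the fine-grained pipelining the abstract describes.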

2018 ◽  
pp. 1004-1022
Author(s):  
Murad Qasaimeh ◽  
Ehab Najeh Salahat

Implementing high-performance, low-cost hardware accelerators for computationally intensive image and video processing algorithms has attracted a lot of attention over the last 20 years. Most recent research efforts have sought new design automation methods to close the gap between the ability to realize efficient accelerators in hardware and the tight performance requirements of complex image processing algorithms. High-level synthesis (HLS) is a method that automates the design process by transforming a high-level algorithmic description into digital hardware while satisfying the design constraints. This chapter evaluates the suitability of HLS as a tool for accelerating the most demanding image and video processing algorithms in hardware. It discusses the gained benefits and current limitations, recent academic and commercial tools, the compilers' optimization techniques, and four case studies.


Algorithms ◽  
2021 ◽  
Vol 14 (8) ◽  
pp. 218
Author(s):  
João V. Roque ◽  
João D. Lopes ◽  
Mário P. Véstias ◽  
José T. de Sousa

Open-source processors are increasingly being adopted by industry, which calls for all sorts of open-source implementations of peripherals and other system-on-chip modules. Despite the recent advent of open-source hardware, the available open-source caches have low configurability, lack support for single-cycle pipelined memory accesses, and use non-standard hardware interfaces. In this paper, the IObundle cache (IOb-Cache), a high-performance configurable open-source cache, is proposed, developed, and deployed. The cache has front-end and back-end modules for fast integration with processors and memory controllers. The front-end module supports a native interface, and the back-end module supports both the native interface and the standard Advanced eXtensible Interface (AXI). The cache is highly configurable in structure and access policies. The back-end can be configured to read bursts of multiple words per transfer to take advantage of the available memory bandwidth. To the best of our knowledge, IOb-Cache is currently the only configurable cache that supports pipelined Central Processing Unit (CPU) interfaces and an AXI memory bus interface. Additionally, it has a write-through buffer and an independent controller for fast (most of the time single-cycle) writing together with single-cycle reading, while previous works support only single-cycle reading. This brings the best-case clocks per instruction (CPI) close to one (1.055). IOb-Cache is integrated into the IOb System-on-Chip (IOb-SoC) GitHub repository, which has 29 stars and is already used in 50 projects (forks).
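The write-through-buffer idea can be sketched in a few lines: writes complete immediately on the CPU-facing side while a queue drains them to the slower backing memory. This is a behavioral toy model of the policy, not the IOb-Cache RTL; the class and method names are invented for illustration:

```python
from collections import deque

class WriteThroughCache:
    """Toy direct-mapped cache with a write-through buffer.

    A write updates the cache line and is queued in the buffer, so the
    CPU-facing side completes in one 'cycle'; the buffer drains to the
    slower backing memory independently.
    """
    def __init__(self, num_lines, memory):
        self.num_lines = num_lines
        self.lines = {}            # index -> (tag, data)
        self.memory = memory       # backing store: addr -> data
        self.write_buffer = deque()

    def read(self, addr):
        index, tag = addr % self.num_lines, addr // self.num_lines
        line = self.lines.get(index)
        if line and line[0] == tag:
            return line[1]                      # hit: served immediately
        data = self.memory[addr]                # miss: fetch from memory
        self.lines[index] = (tag, data)
        return data

    def write(self, addr, data):
        index, tag = addr % self.num_lines, addr // self.num_lines
        self.lines[index] = (tag, data)         # update the cache line
        self.write_buffer.append((addr, data))  # queue for the back-end

    def drain(self):
        # An independent back-end controller empties the buffer into memory.
        while self.write_buffer:
            addr, data = self.write_buffer.popleft()
            self.memory[addr] = data
```

After `write(5, 42)`, a `read(5)` hits in the cache even though the backing memory is only updated once the buffer drains, which is why reads and writes can both appear single-cycle to the CPU.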


2020 ◽  
Vol 12 (3) ◽  
pp. 415 ◽  
Author(s):  
Qiang Yin ◽  
You Wu ◽  
Fan Zhang ◽  
Yongsheng Zhou

With the development of polarimetric synthetic aperture radar (PolSAR), quantitative parameter inversion has seen great progress, especially in the field of soil parameter inversion, where it has achieved good application results. However, PolSAR datasets are often many terabytes in size, and this huge data volume directly affects the efficiency of the inversion. Therefore, the efficiency of soil moisture and roughness inversion has become a problem in the application of this PolSAR technique. A parallel realization based on a graphics processing unit (GPU) for multiple inversion models of PolSAR data is proposed in this paper. This method utilizes the high-performance parallel computing capability of a GPU to optimize the realization of the surface inversion models for polarimetric SAR data. Three classical forward scattering models and their corresponding inversion algorithms are analyzed; they differ in their polarimetric data requirements, application situations, and inversion performance. Specifically, the inversion process of PolSAR data is mainly improved through the highly concurrent threads of the GPU. Following the inversion process, various optimization strategies are applied, such as parallel task allocation and optimizations at the instruction level, in data storage, and in data transmission between the CPU and GPU. The advantages of a GPU in processing computationally intensive data are shown in the experiments, where the efficiency of soil roughness and moisture inversion is increased by one to two orders of magnitude.
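The per-pixel independence that makes such inversion GPU-friendly can be illustrated with a toy model. The `GAIN` constant and `invert_pixel` function below are invented stand-ins; the real forward models analyzed in the paper are nonlinear. The point is only the embarrassingly parallel structure, where each pixel maps to one independent task (one GPU thread in the real implementation):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical linearized forward model: backscatter = GAIN * moisture.
GAIN = 0.5

def invert_pixel(backscatter):
    # One pixel = one independent inversion task.
    return backscatter / GAIN

def invert_image(pixels, workers=4):
    # Each pixel is inverted independently; on a real GPU this map
    # becomes a kernel launch over highly concurrent threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(invert_pixel, pixels))
```

Because no pixel depends on another, the speedup scales with the number of concurrent threads, which is why the paper reports gains of one to two orders of magnitude on a GPU.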


Author(s):  
K. Bhargavi ◽  
Sathish Babu B.

GPUs (graphics processing units) are mainly used to speed up computationally intensive high-performance computing applications, and several tools and technologies are available for performing general-purpose computation on them. This chapter primarily discusses GPU parallelism, applications, and probable challenges, and also highlights some of the GPU computing platforms, including CUDA, OpenCL (Open Computing Language), OpenMPC (OpenMP extended for CUDA), MPI (Message Passing Interface), OpenACC (Open Accelerators), DirectCompute, and C++ AMP (C++ Accelerated Massive Parallelism). Each of these platforms is discussed briefly, along with its advantages and disadvantages.


2014 ◽  
Vol 2014 ◽  
pp. 1-7
Author(s):  
Avni Agarwal ◽  
P. Harsha ◽  
Swati Vasishta ◽  
S. Sivanantham

The world of 3D graphics computing has undergone a revolution in the recent past, making devices more computationally intensive and providing high-end imaging to the user. The OpenGL ES standard documents the requirements of the graphics processing unit. A prime feature of this standard is the special function unit (SFU), which performs all the required mathematical computations on the vertex information corresponding to the image. This paper presents a low-cost, high-performance SFU architecture with improved speed and reduced area. A hybrid number system is employed here to reduce the complexity of operations by suitably switching between the logarithmic number system (LNS) and the binary number system (BNS). In this work, a reduction in area and a higher operating frequency are achieved with almost the same power consumption as existing implementations.
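A minimal sketch of the trick the hybrid number system exploits: in the LNS, a multiplication becomes an addition of base-2 logarithms, at the cost of conversions into and out of the log domain. This shows the general principle only, not the paper's SFU datapath:

```python
import math

def lns_multiply(a, b):
    # In the logarithmic number system a multiplication becomes an
    # addition of exponents: log2(a*b) = log2(a) + log2(b).
    log_product = math.log2(a) + math.log2(b)
    return 2.0 ** log_product   # convert back to the binary domain

# Hybrid idea: switch to LNS where multiply/divide/power are cheap,
# and stay in BNS where add/subtract are cheap.
```

For example, `lns_multiply(8.0, 4.0)` computes `2 ** (3 + 2)`, replacing a multiplier with an adder, which is where the area and speed savings come from in hardware.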


Author(s):  
Hiroshi Yamamoto ◽  
Yasufumi Nagai ◽  
Shinichi Kimura ◽  
Hiroshi Takahashi ◽  
Satoko Mizumoto ◽  
...  

Electronics ◽  
2021 ◽  
Vol 10 (9) ◽  
pp. 1106
Author(s):  
Vladimir L. Petrović ◽  
Dragomir M. El Mezeni ◽  
Andreja Radošević

Quasi-cyclic low-density parity-check (QC-LDPC) codes are introduced as the physical-channel coding solution for data channels in 5G New Radio (5G NR). Depending on the use-case scenario, the standard proposes a wide variety of codes, which imposes the need for high encoder flexibility. LDPC codes from 5G NR have a convenient structure and can be efficiently encoded using forward substitution, without computationally intensive multiplications with dense matrices. However, state-of-the-art solutions for encoder hardware implementation can be inefficient, since many hardware processing units stay idle during the encoding process. This paper proposes a novel partially parallel architecture that provides high hardware usage efficiency (HUE) while achieving encoder flexibility and support for all 5G NR codes. The proposed architecture includes a flexible circular shifting network capable of shifting a single large bit vector or multiple smaller bit vectors, depending on the code. The encoder architecture was built around this shifter so that multiple parity-check matrix elements can be processed in parallel for short codes, providing almost the same level of parallelism as for long codes. The processing schedule was optimized for minimal encoding time using a genetic algorithm. The optimized encoder provides high throughput, low latency, and the best HUE reported to date.
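The shifter's dual mode (one large rotation versus several independent smaller rotations packed into the same word) can be modeled in software as follows; the function names and segment packing are illustrative, not the paper's hardware design:

```python
def circular_shift(bits, shift):
    # Left-rotate a bit vector by `shift` positions.
    shift %= len(bits)
    return bits[shift:] + bits[:shift]

def flexible_shift(bits, shift, num_segments=1):
    """Model of a flexible circular shifting network: rotate either one
    large vector (num_segments=1) or several independent smaller
    sub-vectors packed into the same word, as short 5G NR codes need."""
    seg = len(bits) // num_segments
    return [b for i in range(num_segments)
            for b in circular_shift(bits[i * seg:(i + 1) * seg], shift)]
```

With `num_segments > 1`, each sub-vector is rotated independently in the same pass, which is what lets short codes use the datapath as fully as long codes do.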


2021 ◽  
Vol 47 (2) ◽  
pp. 1-28
Author(s):  
Goran Flegar ◽  
Hartwig Anzt ◽  
Terry Cojean ◽  
Enrique S. Quintana-Ortí

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts to carefully reduce the working precision in order to speed up computations. For algorithms whose performance is bound by memory bandwidth, the idea of compressing data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator, such as a preconditioner, in lower than working precision, ideally without impacting the algorithm's output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner, which selects the precision format used to store the preconditioner data on the fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard but also customized formats that tailor the lengths of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.
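A software caricature of the per-block precision selection: round-trip each block through progressively cheaper IEEE formats and keep the cheapest one whose storage error stays within a tolerance. The tolerance rule here is an invented stand-in for the paper's criterion (which is based on the blocks' numerical properties), and the customized non-IEEE formats are omitted:

```python
import struct

def round_to_precision(values, fmt):
    # Store-and-reload through the given IEEE format ('e' = fp16,
    # 'f' = fp32, 'd' = fp64) to emulate reduced-precision storage.
    packed = struct.pack(f'{len(values)}{fmt}', *values)
    return list(struct.unpack(f'{len(values)}{fmt}', packed))

def store_block(block, tol=1e-3):
    """Pick the cheapest format whose round-trip error stays below a
    relative tolerance, mimicking the on-the-fly per-block precision
    selection of an adaptive block-Jacobi preconditioner."""
    for fmt in ('e', 'f', 'd'):   # try fp16, then fp32, then fp64
        stored = round_to_precision(block, fmt)
        err = max(abs(s - v) for s, v in zip(stored, block))
        if err <= tol * max(abs(v) for v in block):
            return fmt, stored
    return 'd', list(block)
```

Blocks that are exactly representable in half precision are stored in fp16 and thus cost a quarter of the memory traffic of fp64, which is the bandwidth saving the paper exploits.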


Electronics ◽  
2021 ◽  
Vol 10 (3) ◽  
pp. 253
Author(s):  
Yosang Jeong ◽  
Hoon Ryu

The non-equilibrium Green's function (NEGF) method is utilized in nanoscience to predict the transport behavior of electronic devices. This work explores how much performance improvement can be achieved for quantum transport simulations with the aid of manycore computing, where the core numerical operation is a recursive process of matrix multiplication. The major techniques adopted for performance enhancement are data restructuring, matrix tiling, thread scheduling, and offload computing, and we present technical details of how they are applied to optimize simulation performance on computing hardware including Intel Xeon Phi Knights Landing (KNL) systems and NVIDIA general-purpose graphics processing unit (GPU) devices. With a target structure of a silicon nanowire that consists of 100,000 atoms and is described with an atomistic tight-binding model, the effects of the optimization techniques on simulation performance are rigorously tested in a KNL node equipped with two Quadro GV100 GPU devices, where we observe that computation is accelerated by a factor of up to ∼20 against the unoptimized case. The feasibility of handling large-scale workloads in a huge computing environment is also examined with nanowire simulations over a wide energy range, where good scalability is obtained up to 2048 KNL nodes.
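Matrix tiling, one of the techniques listed above, can be illustrated with a blocked matrix multiply in which the loops traverse small tiles so the working set stays cache-resident. This is a pure-software sketch of the access pattern only, far from the tuned KNL/GPU kernels:

```python
def tiled_matmul(a, b, tile=2):
    """Blocked (tiled) dense matrix multiplication for square matrices.

    Processing the product tile-by-tile reuses each loaded tile many
    times before it is evicted, the cache-friendly pattern that tiled
    NEGF kernels rely on.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # Multiply one tile of `a` by one tile of `b`,
                # accumulating into the corresponding tile of `c`.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        s = c[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            s += a[i][k] * b[k][j]
                        c[i][j] = s
    return c
```

The result is identical to a plain triple loop; only the traversal order changes, which is why tiling can be applied without altering the recursive NEGF algorithm itself.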

