Optimizing Techniques for OpenCL Programs on Heterogeneous Platforms

Slo-Li Chu; Chih-Chieh Hsiao

doi:10.4018/jghpc.2012070103

Optimizing Techniques for OpenCL Programs on Heterogeneous Platforms

International Journal of Grid and High Performance Computing ◽

10.4018/jghpc.2012070103 ◽

2012 ◽

Vol 4 (3) ◽

pp. 48-62

Author(s):

Slo-Li Chu ◽

Chih-Chieh Hsiao

Keyword(s):

General Purpose ◽

Optimization Techniques ◽

Instruction Level Parallelism ◽

Heterogeneous Platforms ◽

Modern Computer ◽

Level Data ◽

Performance Programming ◽

Architectural Characteristics ◽

Level Parallelism

Heterogeneous platforms that consist of a CPU and add-on streaming processors are widely used in modern computer systems. These add-on processors provide substantially more computation capability and memory bandwidth than conventional multi-core platforms. General-purpose computations can also be offloaded onto these add-on processors. However, programming these streaming processors to realize their potential performance is challenging because of their diverse underlying architectural characteristics. Several optimization techniques are applied on OpenCL-compatible heterogeneous platforms to achieve thread-level, data-level, and instruction-level parallelism. The architectural implications of these techniques and optimization principles are discussed. Finally, a case study of the MRI-Q benchmark is presented to illustrate the capabilities of these optimization techniques. The experimental results reveal that the speedup from the non-optimized to the optimized kernel varies from 8 to 63 on different target platforms.

FuMicro: A Fused Microarchitecture Design Integrating In-Order Superscalar and VLIW

VLSI Design ◽

10.1155/2016/8787919 ◽

2016 ◽

Vol 2016 ◽

pp. 1-12 ◽

Cited By ~ 3

Author(s):

Yumin Hou ◽

Hu He ◽

Xu Yang ◽

Deyuan Guo ◽

Xu Wang ◽

...

Keyword(s):

Digital Signal ◽

General Purpose ◽

Instruction Level Parallelism ◽

Instruction Set ◽

Mode Switch ◽

Development Environment ◽

General Purpose Processor ◽

Improve Instruction ◽

Library Function ◽

Level Parallelism

This paper proposes FuMicro, a fused microarchitecture integrating both in-order superscalar and Very Long Instruction Word (VLIW) in a single core. A processor with the FuMicro microarchitecture can alternate between in-order superscalar and VLIW modes, using the same pipeline and the same Instruction Set Architecture (ISA). A small modification to the compiler is made to expand the register file in VLIW mode. The mode-switch decision is made by software and needs no extra hardware. VLIW code can be exploited in the form of library functions, so users are exposed only to superscalar mode; by this means, we can provide users with a convenient development environment. FuMicro could serve as a universal microarchitecture because it can be applied to different ISAs. In this paper, we focus on the implementation of FuMicro with the ARM ISA. This architecture is evaluated on gem5, which is a cycle-accurate microarchitecture simulation platform. By adopting the FuMicro microarchitecture, performance is improved by 10% on average, with the best improvement being 47.3%, compared with pure in-order superscalar mode. The result shows that the FuMicro microarchitecture can improve Instruction Level Parallelism (ILP) significantly, making it promising for expanding digital signal processing capability on a General Purpose Processor.

Design and Development of Stream Processor Architecture for GPU Application Using Reconfigurable Computing

International Journal of Reconfigurable and Embedded Systems (IJRES) ◽

10.11591/ijres.v2.i1.pp1-14 ◽

2013 ◽

Vol 2 (1) ◽

pp. 1

Author(s):

Sanket Dessai ◽

Krishna Bhushan Vutukuru

Keyword(s):

Reconfigurable Computing ◽

General Purpose ◽

Instruction Level Parallelism ◽

Stream Processor ◽

Subword Parallelism ◽

Host Processor ◽

Graphical Processing ◽

Processor Unit ◽

Pipelined Multiplier ◽

Level Parallelism

Graphical Processing Units (GPUs) have become an integral part of today’s mainstream computing systems. They are also being used as reprogrammable General Purpose GPUs (GP-GPUs) to perform complex scientific computations. Reconfigurability is an attractive approach to embedded systems, allowing hardware-level modification. Hence, there is a high demand for GPU designs based on reconfigurable hardware. A stream processor consists of clusters of functional units which provide a bandwidth hierarchy, supporting hundreds of arithmetic units. The arithmetic cluster units are designed to exploit instruction-level parallelism and subword parallelism within a cluster and data parallelism across the clusters. To decrease area and power, a single controller is used to control data flow between clusters and between the host processor and the GPU. The design of the stream processor unit has been carried out in Verilog on Altera Quartus II and simulated using ModelSim tools. The functionality of the modelled blocks is verified using test inputs in the simulator. The simulated execution time is 60 ps for the 8-bit pipelined multiplier and 100 ns for the 8-bit pipelined adder when operating at 90 MHz.

Boosting Parallel Applications Performance on Applying DIM Technique in a Multiprocessing Environment

International Journal of Reconfigurable Computing ◽

10.1155/2011/546962 ◽

2011 ◽

Vol 2011 ◽

pp. 1-13 ◽

Cited By ~ 1

Author(s):

Mateus B. Rutzig ◽

Antonio C. S. Beck ◽

Felipe Madruga ◽

Marco A. Alves ◽

Henrique C. Freitas ◽

...

Keyword(s):

General Purpose ◽

Parallel Applications ◽

Instruction Level Parallelism ◽

Great Level ◽

Embedded Processor ◽

Wide Range ◽

Thread Level Parallelism ◽

Multiprocessing Systems ◽

Performance Gains ◽

Level Parallelism

Limits of instruction-level parallelism and higher transistor density sustain the increasing need for multiprocessor systems: they are rapidly taking over both general-purpose and embedded processor domains. Current multiprocessing systems are composed either of many homogeneous and simple cores or of complex superscalar, simultaneous multithread processing elements. As parallel applications are becoming increasingly present in embedded and general-purpose domains and multiprocessing systems must handle a wide range of different application classes, there is no consensus over which hardware solutions best exploit instruction-level parallelism (ILP) and thread-level parallelism (TLP) together. Therefore, in this work, we have expanded the DIM (dynamic instruction merging) technique to be used in a multiprocessing scenario, proving the need for adaptable ILP exploitation even in TLP architectures. We have successfully coupled a dynamic reconfigurable system to a SPARC-based multiprocessor and obtained performance gains of up to 40%, even for applications that show a great level of parallelism at thread level.

Improving ILP via Fused In-Order Superscalar and VLIW Instruction Dispatch Methods

Journal of Circuits System and Computers ◽

10.1142/s0218126619500208 ◽

2018 ◽

Vol 28 (02) ◽

pp. 1950020 ◽

Cited By ~ 1

Author(s):

Yumin Hou ◽

Xu Wang ◽

Jiawei Fu ◽

Junping Ma ◽

Hu He ◽

...

Keyword(s):

Prediction Method ◽

Digital Signal ◽

General Purpose ◽

Performance Comparison ◽

Instruction Level Parallelism ◽

Superscalar Processor ◽

Performance Improvements ◽

General Purpose Processor ◽

Evaluation Board ◽

Level Parallelism

In order to expand the computation capability of digital signal processing on a General Purpose Processor (GPP), we propose a fused microarchitecture that improves Instruction Level Parallelism (ILP) by supporting both in-order superscalar and very long instruction word (VLIW) dispatch methods in a single pipeline. This design is based on the ARMv7-A&R Instruction Set Architecture (ISA). To provide a performance comparison, we first design an in-order superscalar processor, considering that ARM GPPs always adopt superscalar approaches. We then add a VLIW dispatch method to this processor to realize the fused microarchitecture. Both designs are evaluated on a Xilinx 7-series FPGA (XC7K325T-2FFG900C), using the Xilinx Vivado design suite. The results show that, compared with the superscalar processor, the processor working in VLIW mode improves performance by 15% and 8%, respectively, when running the EEMBC and DSPstone benchmarks. We also run the two benchmarks on an ARM Cortex-A9 processor, which is integrated in the Zynq-7000 AP SoC device on the Xilinx ZC706 evaluation board. The processor in VLIW mode shows 44% and 30% performance improvements over the ARM Cortex-A9. The fused microarchitecture adopts a combined bimodal and PAp branch prediction method. This method achieves 93.7% prediction accuracy with limited hardware overhead.

Software Carry-Save: A Case Study for Instruction-Level Parallelism

Lecture Notes in Computer Science - Parallel Computing Technologies ◽

10.1007/978-3-540-45145-7_18 ◽

2003 ◽

pp. 207-214 ◽

Cited By ~ 1

Author(s):

David Defour ◽

Florent de Dinechin

Keyword(s):

Instruction Level Parallelism ◽

Level Parallelism

Implementing and Optimizing of Entire System Toolkit of VLIW DSP Processors for Embedded Sensor-Based Systems

Scientific Programming ◽

10.1155/2015/507896 ◽

2015 ◽

Vol 2015 ◽

pp. 1-7

Author(s):

Xu Yang ◽

Mingbin Zeng ◽

Yanjun Zhang

Keyword(s):

State Of The Art ◽

Optimization Techniques ◽

Instruction Level Parallelism ◽

Entire System ◽

Software Toolkit ◽

Compiler Framework ◽

Vliw Dsp ◽

Dsp Processors ◽

Level Parallelism ◽

Develop System

VLIW DSPs can greatly enhance instruction-level parallelism, providing the capacity to meet the performance and energy-efficiency requirements of sensor-based systems. However, exploiting VLIW DSPs in the sensor-based domain poses a heavy challenge for software toolkit design. In this paper, we present our methods and experiences in developing system toolkit flows for a VLIW DSP designed specifically for sensor-based systems. Our system toolkit includes a compiler, assembler, linker, debugger, and simulator. We present experimental results for the compiler framework, which incorporates several state-of-the-art optimization techniques for this VLIW DSP. The results indicate that our framework can greatly improve performance and energy consumption relative to code generated without it.

Authoritarianism and anonymity

ACM SIGCAS Computers and Society ◽

10.1145/3447913.3447926 ◽

2021 ◽

Vol 49 (3) ◽

pp. 19-27

Author(s):

Michael Heron ◽

Pauline Belford

Keyword(s):

Computer Ethics ◽

Teaching Tool ◽

Academic Context ◽

Modern Computer ◽

Culture Of Fear ◽

Multiple Groups ◽

The University ◽

Witch Hunts

The Scandal in Academia [32] [33] [34] [35] is an extended fictional case study intended for use as a teaching and discussion aid for educational practitioners looking to introduce elements of computer ethics into their curricula. Inspired by Epstein [17] [18], it is a full-cycle scenario involving many individuals which touches upon the complexity and interrelations of modern computer ethics. It has been trialled and evaluated as a teaching tool by the authors [36] and with multiple groups since then. However, its utility as a general resource is limited without the academic context that supports deeper investigation of the material. It is to address this issue that the authors offer this commentary on the Scandal, with a focus on the ninth and tenth newspaper items presented within. Specifically, these are "Culture of Fear and Nepotism at University" and "Witch-Hunts at the University - IT Crackdown Causes Criticisms".

Lightweight Blockchain Processing. Case Study: Scanned Document Tracking on Tezos Blockchain

Applied Sciences ◽

10.3390/app11157169 ◽

2021 ◽

Vol 11 (15) ◽

pp. 7169

Author(s):

Mohamed Allouche ◽

Tarek Frikha ◽

Mihai Mitrea ◽

Gérard Memmi ◽

Faten Chaabane

Keyword(s):

Load Balancing ◽

Relative Error ◽

Execution Time ◽

General Purpose ◽

Experimental Results ◽

Raspberry Pi ◽

Embedded Platform ◽

Memory Resources ◽

Processing Solution

To bridge the current gap between Blockchain expectations and their intensive computation constraints, the present paper advances a lightweight processing solution, based on a load-balancing architecture, compatible with the lightweight/embedded processing paradigms. In this way, the execution of complex operations is securely delegated to an off-chain general-purpose computing machine while the intimate Blockchain operations are kept on-chain. The illustrations correspond to an on-chain Tezos configuration and to a multiprocessor ARM embedded platform (integrated into a Raspberry Pi). Performance is assessed in terms of security, execution time, and CPU consumption when carrying out a visual document fingerprint task. It is thus demonstrated that the advanced solution makes it possible for a computing-intensive application to be deployed under severely constrained computation and memory resources, as set by a Raspberry Pi 3. The experimental results show that up to nine Tezos nodes can be deployed on a single Raspberry Pi 3 and that the limitation derives not from the memory but from the computation resources. The execution time with a limited number of fingerprints is 40% higher than using a classical PC solution (value computed with 95% relative error lower than 5%).

Microarchitectural Characterization on a Mobile Workload

Applied Sciences ◽

10.3390/app11031225 ◽

2021 ◽

Vol 11 (3) ◽

pp. 1225

Author(s):

Woohyong Lee ◽

Jiyoung Lee ◽

Bo Kyung Park ◽

R. Young Chul Kim

Keyword(s):

Performance Monitoring ◽

Performance Metrics ◽

Performance Comparison ◽

Instruction Level Parallelism ◽

Data Set ◽

Performance Events ◽

Hardware Performance Counters ◽

On Chip ◽

The Comparative Study ◽

Level Parallelism

Geekbench is one of the most referenced cross-platform benchmarks in the mobile world. Most of its workloads are synthetic, but some of them aim to simulate real-world behavior. In the mobile world, its microarchitectural behavior has rarely been reported, since the hardware profiling features available to the public are limited. Although Geekbench is a popular mobile performance workload, its microarchitectural characteristics on mobile devices are hard to find. In this paper, a thorough experimental study of Geekbench performance characterization is reported with detailed performance metrics. This study also identifies mobile system on chip (SoC) microarchitecture impacts, such as the cache subsystem, instruction-level parallelism, and branch performance. After the study, we could understand the bottleneck of workloads, especially in the cache subsystem. This means that a change of data set size directly and significantly impacts the performance score in some systems and will ruin the fairness of the CPU benchmark. In the experiment, Samsung’s Exynos9820-based platform was used as the tested device with Android Native Development Kit (NDK) built binaries. The Exynos9820 is a superscalar processor capable of dual issuing some instructions. To help performance analysis, we enable the capability to collect performance events with performance monitoring unit (PMU) registers. The PMU is a set of hardware performance counters which are built into microprocessors to store the counts of hardware-related activities. Throughout the experiment, functional and microarchitectural performance profiles were fully studied. This paper describes the details of the mobile performance studies above. In our experiment, the ARM DS5 tool was used for collecting runtime PMU profiles including OS-level performance data. After the comparative study is completed, users will understand more about the mobile architecture behavior, and this will help to evaluate which benchmark is preferable for fair performance comparison.
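
The abstract above describes the PMU as a set of counters for hardware events. As a rough illustration of how such raw counts are commonly turned into the kinds of metrics it mentions (instruction-level parallelism, branch and cache behavior), here is a minimal Python sketch. The `PmuSample` type, its field names, and the sample numbers are hypothetical and are not taken from the paper or from any specific profiling tool.

```python
from dataclasses import dataclass


@dataclass
class PmuSample:
    """Hypothetical raw counter values collected over one measurement window."""
    cycles: int
    instructions: int
    branches: int
    branch_misses: int
    cache_refs: int
    cache_misses: int


def derive_metrics(s: PmuSample) -> dict:
    """Turn raw event counts into common derived performance metrics."""
    return {
        # Instructions per cycle: a coarse indicator of achieved ILP.
        "ipc": s.instructions / s.cycles,
        # Fraction of branches that were mispredicted.
        "branch_miss_rate": s.branch_misses / s.branches,
        # Fraction of cache references that missed.
        "cache_miss_rate": s.cache_misses / s.cache_refs,
        # Cache misses per thousand instructions (MPKI).
        "cache_mpki": 1000 * s.cache_misses / s.instructions,
    }


# Made-up counts, for illustration only.
sample = PmuSample(
    cycles=2_000_000,
    instructions=3_000_000,
    branches=500_000,
    branch_misses=25_000,
    cache_refs=800_000,
    cache_misses=40_000,
)
metrics = derive_metrics(sample)
```

Normalizing misses by instruction count (MPKI) rather than by references alone makes runs with different data set sizes easier to compare, which is the kind of sensitivity the abstract highlights.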

Investigating the Potential of Network Optimization for a Constrained Object Detection Problem

Journal of Imaging ◽

10.3390/jimaging7040064 ◽

2021 ◽

Vol 7 (4) ◽

pp. 64

Author(s):

Tanguy Ophoff ◽

Cédric Gullentops ◽

Kristof Van Beeck ◽

Toon Goedemé

Keyword(s):

Computational Complexity ◽

Object Detection ◽

Network Optimization ◽

Real Life ◽

Optimization Techniques ◽

Training Data ◽

Single Shot ◽

Standard Object ◽

Number Of Classes

Object detection models are usually trained and evaluated on highly complicated, challenging academic datasets, which results in deep networks requiring lots of computations. However, a lot of operational use-cases consist of more constrained situations: they have a limited number of classes to be detected, less intra-class variance, less lighting and background variance, constrained or even fixed camera viewpoints, etc. In these cases, we hypothesize that smaller networks could be used without deteriorating the accuracy. However, there are multiple reasons why this does not happen in practice. Firstly, overparameterized networks tend to learn better, and secondly, transfer learning is usually used to reduce the necessary amount of training data. In this paper, we investigate how much we can reduce the computational complexity of a standard object detection network in such constrained object detection problems. As a case study, we focus on a well-known single-shot object detector, YoloV2, and combine three different techniques to reduce the computational complexity of the model without reducing its accuracy on our target dataset. To investigate the influence of the problem complexity, we compare two datasets: a prototypical academic (Pascal VOC) and a real-life operational (LWIR person detection) dataset. The three optimization steps we exploited are: swapping all the convolutions for depth-wise separable convolutions, performing pruning, and using weight quantization. The results of our case study indeed substantiate our hypothesis that the more constrained a problem is, the more the network can be optimized. On the constrained operational dataset, combining these optimization techniques allowed us to reduce the computational complexity by a factor of 349, as compared to only a factor of 9.8 on the academic dataset. When running a benchmark on an Nvidia Jetson AGX Xavier, our fastest model runs more than 15 times faster than the original YoloV2 model, whilst increasing the accuracy by 5% Average Precision (AP).
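
To give a sense of why the first optimization step named above, depth-wise separable convolutions, lowers computational cost, the following Python sketch compares multiply-accumulate (MAC) counts for a single layer using the standard textbook cost model. It assumes stride 1 and "same" padding, so the output has the same spatial size as the input; the layer dimensions are illustrative and are not taken from the paper or from YoloV2.

```python
def standard_conv_macs(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    """MACs for a standard k x k convolution producing an h x w output map."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_macs(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    """MACs for a depth-wise k x k convolution followed by a 1 x 1 point-wise one."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def reduction_factor(k: int, c_in: int, c_out: int, h: int, w: int) -> float:
    """How many times fewer MACs the separable variant needs for this layer."""
    return standard_conv_macs(k, c_in, c_out, h, w) / depthwise_separable_macs(
        k, c_in, c_out, h, w
    )


# Illustrative layer: 3 x 3 kernel, 256 -> 256 channels, 13 x 13 feature map.
factor = reduction_factor(3, 256, 256, 13, 13)
```

Under this model the per-layer saving simplifies to 1 / (1/c_out + 1/k²), so a 3 x 3 layer can save at most a factor of 9 from this step alone. The much larger overall factors reported in the abstract come from combining it with pruning and weight quantization.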
