FuMicro: A Fused Microarchitecture Design Integrating In-Order Superscalar and VLIW

VLSI Design, DOI 10.1155/2016/8787919, 2016, Vol 2016, pp. 1-12, Cited By ~ 3

Author(s): Yumin Hou, Hu He, Xu Yang, Deyuan Guo, Xu Wang, ...

Keyword(s): Digital Signal, General Purpose, Instruction Level Parallelism, Instruction Set, Mode Switch, Development Environment, General Purpose Processor, Improve Instruction, Library Function, Level Parallelism

This paper proposes FuMicro, a fused microarchitecture that integrates in-order superscalar and Very Long Instruction Word (VLIW) execution in a single core. A processor with the FuMicro microarchitecture can alternate between in-order superscalar mode and VLIW mode, using the same pipeline and the same Instruction Set Architecture (ISA). A small modification to the compiler expands the register file in VLIW mode. Mode switches are decided by software, so no extra hardware is needed. VLIW code can be packaged as library functions so that users are exposed only to superscalar mode, which gives them a convenient development environment. FuMicro could serve as a universal microarchitecture because it can be applied to different ISAs. In this paper, we focus on the implementation of FuMicro with the ARM ISA. The architecture is evaluated on gem5, a cycle-accurate microarchitecture simulation platform. Compared with pure in-order superscalar mode, FuMicro improves performance by 10% on average, with a best-case improvement of 47.3%. The results show that FuMicro can improve Instruction Level Parallelism (ILP) significantly, making it a promising way to expand digital signal processing capability on a General Purpose Processor.


Improving ILP via Fused In-Order Superscalar and VLIW Instruction Dispatch Methods

Journal of Circuits System and Computers, DOI 10.1142/s0218126619500208, 2018, Vol 28 (02), pp. 1950020, Cited By ~ 1

Author(s): Yumin Hou, Xu Wang, Jiawei Fu, Junping Ma, Hu He, ...

Keyword(s): Prediction Method, Digital Signal, General Purpose, Performance Comparison, Instruction Level Parallelism, Superscalar Processor, Performance Improvements, General Purpose Processor, Evaluation Board, Level Parallelism

To expand the digital signal processing capability of a General Purpose Processor (GPP), we propose a fused microarchitecture that improves Instruction Level Parallelism (ILP) by supporting both in-order superscalar and very long instruction word (VLIW) dispatch methods in a single pipeline. The design is based on the ARMv7-A&R Instruction Set Architecture (ISA). To provide a performance comparison, we first design an in-order superscalar processor, since ARM GPPs always adopt superscalar approaches. We then add the VLIW dispatch method to this processor to realize the fused microarchitecture. Both designs are evaluated on a Xilinx 7-series FPGA (XC7K325T-2FFG900C) using the Xilinx Vivado design suite. The results show that, compared with the superscalar processor, the processor working in VLIW mode improves performance by 15% and 8% when running the EEMBC and DSPstone benchmarks, respectively. We also run the two benchmarks on the ARM Cortex-A9 processor integrated in the Zynq-7000 AP SoC device on the Xilinx ZC706 evaluation board. The processor in VLIW mode shows 44% and 30% performance improvements over the ARM Cortex-A9. The fused microarchitecture adopts a combined bimodal and PAp branch prediction method, which achieves 93.7% prediction accuracy with limited hardware overhead.
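As a rough illustration of the bimodal half of such a predictor, the sketch below models a table of 2-bit saturating counters indexed by low program-counter bits. It is not the authors' design: the class name `BimodalPredictor`, the table size, the initial counter value, the PC-indexing scheme, and the synthetic trace are all assumptions made for this example, and the PAp component and the way the two predictors are combined are not modelled.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by low PC bits (illustrative only)."""

    def __init__(self, index_bits=10):
        # index_bits is an assumed table size, not a value from the paper.
        self.mask = (1 << index_bits) - 1
        # Assumption: every counter starts at 1 (weakly not-taken).
        self.counters = [1] * (1 << index_bits)

    def _index(self, pc):
        # Drop the two low bits, assuming 4-byte aligned instructions.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, trace):
    """Fraction of correctly predicted branches in a list of (pc, taken) pairs."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


# Synthetic trace: one loop branch, taken nine times and then not taken, repeated.
loop_trace = [(0x8000, i % 10 != 9) for i in range(100)]
print(accuracy(BimodalPredictor(), loop_trace))
```

A bimodal table like this mispredicts each loop exit, which is the kind of pattern a history-based predictor such as PAp is meant to capture.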


Design and Development of Stream Processor Architecture for GPU Application Using Reconfigurable Computing

International Journal of Reconfigurable and Embedded Systems (IJRES), DOI 10.11591/ijres.v2.i1.pp1-14, 2013, Vol 2 (1), pp. 1

Author(s): Sanket Dessai, Krishna Bhushan Vutukuru

Keyword(s): Reconfigurable Computing, General Purpose, Instruction Level Parallelism, Stream Processor, Subword Parallelism, Host Processor, Graphical Processing, Processor Unit, Pipelined Multiplier, Level Parallelism

Graphical Processing Units (GPUs) have become an integral part of today’s mainstream computing systems. They are also being used as reprogrammable General Purpose GPUs (GP-GPUs) to perform complex scientific computations. Reconfigurability is an attractive approach for embedded systems because it allows hardware-level modification. Hence, there is a high demand for GPU designs based on reconfigurable hardware. A stream processor consists of clusters of functional units that provide a bandwidth hierarchy, supporting hundreds of arithmetic units. The arithmetic cluster units are designed to exploit instruction-level parallelism and subword parallelism within a cluster and data parallelism across the clusters. To decrease area and power, a single controller is used to control data flow between clusters and between the host processor and the GPU. The stream processor unit has been designed in Verilog on Altera Quartus II and simulated using ModelSim tools. The functionality of the modelled blocks is verified using test inputs in the simulator. The simulated execution time is 60 ps for the 8-bit pipelined multiplier and 100 ns for the 8-bit pipelined adder when operating at 90 MHz.


The specialization of general purpose processor architecture elements for programmable digital signal processors

Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors (Cat. No.99CB37040), DOI 10.1109/iccd.1999.808589, 2003

Author(s): D. Steiss

Keyword(s): Digital Signal, General Purpose, Digital Signal Processors, Processor Architecture, General Purpose Processor, Signal Processors


Rapid Composition for Networked Devices: HappyBrackets

Computer Music Journal, DOI 10.1162/comj_a_00520, 2020, Vol 43 (2-3), pp. 89-108

Author(s): Angelo Fraietta, Oliver Bown, Sam Ferguson, Sam Gillespie, Liam Bray

Keyword(s): Digital Signal, General Purpose, Networked Systems, Programming Environment, Input Output, Development Environment, Integrated Development, And Performance, Creative Coding, Compiler Techniques

This article introduces an open-source Java-based programming environment for creative coding of agglomerative systems using Internet-of-Things (IoT) technologies. Our software originally focused on digital signal processing of audio—including synthesis, sampling, granular sample playback, and a suite of basic effects—but composers now use it to interface with sensors and peripherals through general-purpose input/output and external networked systems. This article examines and addresses the strategies required to integrate novel embedded musical interfaces and creative coding paradigms through an IoT infrastructure. These include: the use of advanced tooling features of a professional integrated development environment as a composition or performance interface rather than just as a compiler; techniques to create media works using features such as autodetection of sensors; seamless and serverless communication among devices on the network; and uploading, updating, and running of new compositions to the device without interruption. Furthermore, we examined the difficulties many novice programmers experience when learning to write code, and we developed strategies to address these difficulties without restricting the potential available in the coding environment. We also examined and developed methods to monitor and debug devices over the network, allowing artists and programmers to set and retrieve current variable values to or from these devices during the performance and composition stages. Finally, we describe three types of art work that demonstrate how the software, called HappyBrackets, is being used in live-coding and dance performances, in interactive sound installations, and as an advanced composition and performance tool for multimedia works.


Research and Design of Superscalar Pipeline Processor Based on CPLD Technology

Applied Mechanics and Materials, DOI 10.4028/www.scientific.net/amm.427-429.2822, 2013, Vol 427-429, pp. 2822-2825

Author(s): Chang Qin Yan, Yan Yan Yu, Qian Huang, Jun Yang

Keyword(s): Performance Test, Hardware Verification, Advanced Technology, Instruction Level Parallelism, Central Processor, Graphic Accelerator, Superscalar Pipeline, And Performance, Improve Instruction, Level Parallelism

Superscalar pipelining is an advanced technique for improving instruction-level parallelism and is widely used in computer central processors and graphics accelerators. In this paper, we take advantage of the inherent flexibility, usability, predictability, and other strengths of CPLD devices to implement superscalar pipelining, and we design and construct a superscalar pipeline processor model based on a RISC instruction set. Using EDA technology with a top-down design method, we present hardware verification and performance test results for the processor model and explore ideas and methods for processor design with EDA technology.


Boosting Parallel Applications Performance on Applying DIM Technique in a Multiprocessing Environment

International Journal of Reconfigurable Computing, DOI 10.1155/2011/546962, 2011, Vol 2011, pp. 1-13, Cited By ~ 1

Author(s): Mateus B. Rutzig, Antonio C. S. Beck, Felipe Madruga, Marco A. Alves, Henrique C. Freitas, ...

Keyword(s): General Purpose, Parallel Applications, Instruction Level Parallelism, Great Level, Embedded Processor, Wide Range, Thread Level Parallelism, Multiprocessing Systems, Performance Gains, Level Parallelism

Limits of instruction-level parallelism and higher transistor density sustain the increasing need for multiprocessor systems: they are rapidly taking over both general-purpose and embedded processor domains. Current multiprocessing systems are composed either of many homogeneous and simple cores or of complex superscalar, simultaneous multithread processing elements. As parallel applications are becoming increasingly present in embedded and general-purpose domains and multiprocessing systems must handle a wide range of different application classes, there is no consensus on which hardware solutions best exploit instruction-level parallelism (ILP) and thread-level parallelism (TLP) together. Therefore, in this work, we have expanded the DIM (dynamic instruction merging) technique to be used in a multiprocessing scenario, proving the need for adaptable ILP exploitation even in TLP architectures. We have successfully coupled a dynamic reconfigurable system to a SPARC-based multiprocessor and obtained performance gains of up to 40%, even for applications that show a great level of parallelism at thread level.


IMPROVING DLX PROCESSOR PERFORMANCE WITH THE PIPELINE METHOD

TEKTRIKA - Jurnal Penelitian dan Pengembangan Telekomunikasi Kendali Komputer Elektrik dan Elektronika, DOI 10.25124/tektrika.v8i2.230, 2016, Vol 8 (2)

Author(s): Maman Abdurohman

Keyword(s): General Purpose, Memory Access, Instruction Set, Instruction Fetch, General Purpose Processor, Cpu Time, Time Cycle, Reduced Instruction Set Computer

The DLX processor is a RISC (Reduced Instruction Set Computer) based processor designed as a general purpose processor. It has a load-store architecture in which every instruction is 32 bits long. Each instruction is executed over several clock cycles. In general, five stages are used: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB). These five stages are carried out in sequence [2]. As a multicycle processor, DLX has room to improve its performance, which is measured as processing speed expressed as CPU time. The performance of the DLX processor can be improved by applying the pipeline technique. This article analyzes the performance improvement of the DLX processor obtained with the pipeline technique. Experiments were run on several application programs executed with and without the pipeline technique. In general, speed increased for every set of instructions analyzed. Testing was carried out using the windlx simulator, which is a DLX processor simulator.

Keywords: DLX processor, RISC, general purpose processor, CPU, pipeline, windlx
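The gain from pipelining the five stages (IF, ID, EX, MEM, WB) can be estimated with the textbook ideal cycle count. The sketch below is a back-of-the-envelope model, not the article's measurement: it assumes one cycle per stage, and the function names and the optional `stall_cycles` parameter are hypothetical helpers introduced for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def multicycle_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles when each instruction finishes all stages before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=len(STAGES), stall_cycles=0):
    """Ideal pipelined cycles: fill the pipeline once, then retire one
    instruction per cycle. stall_cycles is an assumed lump-sum hazard penalty."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


def speedup(n_instructions, stall_cycles=0):
    """Ratio of multicycle to pipelined cycle counts for the same program."""
    return multicycle_cycles(n_instructions) / pipelined_cycles(
        n_instructions, stall_cycles=stall_cycles
    )


print(multicycle_cycles(100), pipelined_cycles(100), round(speedup(100), 2))
```

In this idealized model the speedup approaches the number of stages as the instruction count grows; hazards and stalls in real programs reduce it.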


HARDWARE-SOFTWARE COMPLEX FOR DIGITAL PROCESSING OF HYDROACOUSTIC SIGNALS

Issues of radio electronics, DOI 10.21778/2218-5453-2019-5-60-63, 2019, pp. 60-63

Author(s): S. A. Koltakov, A. A. Cherepnev

Keyword(s): Signal Processing, High Speed, Digital Signal, Digital Processing, General Purpose, Software Complex, General Purpose Processor, Input And Output, Programming Tools, Modern Methods

The article describes a hardware-software complex (HSC) based on a debugging stand, including its composition, modules, and operation. A method for synthesizing the output signal is described, and a formula and a table of parameters for its calculation are given. Signals and spectra at the input and output of the developed HSC are shown. Performance parameters are obtained for several HSC variants: one based on a signal processor paired with a general-purpose processor, and two based on general-purpose processors. The proposed HSC is 2–3 times faster than the HSC based on an Intel general-purpose processor. This is achieved through the use of modern methods and programming tools, digital signal processing modules, and optimization of the executable code. Recommendations are given for further improvement of the proposed complex, which is possible through the use of modern FPGAs and a high-speed interface.


INSTRUCTION-SET EXTENSION FOR CRYPTOGRAPHIC APPLICATIONS ON RECONFIGURABLE PLATFORM

Journal of Circuits System and Computers, DOI 10.1142/s0218126607004076, 2007, Vol 16 (06), pp. 911-927

Author(s): S. MAJZOUB, H. DIAB

Keyword(s): Reconfigurable Computing, General Purpose, Coarse Grain, Instruction Set, General Purpose Processor, Instruction Set Extension, Custom Hardware, Reconfigurable Platform, And Performance, Bitwise Operations

Reconfigurable systems represent a middle ground in the trade-off between speed and flexibility in processor design. They provide performance close to that of custom hardware while preserving some of the flexibility of a general-purpose processor. Recently, reconfigurable computing has received considerable interest in both its forms: FPGAs and coarse-grain hardware. Since the field is still developing, it is important to perform hardware analysis and evaluation of key applications on target reconfigurable architectures to identify potential limitations and improvements. This paper presents the mapping and performance analysis of two encryption algorithms, Rijndael and Twofish, on a coarse-grain reconfigurable platform, MorphoSys. MorphoSys is a reconfigurable architecture targeted at multimedia applications. Since many cryptographic algorithms involve bitwise operations, a bitwise instruction set extension was proposed to enhance performance. We present the details of the mapping of the bitwise operations involved in the algorithms, with thorough analysis. The methodology we used can be applied to other systems.
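To give a sense of the bitwise operations involved, the sketch below shows the standard software form of multiplication in GF(2^8) used by Rijndael, built from shifts and XORs with the reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B). It is a plain software illustration only; it does not represent the MorphoSys mapping or the proposed instruction set extension, and the function names `xtime` and `gf_mul` are chosen for this example.

```python
def xtime(b):
    """Multiply a GF(2^8) element by x: shift left, then reduce by 0x11B on overflow."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF


def gf_mul(a, b):
    """Multiply two GF(2^8) elements with shift-and-XOR (no integer multiply)."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # add (XOR) the current multiple of a
        a = xtime(a)      # advance to the next power-of-x multiple
        b >>= 1
    return result


print(hex(gf_mul(0x57, 0x13)))
```

Every step is a shift, a mask, or an XOR, which is why dedicated bitwise instructions can plausibly shorten such inner loops.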


Optimizing Techniques for OpenCL Programs on Heterogeneous Platforms

International Journal of Grid and High Performance Computing, DOI 10.4018/jghpc.2012070103, 2012, Vol 4 (3), pp. 48-62

Author(s): Slo-Li Chu, Chih-Chieh Hsiao

Keyword(s): General Purpose, Optimization Techniques, Instruction Level Parallelism, Heterogeneous Platforms, Modern Computer, Level Data, Performance Programming, Architectural Characteristics, Level Parallelism

Heterogeneous platforms consisting of a CPU and add-on streaming processors are widely used in modern computer systems. These add-on processors provide substantially more computation capability and memory bandwidth than conventional multi-core platforms. General-purpose computations can also be offloaded onto these add-on processors. However, programming these streaming processors to reach their potential performance is challenging because of their diverse underlying architectural characteristics. Several optimization techniques are applied on OpenCL-compatible heterogeneous platforms to achieve thread-level, data-level, and instruction-level parallelism. The architectural implications of these techniques and the optimization principles are discussed. Finally, a case study of the MRI-Q benchmark illustrates the capabilities of these optimization techniques. The experimental results reveal that the speedup from the non-optimized to the optimized kernel varies from 8 to 63 on different target platforms.
