FuMicro: A Fused Microarchitecture Design Integrating In-Order Superscalar and VLIW

VLSI Design ◽  
2016 ◽  
Vol 2016 ◽  
pp. 1-12 ◽  
Author(s):  
Yumin Hou ◽  
Hu He ◽  
Xu Yang ◽  
Deyuan Guo ◽  
Xu Wang ◽  
...  

This paper proposes FuMicro, a fused microarchitecture that integrates in-order superscalar and Very Long Instruction Word (VLIW) execution in a single core. A processor with the FuMicro microarchitecture can run in either in-order superscalar or VLIW mode, using the same pipeline and the same Instruction Set Architecture (ISA). A small modification to the compiler expands the register file in VLIW mode. Mode switching is decided by software and requires no extra hardware. VLIW code can be provided as library functions so that users are exposed only to superscalar mode, which gives them a convenient development environment. FuMicro could serve as a universal microarchitecture because it can be applied to different ISAs. In this paper, we focus on the implementation of FuMicro with the ARM ISA. The architecture is evaluated on gem5, a cycle-accurate microarchitecture simulation platform. Compared with pure in-order superscalar mode, FuMicro improves performance by 10% on average and by 47.3% in the best case. The results show that FuMicro can improve Instruction Level Parallelism (ILP) significantly, making it a promising way to expand digital signal processing capability on a General Purpose Processor.

2018 ◽  
Vol 28 (02) ◽  
pp. 1950020 ◽  
Author(s):  
Yumin Hou ◽  
Xu Wang ◽  
Jiawei Fu ◽  
Junping Ma ◽  
Hu He ◽  
...  

To expand the digital signal processing capability of a General Purpose Processor (GPP), we propose a fused microarchitecture that improves Instruction Level Parallelism (ILP) by supporting both in-order superscalar and Very Long Instruction Word (VLIW) dispatch methods in a single pipeline. The design is based on the ARMv7-A&R Instruction Set Architecture (ISA). To provide a performance comparison, we first design an in-order superscalar processor, since ARM GPPs always adopt superscalar approaches. We then add the VLIW dispatch method to this processor to realize the fused microarchitecture. Both designs are evaluated on a Xilinx 7-series FPGA (XC7K325T-2FFG900C) using the Xilinx Vivado design suite. The results show that, compared with the superscalar processor, the processor working in VLIW mode improves performance by 15% and 8% when running the EEMBC and DSPstone benchmarks, respectively. We also run the two benchmarks on the ARM Cortex-A9 processor integrated in the Zynq-7000 AP SoC device on the Xilinx ZC706 evaluation board. The processor in VLIW mode shows 44% and 30% performance improvements over the ARM Cortex-A9. The fused microarchitecture adopts a combined bimodal and PAp branch prediction method, which achieves 93.7% prediction accuracy with limited hardware overhead.
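As a rough illustration of what a combined bimodal and PAp branch predictor can look like, the Python sketch below pairs a bimodal table of 2-bit saturating counters with a per-address two-level (PAp) predictor and selects between them with a chooser table. The abstract does not say how the two predictors are combined, so the chooser policy, table sizes, history length, and indexing used here are all assumptions, and the names `CombinedPredictor`, `bump`, and `run` are hypothetical.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter (0..3) one step up or down."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class CombinedPredictor:
    """Toy bimodal + PAp predictor with a per-branch chooser (assumed design)."""

    def __init__(self, entries=64, hist_bits=4):
        self.entries = entries
        self.hist_mask = (1 << hist_bits) - 1
        # Bimodal: one 2-bit counter per entry, initialised weakly not-taken.
        self.bimodal = [1] * entries
        # PAp: a per-address history register and a per-address pattern table.
        self.history = [0] * entries
        self.pattern = [[1] * (1 << hist_bits) for _ in range(entries)]
        # Chooser: values >= 2 select PAp, values < 2 select bimodal.
        self.chooser = [1] * entries

    def _index(self, pc):
        # Assumes 4-byte aligned instruction addresses.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        i = self._index(pc)
        bimodal_taken = self.bimodal[i] >= 2
        pap_taken = self.pattern[i][self.history[i]] >= 2
        return pap_taken if self.chooser[i] >= 2 else bimodal_taken

    def update(self, pc, taken):
        i = self._index(pc)
        h = self.history[i]
        bimodal_ok = (self.bimodal[i] >= 2) == taken
        pap_ok = (self.pattern[i][h] >= 2) == taken
        # Train the chooser only when the two components disagree.
        if bimodal_ok != pap_ok:
            self.chooser[i] = bump(self.chooser[i], pap_ok)
        self.bimodal[i] = bump(self.bimodal[i], taken)
        self.pattern[i][h] = bump(self.pattern[i][h], taken)
        self.history[i] = ((h << 1) | int(taken)) & self.hist_mask


def run(predictor, trace):
    """Feed (pc, taken) pairs to the predictor; return the number predicted correctly."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct
```

In this sketch, a strictly alternating branch defeats the bimodal counter but is captured by the per-address history, so the chooser drifts toward PAp for that branch, while a steadily taken branch stays with the cheaper bimodal entry.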


Author(s):  
Sanket Dessai ◽  
Krishna Bhushan Vutukuru

Graphics Processing Units (GPUs) have become an integral part of today’s mainstream computing systems. They are also used as reprogrammable General Purpose GPUs (GP-GPUs) to perform complex scientific computations. Reconfigurability is an attractive approach for embedded systems because it allows hardware-level modification. Hence, there is high demand for GPU designs based on reconfigurable hardware. A stream processor consists of clusters of functional units that provide a bandwidth hierarchy, supporting hundreds of arithmetic units. The arithmetic cluster units are designed to exploit instruction-level parallelism and subword parallelism within a cluster and data parallelism across clusters. To reduce area and power, a single controller is used to control data flow between clusters and between the host processor and the GPU. The stream processor unit was designed in Verilog on Altera Quartus II and simulated using ModelSim tools. The functionality of the modelled blocks is verified using test inputs in the simulator. At 90 MHz, the simulated execution time is 60 ps for the 8-bit pipelined multiplier and 100 ns for the 8-bit pipelined adder.


2020 ◽  
Vol 43 (2-3) ◽  
pp. 89-108
Author(s):  
Angelo Fraietta ◽  
Oliver Bown ◽  
Sam Ferguson ◽  
Sam Gillespie ◽  
Liam Bray

This article introduces an open-source Java-based programming environment for creative coding of agglomerative systems using Internet-of-Things (IoT) technologies. Our software originally focused on digital signal processing of audio—including synthesis, sampling, granular sample playback, and a suite of basic effects—but composers now use it to interface with sensors and peripherals through general-purpose input/output and external networked systems. This article examines and addresses the strategies required to integrate novel embedded musical interfaces and creative coding paradigms through an IoT infrastructure. These include: the use of advanced tooling features of a professional integrated development environment as a composition or performance interface rather than just as a compiler; techniques to create media works using features such as autodetection of sensors; seamless and serverless communication among devices on the network; and uploading, updating, and running of new compositions to the device without interruption. Furthermore, we examined the difficulties many novice programmers experience when learning to write code, and we developed strategies to address these difficulties without restricting the potential available in the coding environment. We also examined and developed methods to monitor and debug devices over the network, allowing artists and programmers to set and retrieve current variable values to or from these devices during the performance and composition stages. Finally, we describe three types of art work that demonstrate how the software, called HappyBrackets, is being used in live-coding and dance performances, in interactive sound installations, and as an advanced composition and performance tool for multimedia works.


2013 ◽  
Vol 427-429 ◽  
pp. 2822-2825
Author(s):  
Chang Qin Yan ◽  
Yan Yan Yu ◽  
Qian Huang ◽  
Jun Yang

Superscalar pipelining is an advanced technique for improving instruction-level parallelism and is widely used in computer central processors and graphics accelerators. In this paper, we take advantage of the inherent strengths of CPLD devices, such as flexibility, usability, and predictability, to implement superscalar pipelining, designing and constructing a model superscalar pipelined processor based on a RISC instruction set. Using EDA technology with a top-down design method, we present hardware verification and performance test results for the processor model and explore ideas and methods for processor design with EDA technology.


2011 ◽  
Vol 2011 ◽  
pp. 1-13 ◽  
Author(s):  
Mateus B. Rutzig ◽  
Antonio C. S. Beck ◽  
Felipe Madruga ◽  
Marco A. Alves ◽  
Henrique C. Freitas ◽  
...  

Limits of instruction-level parallelism and higher transistor density sustain the increasing need for multiprocessor systems: they are rapidly taking over both general-purpose and embedded processor domains. Current multiprocessing systems are composed either of many homogeneous and simple cores or of complex superscalar, simultaneous multithread processing elements. As parallel applications are becoming increasingly present in embedded and general-purpose domains and multiprocessing systems must handle a wide range of different application classes, there is no consensus over which hardware solutions best exploit instruction-level parallelism (ILP) and thread-level parallelism (TLP) together. Therefore, in this work, we have expanded the DIM (dynamic instruction merging) technique to be used in a multiprocessing scenario, proving the need for adaptable ILP exploitation even in TLP architectures. We have successfully coupled a dynamic reconfigurable system to a SPARC-based multiprocessor and obtained performance gains of up to 40%, even for applications that show a great level of parallelism at thread level.


Author(s):  
Maman Abdurohman

The DLX processor is a RISC (Reduced Instruction Set Computer) based processor designed as a general-purpose processor. It has a load-store architecture in which all instructions are 32 bits long. Each instruction is executed over several clock cycles. In general, execution takes five stages: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB). These five stages are carried out sequentially [2]. As a multicycle processor, DLX has room to improve its performance, which is measured by processing speed expressed as CPU time. The performance of the DLX processor can be improved by applying pipelining. This paper analyzes the performance improvement of the DLX processor obtained with pipelining. Tests were run on several application programs executed with and without pipelining. In general, speed increased for every set of instructions analyzed. The tests were carried out with windlx, a simulator for the DLX processor.
Keywords: DLX processor, RISC, general purpose processor, CPU, pipeline, windlx
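To make the expected gain from pipelining a five-stage DLX-style machine concrete, the Python sketch below compares idealized cycle counts with and without pipelining. It assumes one clock cycle per stage and, by default, no stalls or hazards; these simplifications are not taken from the abstract, and the function names are illustrative.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")


def cycles_unpipelined(n_instructions, n_stages=len(STAGES)):
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages=len(STAGES), stall_cycles=0):
    """Ideal pipeline: fill once, then retire one instruction per cycle.

    stall_cycles adds the total cycles lost to hazards (an assumed input).
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


def speedup(n_instructions, stall_cycles=0):
    """Ratio of unpipelined to pipelined cycle counts for n_instructions >= 1."""
    if n_instructions < 1:
        raise ValueError("n_instructions must be at least 1")
    return cycles_unpipelined(n_instructions) / cycles_pipelined(
        n_instructions, stall_cycles=stall_cycles
    )
```

Under these assumptions, the speedup approaches the number of stages as the instruction count grows, and any stall cycles reduce it, which is why measured gains on real programs fall below the ideal factor of five.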


2019 ◽  
pp. 60-63
Author(s):  
S. A. Koltakov ◽  
A. A. Cherepnev

The article describes a hardware-software complex (HSC) based on a debugging stand, including its composition, modules, and operation. A method for synthesizing the output signal is described, and a formula and a table of parameters for its calculation are given. Signals and spectra at the input and output of the developed HSC are shown. Performance figures are obtained for several HSC variants: one based on a signal processor with a general-purpose processor, and two based on general-purpose processors. The proposed HSC is 2–3 times faster than the HSC based on an Intel general-purpose processor. This is achieved through the use of modern programming methods and tools, digital signal processing modules, and optimization of the executable code. Recommendations are given for further improvement of the proposed complex, which is possible through the use of modern FPGAs and a high-speed interface.


2007 ◽  
Vol 16 (06) ◽  
pp. 911-927
Author(s):  
S. MAJZOUB ◽  
H. DIAB

Reconfigurable systems represent a middle ground between speed and flexibility in the processor design world. They provide performance close to that of custom hardware yet preserve some of the flexibility of a general-purpose processor. Recently, the area of reconfigurable computing has received considerable interest in both of its forms: FPGAs and coarse-grain hardware. Since the field is still in its developing stage, it is important to perform hardware analysis and evaluation of certain key applications on target reconfigurable architectures to identify potential limitations and improvements. This paper presents the mapping and performance analysis of two encryption algorithms, namely Rijndael and Twofish, on a coarse-grain reconfigurable platform, namely MorphoSys. MorphoSys is a reconfigurable architecture targeted at multimedia applications. Since many cryptographic algorithms involve bitwise operations, a bitwise instruction set extension was proposed to enhance performance. We present the details of the mapping of the bitwise operations involved in the algorithms, with thorough analysis. The methodology we used can be applied to other systems.


2012 ◽  
Vol 4 (3) ◽  
pp. 48-62
Author(s):  
Slo-Li Chu ◽  
Chih-Chieh Hsiao

Heterogeneous platforms that consist of a CPU and add-on streaming processors are widely used in modern computer systems. These add-on processors provide substantially more computation capability and memory bandwidth than conventional multi-core platforms. General-purpose computations can also be offloaded onto these add-on processors. Realizing their potential performance is challenging, however, because their underlying architectural characteristics are diverse. Several optimization techniques are applied on OpenCL-compatible heterogeneous platforms to achieve thread-level, data-level, and instruction-level parallelism. The architectural implications of these techniques and optimization principles are discussed. Finally, a case study of the MRI-Q benchmark illustrates the capabilities of these optimization techniques. The experimental results reveal that the speedup from the non-optimized to the optimized kernel varies from 8 to 63 across target platforms.

