A Compiler-assisted locality aware CTA Mapping Scheme

10.29007/55pq ◽  
2019 ◽  
Author(s):  
Lifeng Liu ◽  
Meilin Liu ◽  
Chongjun Wang

The general-purpose GPU (GPGPU) is an effective many-core architecture that can yield high throughput for many scientific applications with thread-level parallelism. However, several challenges still limit further performance improvements and make GPU programming difficult for programmers who lack knowledge of the GPU hardware architecture. In this paper, we design a compiler-assisted locality-aware CTA (cooperative thread array) mapping scheme for GPUs that takes advantage of inter-CTA data reuse in GPU kernels. Using data reuse analysis based on the polyhedron model, we detect inter-CTA data reuse patterns in the GPU kernels and control the CTA mapping pattern to improve data locality on each SM. The compiler-assisted locality-aware CTA mapping scheme can also be combined with the programmable warp scheduler to further improve performance. The experimental results show that our CTA mapping algorithm improves the overall performance of the input GPU programs by 23.3% on average, and by 56.7% when combined with the programmable warp scheduler.
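
The effect of controlling the CTA-to-SM mapping can be illustrated with a small Python model. This is a minimal sketch, not the authors' algorithm: it assumes a 2D CTA grid in which all CTAs in the same grid row reuse the same data, and it assumes the baseline dispatch is a simple round-robin over the linearised CTA index (real hardware dispatch order is not specified in the abstract). The function names `round_robin_mapping`, `row_clustered_mapping`, and `sms_per_row` are hypothetical.

```python
def round_robin_mapping(grid_x, grid_y, num_sms):
    """Assumed baseline: CTAs go to SMs round-robin by linear CTA index."""
    return {(x, y): (y * grid_x + x) % num_sms
            for y in range(grid_y) for x in range(grid_x)}


def row_clustered_mapping(grid_x, grid_y, num_sms):
    """Assumed locality-aware mapping: every CTA of a grid row goes to one SM,
    because this sketch assumes the reuse is between CTAs of the same row."""
    return {(x, y): y % num_sms
            for y in range(grid_y) for x in range(grid_x)}


def sms_per_row(mapping, grid_x, grid_y):
    """Number of distinct SMs each grid row is spread over.
    A lower number means the data shared by a row stays on fewer SMs."""
    return [len({mapping[(x, y)] for x in range(grid_x)})
            for y in range(grid_y)]


# A 4x4 CTA grid on 4 SMs.
baseline = round_robin_mapping(4, 4, 4)
clustered = row_clustered_mapping(4, 4, 4)
print(sms_per_row(baseline, 4, 4))
print(sms_per_row(clustered, 4, 4))
```

In this toy model the round-robin baseline scatters each row over every SM, while the clustered mapping keeps each row on a single SM, which is the kind of locality the abstract describes the compiler analysis as exposing.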

2018 ◽  
Vol 28 (02) ◽  
pp. 1950020 ◽  
Author(s):  
Yumin Hou ◽  
Xu Wang ◽  
Jiawei Fu ◽  
Junping Ma ◽  
Hu He ◽  
...  

To expand the digital signal processing capability of a General Purpose Processor (GPP), we propose a fused microarchitecture that improves Instruction Level Parallelism (ILP) by supporting both in-order superscalar and very long instruction word (VLIW) dispatch methods in a single pipeline. The design is based on the ARMv7-A&R Instruction Set Architecture (ISA). To provide a performance comparison, we first design an in-order superscalar processor, considering that ARM GPPs always adopt superscalar approaches. We then add a VLIW dispatch method to this processor to realize the fused microarchitecture. Both designs are evaluated on a Xilinx 7-series FPGA (XC7K325T-2FFG900C) using the Xilinx Vivado design suite. The results show that, compared with the superscalar processor, the processor working in VLIW mode improves performance by 15% and 8%, respectively, when running the EEMBC and DSPstone benchmarks. We also run the two benchmarks on the ARM Cortex-A9 processor integrated in the Zynq-7000 AP SoC device on the Xilinx ZC706 evaluation board. The processor in VLIW mode shows 44% and 30% performance improvements over the ARM Cortex-A9. The fused microarchitecture adopts a combined bimodal and PAp branch prediction method, which achieves 93.7% prediction accuracy with limited hardware overhead.
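
The abstract does not say how the bimodal and PAp predictors are combined, so the following Python sketch is only one plausible arrangement: a per-branch 2-bit chooser that selects between the two components, in the style of a tournament predictor. The table sizes, history length, and class names are all assumptions made for illustration.

```python
def bump(counter, taken):
    """Update a 2-bit saturating counter (0..3) toward taken / not taken."""
    return min(3, counter + 1) if taken else max(0, counter - 1)


class Bimodal:
    """One 2-bit counter per branch address."""
    def __init__(self, size=64):
        self.size = size
        self.counters = [1] * size  # start weakly not-taken

    def predict(self, pc):
        return self.counters[pc % self.size] >= 2

    def update(self, pc, taken):
        i = pc % self.size
        self.counters[i] = bump(self.counters[i], taken)


class PAp:
    """Per-address history register selecting into a per-address pattern table."""
    def __init__(self, size=64, history_bits=4):
        self.size = size
        self.mask = (1 << history_bits) - 1
        self.history = [0] * size
        self.patterns = [[1] * (1 << history_bits) for _ in range(size)]

    def predict(self, pc):
        i = pc % self.size
        return self.patterns[i][self.history[i]] >= 2

    def update(self, pc, taken):
        i = pc % self.size
        h = self.history[i]
        self.patterns[i][h] = bump(self.patterns[i][h], taken)
        self.history[i] = ((h << 1) | int(taken)) & self.mask


class Combined:
    """Assumed combination: a 2-bit chooser per branch picks PAp when >= 2."""
    def __init__(self, size=64):
        self.size = size
        self.bimodal, self.pap = Bimodal(size), PAp(size)
        self.chooser = [1] * size  # start weakly preferring bimodal

    def predict(self, pc):
        use_pap = self.chooser[pc % self.size] >= 2
        return self.pap.predict(pc) if use_pap else self.bimodal.predict(pc)

    def update(self, pc, taken):
        b_ok = self.bimodal.predict(pc) == taken
        p_ok = self.pap.predict(pc) == taken
        if b_ok != p_ok:  # train the chooser only when the components disagree
            i = pc % self.size
            self.chooser[i] = bump(self.chooser[i], p_ok)
        self.bimodal.update(pc, taken)
        self.pap.update(pc, taken)


def accuracy(predictor, pc, outcomes):
    """Fraction of outcomes predicted correctly for a single branch."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(outcomes)


# A branch that alternates taken / not taken defeats a lone 2-bit counter
# but is captured by the per-address history.
alternating = [i % 2 == 0 for i in range(200)]
combined_accuracy = accuracy(Combined(), pc=0x40, outcomes=alternating)
bimodal_accuracy = accuracy(Bimodal(), pc=0x40, outcomes=alternating)
```

On a strictly alternating branch, the bimodal component alone does poorly while the combined predictor settles on the history-based component after a short warm-up. The 93.7% figure in the abstract comes from the authors' own design and benchmarks and is not reproduced by this toy.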


Impact ◽  
2019 ◽  
Vol 2019 (10) ◽  
pp. 44-46
Author(s):  
Masato Edahiro ◽  
Masaki Gondo

The pace of technological advancement is ever-increasing, and intelligent systems, such as those found in robots and vehicles, have become larger and more complex. These intelligent systems have a heterogeneous structure, comprising a mixture of modules such as artificial intelligence (AI) and powertrain control modules that facilitate large-scale numerical calculation and real-time periodic processing functions. Information technology expert Professor Masato Edahiro, from the Graduate School of Informatics at Nagoya University in Japan, explains that concurrent advances in semiconductor research have led to the miniaturisation of semiconductors, allowing a greater number of processors to be mounted on a single chip and increasing potential processing power. 'In addition to general-purpose processors such as CPUs, a mixture of multiple types of accelerators such as GPGPU and FPGA has evolved, producing a more complex and heterogeneous computer architecture,' he says. Edahiro and his partners have been working on the eMBP, a model-based parallelizer (MBP) that offers a mapping system as an efficient way of automatically generating parallel code for multi- and many-core systems. Once the hardware description is written, eMBP can bridge the gap between software and hardware: hardware vendors gain an efficient ecosystem, and software vendors no longer need to adapt their code for each particular platform.


2017 ◽  
Vol 29 (70) ◽  
Author(s):  
María Del Carmen Zetina Rodríguez ◽  
Rutilio García Pereyra ◽  
Efraín Rangel Guzmán

Water management and the nation's resources: the Federal Board of Material Improvements, Ciudad Juarez, Chihuahua, 1931-1936. The government constituted the Federal Board of Material Improvements in order to manage and control the economic resources and the construction of public works at Mexico's borders and ports. The general purpose of this research was to analyze how this agency was established and operated in Ciudad Juarez, in the context of the centralization/federalization of the country's water resources, from 1931 to 1936; to this end, the historical archives were reviewed. One of the study's limitations was the lack of background information about the management of the water resources in this town. Its contribution broadens the scarce existing knowledge about the boards' functioning at the borders. Among the findings is that, in the municipality of Juarez, the loss of autonomy over water management was accompanied by a material and economic dispossession in which several government institutions and agencies participated.


Author(s):  
Guo Q. Huang ◽  
John A. Brandon

A main theme of concurrent engineering is the effective communication between relevant disciplines. Any computer tools for concurrent engineering must provide sufficient constructs and strategies for this purpose. This paper describes the AGENTS system, a domain-independent general-purpose Object-Oriented Prolog language for cooperating expert systems in concurrent engineering design. Emphasis is placed on demonstrating the use of the AGENTS constructs for distributed knowledge representation and the cooperation strategies for communication, collaboration, conflict resolution, and control. A simple case study is presented to illustrate the balance between simplicity and flexibility.


1958 ◽  
Vol 77 (6) ◽  
pp. 486-491
Author(s):  
W. V. K. Large ◽  
H. J. Michael

Electronics ◽  
2019 ◽  
Vol 8 (11) ◽  
pp. 1342
Author(s):  
Gianvito Urgese ◽  
Francesco Barchi ◽  
Emanuele Parisi ◽  
Evelina Forno ◽  
Andrea Acquaviva ◽  
...  

SpiNNaker is a neuromorphic globally asynchronous locally synchronous (GALS) multi-core architecture designed for simulating spiking neural networks (SNNs) in real time. Several studies have shown that neuromorphic platforms allow flexible and efficient SNN simulations by exploiting a communication infrastructure optimised for transmitting small packets across the many cores of the platform. However, the effectiveness of neuromorphic platforms in executing massively parallel general-purpose algorithms, while promising, is still to be explored. In this paper, we present a parallel DNA sequence matching algorithm, written using the MPI programming paradigm and ported to the SpiNNaker platform. In our implementation, all cores available on the board are configured to execute in parallel an optimised version of the Boyer-Moore (BM) algorithm. Using this application, we benchmarked the SpiNNaker platform in terms of scalability and synchronisation latency. Experimental results indicate that the SpiNNaker parallel architecture allows a linear performance increase with the number of cores used and shows better scalability than a general-purpose multi-core computing platform.
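
The data-parallel structure described above can be sketched in Python. This is an illustration under stated assumptions, not the paper's implementation: it uses the Boyer-Moore-Horspool simplification (bad-character shifts only) rather than the authors' optimised Boyer-Moore, and it processes the chunks in a sequential loop where an MPI program would hand each chunk to a separate rank. The function names `horspool_search` and `chunked_search` are hypothetical.

```python
def horspool_search(text, pattern):
    """Return every start index of pattern in text (Boyer-Moore-Horspool)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift for each character: distance from its last occurrence
    # (ignoring the final pattern position) to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Characters absent from the table allow a full-length shift.
        pos += shift.get(text[pos + m - 1], m)
    return hits


def chunked_search(text, pattern, num_workers):
    """Split text into num_workers chunks and search each independently.
    Each chunk is extended by len(pattern) - 1 characters so that matches
    straddling a chunk boundary are found exactly once."""
    m = len(pattern)
    chunk = max(1, -(-len(text) // num_workers))  # ceiling division
    hits = []
    for worker in range(num_workers):  # stand-in for one MPI rank per chunk
        start = worker * chunk
        end = min(len(text), start + chunk + m - 1)
        hits.extend(start + i for i in horspool_search(text[start:end], pattern))
    return hits


dna = "GATTACAGATTACA"
print(chunked_search(dna, "TTAC", 4))
```

The overlap of `len(pattern) - 1` characters is the key design choice: it lets each worker run without communicating with its neighbours, at the cost of rescanning a few characters per boundary.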

