processor array
Recently Published Documents


TOTAL DOCUMENTS: 326 (five years: 3)
H-INDEX: 18 (five years: 0)

2021 ◽ Vol 20 (5) ◽ pp. 1-31
Author(s): Michael Witterauf, Dominik Walter, Frank Hannig, Jürgen Teich

Tightly Coupled Processor Arrays (TCPAs), a class of massively parallel loop accelerators, allow applications to offload computationally expensive loops for improved performance and energy efficiency. To achieve these two goals, executing a loop on a TCPA requires an efficient generation of specific programs as well as other configuration data for each distinct combination of loop bounds and number of available processing elements (PEs). Since both these parameters are generally unknown at compile time—the number of available PEs due to dynamic resource management, and the loop bounds because they depend on the problem size—both the programs and configuration data must be generated at runtime. However, pure just-in-time compilation is impractical, because mapping a loop program onto a TCPA entails solving multiple NP-complete problems. As a solution, this article proposes a unique mixed static/dynamic approach called symbolic loop compilation. It is shown that at compile time, the NP-complete problems (modulo scheduling, register allocation, and routing) can still be solved to optimality in a symbolic way, resulting in a so-called symbolic configuration, a space-efficient intermediate representation parameterized in the loop bounds and number of PEs. This phase is called symbolic mapping. At runtime, for each requested accelerated execution of a loop program with given loop bounds and known number of available PEs, a concrete configuration, including PE programs and configuration data for all other components, is generated from the symbolic configuration according to these parameter values. This phase is called instantiation. We describe both phases in detail and show that instantiation runs in polynomial time, with its most complex step, program instantiation, not directly depending on the number of PEs and thus scaling to arbitrary sizes of TCPAs.
To validate the efficiency of this mixed static/dynamic compilation approach, we apply symbolic loop compilation to a set of real-world loop programs from several domains, measuring both compilation time and space requirements. Our experiments confirm that a symbolic configuration is a space-efficient representation suited for systems with little memory—in many cases, a symbolic configuration is smaller than even a single concrete configuration instantiated from it—and that the times for the runtime phase of program instantiation and configuration loading are negligible and, moreover, independent of the size of the available processor array. To give an example, instantiating a configuration for a matrix-matrix multiplication benchmark takes equally long for 4×4 and 32×32 PEs.
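The two-phase scheme described in the abstract can be illustrated with a minimal sketch (all names and the template format are invented for illustration; the actual TCPA toolchain is far more involved): a symbolic configuration is here a program template whose operands may be expressions over the loop bound N and PE count P, and instantiation merely substitutes the concrete values, which is why its cost tracks the template size rather than the array size.

```python
from dataclasses import dataclass

@dataclass
class SymbolicConfiguration:
    # Compile-time artifact: instructions whose operands may be symbolic
    # expressions over the loop bound N and the number of PEs P.
    template: list

def symbolic_mapping():
    # Stand-in for the expensive compile-time phase: the NP-complete steps
    # (modulo scheduling, register allocation, routing) are solved once,
    # leaving N and P as free parameters in the result.
    return SymbolicConfiguration(template=[
        ("set_trip_count", "N // P"),  # iterations assigned to each PE
        ("set_stride", "P"),           # step between assigned iterations
        ("halt", 0),
    ])

def instantiate(sym, N, P):
    # Runtime phase: substitute the now-known parameter values. A single
    # pass over the template, so the cost does not grow with the array size.
    env = {"N": N, "P": P}
    return [(op, eval(arg, {}, env) if isinstance(arg, str) else arg)
            for op, arg in sym.template]
```

Instantiating the same symbolic configuration for a 4×4 and a 32×32 array touches the same number of template entries, mirroring the scaling behavior the abstract reports.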


2021
Author(s): Hector Castillo-Elizalde, Yanan Liu, Laurie Bose, Walterio Mayol-Cuevas

2021
Author(s): Yanan Liu, Laurie Bose, Colin Greatwood, Jianing Chen, Rui Fan, ...

Author(s): Stephen J. Carey, Laurie Bose, Thomas Richardson, Walterio Mayol-Cuevas, Jianing Chen, ...

Sensor Review ◽ 2020 ◽ Vol 40 (4) ◽ pp. 521-528
Author(s): Ahmad Reza Danesh, Mehdi Habibi

Purpose
The purpose of this paper is to design a kernel convolution processor. High-speed image processing is a challenging task for real-time applications such as product quality control on manufacturing lines. Smart image sensors use an array of in-pixel processors to facilitate high-speed, real-time image processing. These sensors are usually used to perform the initial low-level bulk image filtering and enhancement.
Design/methodology/approach
In this paper, a convolution image processor is presented that uses pulse-width-modulated signals and regular nearest-neighbor interconnections. The presented processor not only handles kernels of arbitrary size but also accepts arbitrary positive or negative floating-point kernel coefficients.
Findings
The performance of the proposed architecture is evaluated on a Xilinx Virtex-7 field-programmable gate array platform. The peak signal-to-noise ratio metric is used to measure the computation error for different images, filters, and illuminations. Finally, the power consumption of the circuit under different operating conditions is presented.
Originality/value
The presented processor array can be used for high-speed kernel convolution image processing tasks, including arbitrary-size edge detection and sharpening functions, which require negative and fractional kernel values.
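As a software reference for the operation such an array accelerates, the following sketch (plain Python, hypothetical and unrelated to the paper's hardware implementation) performs same-size 2-D kernel filtering with zero padding, using a sharpening kernel whose negative and fractional coefficients are exactly the cases the proposed processor supports in hardware:

```python
def convolve2d(image, kernel):
    # Same-size 2-D filtering with zero padding at the borders
    # (correlation form; identical to convolution for symmetric kernels).
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    iy, ix = y + ky - ph, x + kx - pw
                    if 0 <= iy < h and 0 <= ix < w:  # zero padding
                        acc += kernel[ky][kx] * image[iy][ix]
            out[y][x] = acc
    return out

# Sharpening kernel with negative, fractional coefficients.
sharpen = [[ 0.0, -0.5,  0.0],
           [-0.5,  3.0, -0.5],
           [ 0.0, -0.5,  0.0]]
```

On a uniform image this kernel leaves interior pixels unchanged (the coefficients sum to 1), while the in-pixel hardware version performs the same multiply-accumulate per pixel in parallel across the array.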


2020 ◽ Vol 138 ◽ pp. 32-47
Author(s): Aaron Stillmaker, Brent Bohnenstiehl, Lucas Stillmaker, Bevan Baas

2019 ◽ Vol 28 (07) ◽ pp. 1950111
Author(s): Jigang Wu, Yalan Wu, Guiyuan Jiang, Siew Kei Lam

This paper investigates techniques for constructing a high-quality target processor array (a fault-free logical subarray) from a physical array with faulty processing elements (PEs), where a fixed number of spare PEs are pre-integrated to replace faulty ones when necessary. A reconfiguration algorithm is developed based on our proposed novel shifting operations, which efficiently select proper spare PEs to replace the faulty ones. The initial target array is then further refined by a carefully designed tabu search algorithm. We also consider the problem of constructing a fault-free subarray of a given size, rather than the original size, which is often required in energy-efficient MPSoC design. We propose two efficient heuristic algorithms that construct target arrays of given sizes by leveraging a sliding window on the physical array. Simulation results show that the improvements of the proposed algorithms over the state of the art are [Formula: see text] and [Formula: see text] in terms of congestion factor and distance factor, respectively, for the case in which all faulty PEs can be replaced using the spare ones. For the case of finding a [Formula: see text] target array on a [Formula: see text] host array, the proposed heuristic algorithm reduces running time by up to [Formula: see text] while the solution quality remains nearly unchanged, in comparison with the baseline algorithms.
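The sliding-window idea for building a target array of a given size can be sketched as follows (a simplified illustration under invented names, not the paper's algorithm: it merely slides an r × c window over the host array and picks the position enclosing the fewest faulty PEs, which the spares would then have to cover):

```python
def best_window(faulty, M, N, r, c):
    # Slide an r x c window over an M x N host array; return the top-left
    # corner enclosing the fewest faulty PEs, plus that fault count.
    # `faulty` is a set of (row, col) coordinates of faulty PEs.
    best_pos, best_count = None, r * c + 1
    for i in range(M - r + 1):
        for j in range(N - c + 1):
            count = sum(1 for fi, fj in faulty
                        if i <= fi < i + r and j <= fj < j + c)
            if count < best_count:
                best_pos, best_count = (i, j), count
    return best_pos, best_count
```

A real reconfiguration algorithm would additionally verify that the remaining faults inside the chosen window can actually be repaired by shifting in spare PEs, and then refine the result, as the paper does with tabu search.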


2019 ◽ Vol 2019 ◽ pp. 1-11
Author(s): Awos Kanan, Fayez Gebali, Atef Ibrahim, Kin Fun Li

Processor array architectures have been employed as accelerators to compute the similarity distances found in a variety of data mining algorithms. However, most of the architectures proposed in the existing literature are designed in an ad hoc manner without taking into consideration the size and dimensionality of the datasets. Furthermore, data dependencies have not been analyzed, and often only one design choice is considered for the scheduling and mapping of computational tasks. In this work, we present a systematic methodology to design scalable and area-efficient linear (1-D) processor arrays for the computation of similarity distance matrices. Six possible design options are obtained and analyzed in terms of area and time complexities. The obtained architectures provide the flexibility to choose the one that meets hardware constraints for a specific problem size. Comparisons with previously reported architectures demonstrate that one of the proposed architectures achieves a smaller area and area-delay product, in addition to being scalable to high-dimensional data.
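To make the mapping problem concrete, here is a small software emulation (hypothetical; the paper derives its six schedules formally from the dependence analysis) of one possible scheduling choice for a linear array: each of P PEs computes a cyclically assigned subset of rows of the pairwise squared-Euclidean distance matrix.

```python
def distance_matrix_linear(data, P):
    # Emulate a 1-D array of P PEs: PE k computes rows k, k+P, k+2P, ...
    # of the pairwise squared-Euclidean distance matrix. This cyclic
    # row-to-PE mapping is just one of several possible schedules.
    n = len(data)
    D = [[0.0] * n for _ in range(n)]
    for k in range(P):              # each PE in the linear array
        for i in range(k, n, P):    # rows assigned to PE k
            for j in range(n):
                D[i][j] = float(sum((a - b) ** 2
                                    for a, b in zip(data[i], data[j])))
    return D
```

Changing the row-to-PE assignment (block instead of cyclic) or the loop order yields alternative scheduling and mapping choices of the kind the methodology enumerates, each with different area and time trade-offs.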

