scholarly journals Performance analysis of general purpose and digital signal processor kernels for heterogeneous systems-on-chip

2003 ◽  
Vol 1 ◽  
pp. 171-175
Author(s):  
T. von Sydow ◽  
H. Blume ◽  
T. G. Noll

Abstract. Various reasons like technology progress, flexibility demands, shortened product cycle time and shortened time to market have brought up the possibility and necessity to integrate different architecture blocks on one heterogeneous System-on-Chip (SoC). Architecture blocks like programmable processor cores (DSP- and GPP-kernels), embedded FPGAs as well as dedicated macros will be integral parts of such a SoC. Especially programmable architecture blocks and associated optimization techniques are discussed in this contribution. Design space exploration and thus the choice which architecture blocks should be integrated in a SoC is a challenging task. Crucial to this exploration is the evaluation of the application domain characteristics and the costs caused by individual architecture blocks integrated on a SoC. An ATE-cost function has been applied to examine the performance of the aforementioned programmable architecture blocks. Therefore, representative discrete devices have been analyzed. Furthermore, several architecture dependent optimization steps and their effects on the cost ratios are presented.

2003 ◽  
Vol 1 ◽  
pp. 165-169 ◽  
Author(s):  
H. T. Feldkaemper ◽  
H. Blume ◽  
T. G. Noll

Abstract. One of the most challenging design issues for next generations of (mobile) communication systems is fulfilling the computational demands while finding an appropriate trade-off between flexibility and implementation aspects, especially power consumption. Flexibility of modern architectures is desirable, e.g. concerning adaptation to new standards and reduction of time-to-market of a new product. Typical target architectures for future communication systems include embedded FPGAs, dedicated macros as well as programmable digital signal and control oriented processor cores as each of these has its specific advantages. These will be integrated as a System-on-Chip (SoC). For such a heterogeneous architecture a design space exploration and an appropriate partitioning plays a crucial role. On the exemplary vehicle of a Viterbi decoder as frequently used in communication systems we show which costs in terms of ATE complexity arise implementing typical components on different types of architecture blocks. A factor of about seven orders of magnitude spans between a physically optimised implementation and an implementation on a programmable DSP kernel. An implementation on an embedded FPGA kernel is in between these two representing an attractive compromise with high flexibility and low power consumption. Extending this comparison to further components, it is shown quantitatively that the cost ratio between different implementation alternatives is closely related to the operation to be performed. This information is essential for the appropriate partitioning of heterogeneous systems.


Electronics ◽  
2018 ◽  
Vol 7 (10) ◽  
pp. 243 ◽  
Author(s):  
Padmanabhan Balasubramanian ◽  
Douglas Maskell ◽  
Nikos Mastorakis

Adder is an important datapath unit of a general-purpose microprocessor or a digital signal processor. In the nanoelectronics era, the design of an adder that is modular and which can withstand variations in process, voltage and temperature are of interest. In this context, this article presents a new robust early output asynchronous block carry lookahead adder (BCLA) with redundant carry logic (BCLARC) that has a reduced power-cycle time product (PCTP) and is a low power design. The proposed asynchronous BCLARC is implemented using the delay-insensitive dual-rail code and adheres to the 4-phase return-to-zero (RTZ) and the 4-phase return-to-one (RTO) handshaking. Many existing asynchronous ripple-carry adders (RCAs), carry lookahead adders (CLAs) and carry select adders (CSLAs) were implemented alongside to perform a comparison based on a 32/28 nm complementary metal-oxide-semiconductor (CMOS) technology. The 32-bit addition was considered for an example. For implementation using the delay-insensitive dual-rail code and subject to the 4-phase RTZ handshaking (4-phase RTO handshaking), the proposed BCLARC which is robust and of early output type achieves: (i) 8% (5.7%) reduction in PCTP compared to the optimum RCA, (ii) 14.9% (15.5%) reduction in PCTP compared to the optimum BCLARC, and (iii) 26% (25.5%) reduction in PCTP compared to the optimum CSLA.


2021 ◽  
Vol 27 (3) ◽  
pp. 57-70
Author(s):  
Damjan M. Rakanovic ◽  
Vuk Vranjkovic ◽  
Rastislav J. R. Struharik

Paper proposes a two-step Convolutional Neural Network (CNN) pruning algorithm and resource-efficient Field-programmable gate array (FPGA) CNN accelerator named “Argus”. The proposed CNN pruning algorithm first combines similar kernels into clusters, which are then pruned using the same regular pruning pattern. The pruning algorithm is carefully tailored for FPGAs, considering their resource characteristics. Regular sparsity results in high Multiply-accumulate (MAC) efficiency, reducing the amount of logic required to balance workloads among different MAC units. As a result, the Argus accelerator requires about 170 Look-up tables (LUTs) per Digital Signal Processor (DSP) block. This number is close to the average LUT/DPS ratio for various FPGA families, enabling balanced resource utilization when implementing Argus. Benchmarks conducted using Xilinx Zynq Ultrascale + Multi-Processor System-on-Chip (MPSoC) indicate that Argus is achieving up to 25 times higher frames per second than NullHop, 2 and 2.5 times higher than NEURAghe and Snowflake, respectively, and 2 times higher than NVDLA. Argus shows comparable performance to MIT’s Eyeriss v2 and Caffeine, requiring up to 3 times less memory bandwidth and utilizing 4 times fewer DSP blocks, respectively. Besides the absolute performance, Argus has at least 1.3 and 2 times better GOP/s/DSP and GOP/s/Block-RAM (BRAM) ratios, while being competitive in terms of GOP/s/LUT, compared to some of the state-of-the-art solutions.


2018 ◽  
Vol 17 (2) ◽  
pp. 393-415 ◽  
Author(s):  
Stefania Perri ◽  
Fabio Frustaci ◽  
Fanny Spagnolo ◽  
Pasquale Corsonello

2021 ◽  
Vol 26 (2) ◽  
pp. 172-183
Author(s):  
E.S. Yanakova ◽  
◽  
G.T. Macharadze ◽  
L.G. Gagarina ◽  
A.A. Shvachko ◽  
...  

A turn from homogeneous to heterogeneous architectures permits to achieve the advantages of the efficiency, size, weight and power consumption, which is especially important for the built-in solutions. However, the development of the parallel software for heterogeneous computer systems is rather complex task due to the requirements of high efficiency, easy programming and the process of scaling. In the paper the efficiency of parallel-pipelined processing of video information in multiprocessor heterogeneous systems on a chip (SoC) such as DSP, GPU, ISP, VDP, VPU and others, has been investigated. A typical scheme of parallel-pipelined processing of video data using various accelerators has been presented. The scheme of the parallel-pipelined video data on heterogeneous SoC 1892VM248 has been developed. The methods of efficient parallel-pipelined processing of video data in heterogeneous computers (SoC), consisting of the operating system level, programming technologies level and the application level, have been proposed. A comparative analysis of the most common programming technologies, such as OpenCL, OpenMP, MPI, OpenAMP, has been performed. The analysis has shown that depend-ing on the device finite purpose two programming paradigms should be applied: based on OpenCL technology (for built-in system) and MPI technology (for inter-cell and inter processor interaction). The results obtained of the parallel-pipelined processing within the framework of the face recognition have confirmed the effectiveness of the chosen solutions.


Author(s):  
Markeljan Fishta ◽  
Franco Fiori

Abstract$$\varDelta \varSigma $$ Δ Σ analog-to-digital converters (ADCs) are largely used in sensor acquisition applications. In the last few years, standalone $$\varDelta \varSigma $$ Δ Σ modulators have become increasingly available as off-the-shelf parts. To build a complete ADC, a standalone modulator has to be paired with some advanced elaboration unit, such as a field programmable gate array (FPGA) or a digital signal processor (DSP), which is needed for the implementation of the decimation filter. This work investigates the use of low-cost, general-purpose microcontrollers for the decimation of $$\varDelta \varSigma $$ Δ Σ -modulated signals. The main challenge is given by the clock frequency of the modulator, which can be in the range of a few $$\hbox {MHz}$$ MHz . The proposed technique deals with this limitation by employing two serial peripheral interface (SPI) modules in a time-interleaved configuration. This approach allows for continuous acquisition and elaboration of relatively high-speed, digital signals. The technique has been applied to a case study, and a data conversion system has been practically realized. The performance of the proposed filter is compared to that of a digital filter, present on board a commercial microcontroller, and the results of experimental tests are provided.


2016 ◽  
Vol 25 (04) ◽  
pp. 1650027 ◽  
Author(s):  
Kore Sagar Dattatraya ◽  
Belgudri Ritesh Appasaheb ◽  
Ramdas Bhanudas Khaladkar ◽  
V. S. Kanchana Bhaaskaran

Multiplier forms the core building block of any processor, such as the digital signal processor (DSP) and a general purpose microprocessor. As the word length increases, the number of adders or compressors required for the partial product addition also increases. The addition operation of the derived partial products determines the circuit latency, area and speed performance of wider word-length multipliers. Binary count multiplier (BCM) aims to reduce the number of adders and compressors through the use of a uniquely structured binary counter and by suitably altering the logical flow of partial product addition by using binary adders is proposed in this paper. The binary counters for varying bit count values are derived by modifying the basic 4:2 compressor circuit. A [Formula: see text] bit multiplier has been developed to validate the proposed computation method. This logic structure demonstrates lower power operation, reduced device count and lesser delay in comparison against the conventional Wallace tree multiplier structure found in the literature. The BCM implementation realizes 29.17% reduction in the device count, 66% reduction in the delay and 70% reduction in the power dissipation. Furthermore, it realizes 90% reduction in the power delay product (PDP) in comparison against the Wallace tree structure. The multiplier circuits have been implemented and the validation of results has been carried out using Cadence[Formula: see text] EDA tool. Forty five nanometer technology files have been employed for the designs and exhaustive SPICE simulations.


Sign in / Sign up

Export Citation Format

Share Document