Performance analysis of general purpose and digital signal processor kernels for heterogeneous systems-on-chip

Advances in Radio Science ◽

10.5194/ars-1-171-2003 ◽

2003 ◽

Vol 1 ◽

pp. 171-175

Author(s):

T. von Sydow ◽

H. Blume ◽

T. G. Noll

Keyword(s):

Digital Signal Processor ◽

Design Space Exploration ◽

Heterogeneous Systems ◽

Digital Signal ◽

General Purpose ◽

Optimization Techniques ◽

Product Cycle ◽

Systems On Chip ◽

Programmable Architecture ◽

On Chip

Abstract. Technology progress, flexibility demands, and shortened product cycles and time to market have made it both possible and necessary to integrate different architecture blocks on one heterogeneous System-on-Chip (SoC). Architecture blocks such as programmable processor cores (DSP and GPP kernels), embedded FPGAs and dedicated macros will be integral parts of such an SoC. This contribution focuses on programmable architecture blocks and associated optimization techniques. Design space exploration, and thus the choice of which architecture blocks to integrate in an SoC, is a challenging task. Crucial to this exploration is the evaluation of the application domain characteristics and of the costs caused by the individual architecture blocks integrated on an SoC. An ATE cost function has been applied to examine the performance of the aforementioned programmable architecture blocks. For this purpose, representative discrete devices have been analyzed. Furthermore, several architecture-dependent optimization steps and their effects on the cost ratios are presented.

Study of heterogeneous and reconfigurable architectures in the communication domain

Advances in Radio Science ◽

10.5194/ars-1-165-2003 ◽

2003 ◽

Vol 1 ◽

pp. 165-169 ◽

Cited By ~ 3

Author(s):

H. T. Feldkaemper ◽

H. Blume ◽

T. G. Noll

Keyword(s):

Power Consumption ◽

Communication Systems ◽

Design Space Exploration ◽

Heterogeneous Systems ◽

Digital Signal ◽

Cost Ratio ◽

Viterbi Decoder ◽

On Chip ◽

High Flexibility ◽

And Control

Abstract. One of the most challenging design issues for the next generations of (mobile) communication systems is meeting the computational demands while finding an appropriate trade-off between flexibility and implementation aspects, especially power consumption. Flexibility of modern architectures is desirable, e.g. for adapting to new standards and reducing the time to market of a new product. Typical target architectures for future communication systems include embedded FPGAs, dedicated macros, and programmable digital signal and control-oriented processor cores, as each of these has its specific advantages. These will be integrated as a System-on-Chip (SoC). For such a heterogeneous architecture, design space exploration and appropriate partitioning play a crucial role. Using a Viterbi decoder, as frequently used in communication systems, as an example, we show which costs in terms of ATE complexity arise when implementing typical components on different types of architecture blocks. The costs of a physically optimised implementation and of an implementation on a programmable DSP kernel differ by about seven orders of magnitude. An implementation on an embedded FPGA kernel lies between these two and represents an attractive compromise with high flexibility and low power consumption. Extending this comparison to further components, it is shown quantitatively that the cost ratio between different implementation alternatives is closely related to the operation to be performed. This information is essential for the appropriate partitioning of heterogeneous systems.
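The cost comparison described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' method: the abstract does not define the ATE cost, so the sketch assumes it is the product of area, time and energy, and the class name, field names and units are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Implementation:
    """One implementation alternative of a component (hypothetical fields)."""
    name: str
    area_mm2: float   # silicon area
    time_s: float     # time per operation
    energy_j: float   # energy per operation

    def ate_cost(self) -> float:
        # Assumption: ATE cost = area * time * energy.
        return self.area_mm2 * self.time_s * self.energy_j


def orders_of_magnitude(costly: Implementation, cheap: Implementation) -> float:
    """Return log10 of the ATE cost ratio between two alternatives."""
    return math.log10(costly.ate_cost() / cheap.ate_cost())
```

Under this assumed definition, a result of about 7 from `orders_of_magnitude` would correspond to the gap the abstract reports between a DSP kernel and a physically optimised implementation.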

Low Power Robust Early Output Asynchronous Block Carry Lookahead Adder with Redundant Carry Logic

Electronics ◽

10.3390/electronics7100243 ◽

2018 ◽

Vol 7 (10) ◽

pp. 243 ◽

Cited By ~ 8

Author(s):

Padmanabhan Balasubramanian ◽

Douglas Maskell ◽

Nikos Mastorakis

Keyword(s):

Low Power ◽

Digital Signal Processor ◽

Digital Signal ◽

Complementary Metal Oxide Semiconductor ◽

Low Power Design ◽

Cmos Technology ◽

General Purpose ◽

Metal Oxide Semiconductor ◽

Oxide Semiconductor ◽

Power Cycle

The adder is an important datapath unit of a general-purpose microprocessor or a digital signal processor. In the nanoelectronics era, the design of an adder that is modular and can withstand variations in process, voltage and temperature is of interest. In this context, this article presents a new robust early output asynchronous block carry lookahead adder (BCLA) with redundant carry logic (BCLARC) that has a reduced power-cycle time product (PCTP) and is a low power design. The proposed asynchronous BCLARC is implemented using the delay-insensitive dual-rail code and adheres to 4-phase return-to-zero (RTZ) and 4-phase return-to-one (RTO) handshaking. Many existing asynchronous ripple-carry adders (RCAs), carry lookahead adders (CLAs) and carry select adders (CSLAs) were implemented alongside it for comparison, based on a 32/28 nm complementary metal-oxide-semiconductor (CMOS) technology. A 32-bit addition was considered as an example. For implementation using the delay-insensitive dual-rail code and subject to 4-phase RTZ handshaking (4-phase RTO handshaking), the proposed BCLARC, which is robust and of early output type, achieves: (i) 8% (5.7%) reduction in PCTP compared to the optimum RCA, (ii) 14.9% (15.5%) reduction in PCTP compared to the optimum BCLARC, and (iii) 26% (25.5%) reduction in PCTP compared to the optimum CSLA.
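The dual-rail code with return-to-zero signalling mentioned in this abstract can be illustrated with a small behavioural sketch. It assumes the common convention in which each bit travels on two wires, (1, 0) encodes 1, (0, 1) encodes 0 and (0, 0) is the spacer. All function names are hypothetical, and the sketch models only the data encoding, not the adder or the handshake circuitry of the paper.

```python
SPACER = (0, 0)  # RTZ spacer: both rails low between data tokens


def encode_bit(bit):
    """Encode one bit as a (rail1, rail0) pair."""
    return (1, 0) if bit else (0, 1)


def decode_pair(pair):
    """Return 0 or 1 for valid data, None for the spacer."""
    if pair == (1, 0):
        return 1
    if pair == (0, 1):
        return 0
    if pair == SPACER:
        return None
    raise ValueError("(1, 1) is not a legal dual-rail RTZ codeword")


def encode_word(value, width):
    """Encode an unsigned integer as dual-rail pairs, least significant bit first."""
    return [encode_bit((value >> i) & 1) for i in range(width)]


def decode_word(pairs):
    """Decode a complete word; return None while any bit is still a spacer."""
    bits = [decode_pair(p) for p in pairs]
    if any(b is None for b in bits):
        return None  # data not yet complete
    return sum(b << i for i, b in enumerate(bits))


def rtz_stream(values, width):
    """Yield data words separated by all-spacer words, as in 4-phase RTZ."""
    for v in values:
        yield encode_word(v, width)
        yield [SPACER] * width
```

A receiver can detect that a word has arrived purely from the rails, since `decode_word` returns a value only once every pair has left the spacer state.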

Fast system-level design space exploration for low power configurable multimedia systems-on-chip

15th Annual IEEE International ASIC/SOC Conference ◽

10.1109/asic.2002.1158047 ◽

2003 ◽

Cited By ~ 4

Author(s):

F. Polloni ◽

L. Mazzoni ◽

S. Di Matteo

Keyword(s):

Low Power ◽

Design Space Exploration ◽

Design Space ◽

Space Exploration ◽

Multimedia Systems ◽

System Level ◽

System Level Design ◽

Fast System ◽

Systems On Chip ◽

On Chip

Argus CNN Accelerator Based on Kernel Clustering and Resource-Aware Pruning

Elektronika ir Elektrotechnika ◽

10.5755/j02.eie.28922 ◽

2021 ◽

Vol 27 (3) ◽

pp. 57-70

Author(s):

Damjan M. Rakanovic ◽

Vuk Vranjkovic ◽

Rastislav J. R. Struharik

Keyword(s):

Digital Signal Processor ◽

State Of The Art ◽

Digital Signal ◽

Pruning Algorithm ◽

Kernel Clustering ◽

Field Programmable ◽

Comparable Performance ◽

On Chip ◽

Resource Characteristics ◽

Resource Aware

The paper proposes a two-step Convolutional Neural Network (CNN) pruning algorithm and a resource-efficient Field-Programmable Gate Array (FPGA) CNN accelerator named “Argus”. The proposed CNN pruning algorithm first combines similar kernels into clusters, which are then pruned using the same regular pruning pattern. The pruning algorithm is carefully tailored for FPGAs, considering their resource characteristics. Regular sparsity results in high Multiply-Accumulate (MAC) efficiency, reducing the amount of logic required to balance workloads among different MAC units. As a result, the Argus accelerator requires about 170 Look-Up Tables (LUTs) per Digital Signal Processor (DSP) block. This number is close to the average LUT/DSP ratio for various FPGA families, enabling balanced resource utilization when implementing Argus. Benchmarks conducted using a Xilinx Zynq UltraScale+ Multi-Processor System-on-Chip (MPSoC) indicate that Argus achieves up to 25 times higher frames per second than NullHop, 2 and 2.5 times higher than NEURAghe and Snowflake, respectively, and 2 times higher than NVDLA. Argus shows comparable performance to MIT’s Eyeriss v2 and Caffeine, requiring up to 3 times less memory bandwidth and utilizing 4 times fewer DSP blocks, respectively. Besides the absolute performance, Argus has at least 1.3 and 2 times better GOP/s/DSP and GOP/s/Block-RAM (BRAM) ratios, while being competitive in terms of GOP/s/LUT, compared to some of the state-of-the-art solutions.

Design space exploration for robust power delivery in TSV based 3-D systems-on-chip

2012 IEEE International SOC Conference ◽

10.1109/socc.2012.6398327 ◽

2012 ◽

Author(s):

Suhas M. Satheesh ◽

Emre Salman

Keyword(s):

Design Space Exploration ◽

Design Space ◽

Space Exploration ◽

Power Delivery ◽

Systems On Chip ◽

On Chip

Stereo vision architecture for heterogeneous systems-on-chip

Journal of Real-Time Image Processing ◽

10.1007/s11554-018-0782-z ◽

2018 ◽

Vol 17 (2) ◽

pp. 393-415 ◽

Cited By ~ 3

Author(s):

Stefania Perri ◽

Fabio Frustaci ◽

Fanny Spagnolo ◽

Pasquale Corsonello

Keyword(s):

Stereo Vision ◽

Heterogeneous Systems ◽

Systems On Chip ◽

On Chip

Parallel-Pipelined Video Processing in Multicore Heterogeneous Systems on Chip

Proceedings of Universities ELECTRONICS ◽

10.24151/1561-5405-2021-26-2-172-183 ◽

2021 ◽

Vol 26 (2) ◽

pp. 172-183

Author(s):

E.S. Yanakova ◽

G.T. Macharadze ◽

L.G. Gagarina ◽

A.A. Shvachko ◽

...

Keyword(s):

Video Processing ◽

High Efficiency ◽

Heterogeneous Systems ◽

Video Data ◽

System Level ◽

Video Information ◽

Systems On Chip ◽

The Face ◽

Parallel Pipelined ◽

On Chip

The shift from homogeneous to heterogeneous architectures brings advantages in efficiency, size, weight and power consumption, which is especially important for embedded solutions. However, developing parallel software for heterogeneous computer systems is a rather complex task because of the requirements for high efficiency, ease of programming and scalability. In this paper, the efficiency of parallel-pipelined processing of video information in multiprocessor heterogeneous systems on a chip (SoC) comprising units such as DSP, GPU, ISP, VDP, VPU and others has been investigated. A typical scheme of parallel-pipelined processing of video data using various accelerators has been presented. A scheme for parallel-pipelined processing of video data on the heterogeneous SoC 1892VM248 has been developed. Methods for efficient parallel-pipelined processing of video data in heterogeneous computers (SoC), covering the operating system level, the programming technologies level and the application level, have been proposed. A comparative analysis of the most common programming technologies, such as OpenCL, OpenMP, MPI and OpenAMP, has been performed. The analysis has shown that, depending on the intended purpose of the device, two programming paradigms should be applied: one based on OpenCL technology (for embedded systems) and one based on MPI technology (for inter-cell and inter-processor interaction). The results obtained for parallel-pipelined processing in a face recognition task have confirmed the effectiveness of the chosen solutions.

Decimation of Delta-Sigma-Modulated Signals Using a Low-Cost Microcontroller

Circuits Systems and Signal Processing ◽

10.1007/s00034-021-01772-z ◽

2021 ◽

Author(s):

Markeljan Fishta ◽

Franco Fiori

Keyword(s):

Digital Signal Processor ◽

High Speed ◽

Low Cost ◽

Digital Signal ◽

Experimental Tests ◽

General Purpose ◽

Data Conversion ◽

Clock Frequency ◽

Main Challenge ◽

Modulated Signals

Abstract. ΔΣ analog-to-digital converters (ADCs) are largely used in sensor acquisition applications. In the last few years, standalone ΔΣ modulators have become increasingly available as off-the-shelf parts. To build a complete ADC, a standalone modulator has to be paired with some advanced elaboration unit, such as a field programmable gate array (FPGA) or a digital signal processor (DSP), which is needed for the implementation of the decimation filter. This work investigates the use of low-cost, general-purpose microcontrollers for the decimation of ΔΣ-modulated signals. The main challenge is given by the clock frequency of the modulator, which can be in the range of a few MHz. The proposed technique deals with this limitation by employing two serial peripheral interface (SPI) modules in a time-interleaved configuration. This approach allows for continuous acquisition and elaboration of relatively high-speed, digital signals. The technique has been applied to a case study, and a data conversion system has been practically realized. The performance of the proposed filter is compared to that of a digital filter, present on board a commercial microcontroller, and the results of experimental tests are provided.
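To make the role of the decimation filter concrete, here is a minimal sketch of a cascaded integrator-comb (CIC) decimator operating on a 1-bit modulator stream. The abstract does not state which filter structure the authors use, so the choice of a CIC filter is an assumption made purely for illustration, as are the function name, the default order and the 0/1 input representation. The SPI time-interleaving described in the abstract is not modelled.

```python
def cic_decimate(bits, ratio, order=3):
    """Decimate a 0/1 bitstream with a CIC (sinc^order) filter.

    bits:  iterable of 0/1 modulator output samples
    ratio: decimation ratio R (output rate = input rate / R)
    order: number of integrator and comb stages (assumed default)
    Returns output samples normalised by the DC gain R**order.
    """
    integrators = [0] * order   # run at the modulator rate
    comb_delays = [0] * order   # run at the decimated rate
    gain = ratio ** order
    out = []
    for n, sample in enumerate(bits):
        acc = sample
        for i in range(order):
            integrators[i] += acc
            acc = integrators[i]
        if (n + 1) % ratio == 0:          # keep every R-th sample
            for i in range(order):
                previous = comb_delays[i]
                comb_delays[i] = acc
                acc -= previous            # comb: y[k] = x[k] - x[k-1]
            out.append(acc / gain)
    return out
```

After the initial transient, each output approximates the local density of ones in the bitstream, which is the quantity a ΔΣ modulator encodes. The integrators use only additions, which is one reason this structure is often considered for devices without hardware multipliers.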

A four-channel digital signal processor in 1.2-μm CMOS with on-chip D/A and A/D conversion serving four speech channels in a new-generation subscriber line circuit

IEEE Journal of Solid-State Circuits ◽

10.1109/4.92024 ◽

1991 ◽

Vol 26 (7) ◽

pp. 1038-1046 ◽

Cited By ~ 3

Author(s):

D. Haspeslagh ◽

J. Sevenhans ◽

A. Delarbre ◽

L. Kiss ◽

E. Moerman

Keyword(s):

Digital Signal Processor ◽

Digital Signal ◽

Subscriber Line ◽

On Chip ◽

New Generation ◽

Signal Processor

Low Power, High Speed and Area Efficient Binary Count Multiplier

Journal of Circuits System and Computers ◽

10.1142/s0218126616500274 ◽

2016 ◽

Vol 25 (04) ◽

pp. 1650027 ◽

Cited By ~ 1

Author(s):

Kore Sagar Dattatraya ◽

Belgudri Ritesh Appasaheb ◽

Ramdas Bhanudas Khaladkar ◽

V. S. Kanchana Bhaaskaran

Keyword(s):

Digital Signal Processor ◽

Word Length ◽

High Speed ◽

Digital Signal ◽

General Purpose ◽

Computation Method ◽

Partial Product ◽

Wallace Tree ◽

Power Delay Product ◽

Binary Count

The multiplier forms the core building block of any processor, such as a digital signal processor (DSP) or a general-purpose microprocessor. As the word length increases, the number of adders or compressors required for the partial product addition also increases. The addition of the derived partial products determines the circuit latency, area and speed performance of wider word-length multipliers. This paper proposes the binary count multiplier (BCM), which aims to reduce the number of adders and compressors through the use of a uniquely structured binary counter and by suitably altering the logical flow of partial product addition using binary adders. The binary counters for varying bit count values are derived by modifying the basic 4:2 compressor circuit. A [Formula: see text] bit multiplier has been developed to validate the proposed computation method. This logic structure demonstrates lower power operation, reduced device count and lower delay in comparison with the conventional Wallace tree multiplier structure found in the literature. The BCM implementation realizes a 29.17% reduction in the device count, a 66% reduction in the delay and a 70% reduction in the power dissipation. Furthermore, it realizes a 90% reduction in the power delay product (PDP) in comparison with the Wallace tree structure. The multiplier circuits have been implemented and the results validated using the Cadence[Formula: see text] EDA tool. Forty-five-nanometer technology files have been employed for the designs and exhaustive SPICE simulations.
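The reported figures can be cross-checked with a short calculation. Since the power delay product is power multiplied by delay, independent fractional reductions in each combine multiplicatively. The helper below is a hypothetical illustration of that arithmetic and is not taken from the paper.

```python
def combined_reduction(*reductions):
    """Combine fractional reductions of factors whose product is the metric.

    Each argument is a reduction in one factor, e.g. 0.66 for 66%.
    The product shrinks to prod(1 - r), so the overall reduction is
    1 - prod(1 - r).
    """
    remaining = 1.0
    for r in reductions:
        remaining *= (1.0 - r)
    return 1.0 - remaining


# 66% lower delay and 70% lower power, as quoted in the abstract
pdp_reduction = combined_reduction(0.66, 0.70)
```

Here the remaining PDP is 0.34 × 0.30 = 0.102 of the reference, so `pdp_reduction` is 0.898, which agrees with the roughly 90% PDP reduction stated in the abstract.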
