A Timing-Driven Partitioning System for Multiple FPGAs

Kalapi Roy; Carl Sechen

doi:10.1155/1996/49565

VLSI Design ◽

10.1155/1996/49565 ◽

1996 ◽

Vol 4 (4) ◽

pp. 309-328

Author(s):

Kalapi Roy ◽

Carl Sechen

Keyword(s):

Simulated Annealing ◽

Performance Optimization ◽

Clustering Algorithm ◽

Linear Time ◽

Critical Path ◽

Path Delay ◽

Timing Model ◽

Longest Path ◽

Partitioning Algorithm ◽

Circuit Components

Field-programmable systems with multiple FPGAs on a PCB or an MCM are being used by system designers when a single FPGA is not sufficient. We address the problem of partitioning a large technology-mapped FPGA circuit onto multiple FPGA devices of a specific target technology. The physical characteristics of the multiple FPGA system (MFS) impose additional constraints on the circuit partitioning algorithms: the capacity of each FPGA, the timing constraints, the number of I/Os per FPGA, and the pre-designed interconnection patterns of each FPGA and the package. Existing partitioning techniques, which minimize only the cut sizes of partitions, fail to satisfy these constraints. We therefore present a timing-driven N-way partitioning algorithm based on simulated annealing for technology-mapped FPGA circuits. The signal path delays are estimated during partitioning using a timing model specific to a multiple FPGA architecture. The model combines all possible delay factors in a system with multiple FPGA chips of a target technology. Furthermore, we have incorporated a new dynamic net-weighting scheme to minimize the number of pin-outs for each chip. Finally, we have developed a graph-based global router for pin assignment which can handle the pre-routed connections of our MFS structure. In order to reduce the time spent in the simulated annealing phase of the partitioner, clusters of circuit components are identified by a new linear-time bottom-up clustering algorithm. The annealing-based N-way partitioner runs four times faster on the clusters than on a flat netlist, and with improved partitioning results. For several industrial circuits, our approach outperforms the recursive min-cut bi-partitioning algorithm by 35% in terms of nets cut. Our approach also outperforms an industrial FPGA partitioner by 73% on average in terms of unroutable nets.
Using the performance optimization capabilities in our approach we have successfully partitioned the MCNC benchmarks satisfying the critical path constraints and achieving a significant reduction in the longest path delay. An average reduction of 17% in the longest path delay was achieved at the cost of 5% in total wire length.
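
As a rough illustration of annealing-based N-way partitioning, the sketch below moves single cells between partitions and accepts uphill moves with the usual Metropolis probability. It is not the authors' system: the cost here is only the number of nets cut under a per-partition capacity limit, whereas the abstract describes additional timing, pin-count, and routing terms. All names (`nets_cut`, `anneal_partition`) and the cooling parameters are assumptions made for this example.

```python
import math
import random


def nets_cut(nets, part):
    """Count nets whose cells span more than one partition."""
    return sum(1 for net in nets if len({part[c] for c in net}) > 1)


def anneal_partition(num_cells, nets, n_parts, capacity, seed=0,
                     t_start=2.0, t_end=0.01, alpha=0.9, moves_per_temp=200):
    """Toy simulated annealing: minimize nets cut, at most `capacity` cells per part."""
    rng = random.Random(seed)
    # Round-robin start; assumed to satisfy the capacity limit.
    part = [c % n_parts for c in range(num_cells)]
    sizes = [part.count(p) for p in range(n_parts)]
    cost = nets_cut(nets, part)
    best, best_cost = part[:], cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            cell = rng.randrange(num_cells)
            src, dst = part[cell], rng.randrange(n_parts)
            if dst == src or sizes[dst] >= capacity:
                continue  # skip no-op moves and moves into a full partition
            part[cell] = dst
            new_cost = nets_cut(nets, part)  # recomputed in full for clarity
            delta = new_cost - cost
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                sizes[src] -= 1
                sizes[dst] += 1
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = part[:], cost
            else:
                part[cell] = src  # reject: undo the move
        t *= alpha  # geometric cooling schedule (an assumption)
    return best, best_cost
```

Note that the capacity must leave some slack: if every partition starts full, no single-cell move is legal and the loop never changes the assignment.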

A Performance-Oriented Circuit Partitioning Algorithm with Logic-Block Replication for Multi-FPGA Systems

Journal of Circuits System and Computers ◽

10.1142/s0218126697000280 ◽

1997 ◽

Vol 07 (05) ◽

pp. 373-393

Author(s):

Nozomu Togawa ◽

Masao Sato ◽

Tatsuo Ohtsuki

Keyword(s):

Network Flow ◽

Critical Path ◽

Path Delay ◽

Circuit Partitioning ◽

Critical Signal ◽

Delay Constraints ◽

Partitioning Algorithm ◽

Critical Paths ◽

Inputs And Outputs ◽

Logic Blocks

In this paper, we extend the circuit partitioning algorithm which we had proposed for multi-FPGA systems and present a new algorithm in which the delay of each critical signal path is within a specified upper bound imposed on it. The core of the presented algorithm is recursive bipartitioning of a circuit. The bipartitioning procedure consists of three stages: (0) detection of critical paths; (1) bipartitioning of a set of primary inputs and outputs; and (2) bipartitioning of a set of logic-blocks. In (0), the algorithm computes the lower bounds of delays for paths with path delay constraints and detects the critical paths based on the difference between the lower and upper bounds dynamically in every bipartitioning procedure. The delays of the critical paths are reduced with higher priority. In (1), the algorithm attempts to assign the primary inputs and outputs on each critical path to one chip so that the critical path does not cross between chips. Finally in (2), the algorithm not only decreases the number of crossings between chips but also assigns the logic-blocks on each critical path to one chip by exploiting a network flow technique. The algorithm has been implemented and applied to MCNC PARTITIONING 93 benchmark circuits. The experimental results demonstrate that it resolves almost all path delay constraints while keeping the maximum number of required I/O blocks per chip small compared with conventional algorithms.
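
Stage (0) can be pictured with a small sketch. The delay model below is an assumption made for illustration, not the paper's: a path's lower bound is the sum of its block delays plus a fixed `crossing_delay` for each pair of consecutive blocks already assigned to different chips, and a path is flagged as critical when the gap between its upper bound and this lower bound is at most `margin`. The function and parameter names are hypothetical.

```python
def delay_lower_bound(path, block_delay, chip_of, crossing_delay):
    """Lower bound on path delay given a partial block-to-chip assignment."""
    total = sum(block_delay[b] for b in path)
    for a, b in zip(path, path[1:]):
        chip_a, chip_b = chip_of.get(a), chip_of.get(b)
        # Only crossings that are already fixed can be counted in a lower bound.
        if chip_a is not None and chip_b is not None and chip_a != chip_b:
            total += crossing_delay
    return total


def critical_paths(paths, upper_bound, block_delay, chip_of,
                   crossing_delay, margin):
    """Return names of paths with slack <= margin, tightest first."""
    flagged = []
    for name, path in paths.items():
        lower = delay_lower_bound(path, block_delay, chip_of, crossing_delay)
        slack = upper_bound[name] - lower
        if slack <= margin:
            flagged.append((slack, name))
    return [name for _, name in sorted(flagged)]
```

Recomputing the lower bounds after each bipartitioning step, as more blocks receive chip assignments, mirrors the "dynamic" detection the abstract mentions.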

Design and Implementation of a Farrow-Interpolator-Based Digital Front-End in LTE Receivers for Carrier Aggregation

Electronics ◽

10.3390/electronics10030231 ◽

2021 ◽

Vol 10 (3) ◽

pp. 231

Author(s):

Chester Sungchung Park ◽

Sunwoo Kim ◽

Jooho Wang ◽

Sungkyung Park

Keyword(s):

Integrated Circuit ◽

Building Block ◽

Orthogonal Frequency Division Multiplexing ◽

Critical Path ◽

Phase Error ◽

System Level ◽

Comb Filter ◽

Carrier Aggregation ◽

Path Delay ◽

Front End

A digital front-end decimation chain based on both a Farrow interpolator for fractional sample-rate conversion and a digital mixer is proposed in order to comply with the long-term evolution standards in radio receivers with ten frequency modes. The design requirement specifications for adjacent channel selectivity, in-band blockers, and narrowband blockers are all satisfied, so the proposed digital front-end is 3GPP-compliant. Furthermore, the proposed digital front-end addresses carrier aggregation in the standards via appropriate frequency translations. The digital front-end has a cascaded integrator comb filter prior to the Farrow interpolator and also has a per-carrier carrier aggregation filter and channel selection filter following the digital mixer. A Farrow interpolator with integrate-and-dump circuitry controlled by a condition signal is proposed, along with a digital mixer with periodic reset to prevent phase error accumulation. From the standpoint of design methodology, three models are developed for the overall digital front-end, namely functional, cycle-accurate, and bit-accurate models. Performance is verified by means of the cycle-accurate model; subsequently, by means of a special C++ class, the bitwidths are minimized in a methodical manner for area minimization. For system-level performance verification, the orthogonal frequency division multiplexing receiver is also modeled. The critical path delay of each building block is analyzed and the spectral-domain view is obtained for each building block of the digital front-end circuitry. The proposed digital front-end circuitry is simulated, designed, synthesized in a 180 nm CMOS application-specific integrated circuit technology, and implemented in the Xilinx XC6VLX550T field-programmable gate array (Xilinx, San Jose, CA, USA).
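
To make one block of such a chain concrete, here is a minimal sketch of a textbook cascaded integrator comb (CIC) decimator with unit differential delay, using integer arithmetic. It is not the authors' design: the stage count, decimation rate, and the name `cic_decimate` are assumptions, and the Farrow interpolator, mixer, and channel filters are not modeled.

```python
def cic_decimate(samples, rate, stages=1):
    """N-stage CIC decimator: integrators at the input rate, combs at the output rate."""
    integrators = [0] * stages
    comb_prev = [0] * stages  # previous input of each comb stage
    out = []
    for i, x in enumerate(samples):
        v = x
        for s in range(stages):
            integrators[s] += v  # running sum at the high sample rate
            v = integrators[s]
        if (i + 1) % rate == 0:  # keep every `rate`-th integrator output
            for s in range(stages):
                # comb: y[m] = v[m] - v[m-1] at the decimated rate
                v, comb_prev[s] = v - comb_prev[s], v
            out.append(v)
    return out
```

For a constant input the output settles to `rate ** stages`, which is the DC gain a hardware implementation has to budget bitwidth for.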

A linear-time algorithm for the longest path problem in rectangular grid graphs

Discrete Applied Mathematics ◽

10.1016/j.dam.2011.08.010 ◽

2012 ◽

Vol 160 (3) ◽

pp. 210-217 ◽

Cited By ~ 18

Author(s):

Fatemeh Keshavarz-Kohjerdi ◽

Alireza Bagheri ◽

Asghar Asgharian-Sardroud

Keyword(s):

Linear Time ◽

Time Algorithm ◽

Linear Time Algorithm ◽

Rectangular Grid ◽

Grid Graphs ◽

Longest Path ◽

Longest Path Problem

High Efficiency Generalized Parallel Counters for Look-Up Table Based FPGAs

International Journal of Reconfigurable Computing ◽

10.1155/2015/518272 ◽

2015 ◽

Vol 2015 ◽

pp. 1-16 ◽

Cited By ~ 4

Author(s):

Burhan Khurshid ◽

Roohie Naaz Mir

Keyword(s):

Power Dissipation ◽

High Speed ◽

High Efficiency ◽

Critical Path ◽

Fir Filters ◽

Path Delay ◽

Look Up Table ◽

Improved Performance ◽

Ip Cores ◽

Low Efficiency

Generalized parallel counters (GPCs) are used in constructing high speed compressor trees. Prior work has focused on utilizing the fast carry chain and mapping the logic onto Look-Up Tables (LUTs). This mapping is not optimal in the sense that the LUT fabric is not fully utilized, which results in low efficiency GPCs. In this work, we present a heuristic that efficiently maps the GPC logic onto the LUT fabric. We have used our heuristic on various GPCs and have achieved an improvement in efficiency ranging from 33% to 100% in most of the cases. Experimental results using Xilinx 5th-, 6th-, and 7th-generation FPGAs and Stratix IV and V devices from Altera show a considerable reduction in resource utilization and dynamic power dissipation, for almost the same critical path delay. We have also implemented GPC-based FIR filters on 7th-generation Xilinx FPGAs using our proposed heuristic and compared their performance against conventional implementations. Implementations based on our heuristic show improved performance. Comparisons are also made against filters based on integrated DSP blocks and inherent IP cores from Xilinx. The results show that the proposed heuristic provides performance that is comparable to the structures based on these specialized resources.

Layout-Aware Critical Path Delay Test Under Maximum Power Supply Noise Effects

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems ◽

10.1109/tcad.2011.2163159 ◽

2011 ◽

Vol 30 (12) ◽

pp. 1923-1934 ◽

Cited By ~ 18

Author(s):

Junxia Ma ◽

Mohammad Tehranipoor

Keyword(s):

Power Supply ◽

Critical Path ◽

Maximum Power ◽

Path Delay ◽

Power Supply Noise ◽

Delay Test ◽

Noise Effects ◽

Critical Path Delay ◽

Supply Noise ◽

Path Delay Test

Exploring Linear Structures of Critical Path Delay Faults to Reduce Test Efforts

2006 IEEE/ACM International Conference on Computer Aided Design ◽

10.1109/iccad.2006.320072 ◽

2006 ◽

Author(s):

Shun-yen Lu ◽

Pei-ying Hsieh ◽

Jing-jia Liou

Keyword(s):

Critical Path ◽

Delay Faults ◽

Path Delay ◽

Path Delay Faults ◽

Linear Structures ◽

Critical Path Delay

A design methodology for approximate multipliers in convolutional neural networks: A case of MNIST

International Journal of Reconfigurable and Embedded Systems (IJRES) ◽

10.11591/ijres.v10.i1.pp1-10 ◽

2021 ◽

Vol 10 (1) ◽

pp. 1

Author(s):

Kenta Shirane ◽

Takahiro Yamamoto ◽

Hiroyuki Tomiyama

Keyword(s):

Neural Network ◽

Neural Networks ◽

Convolutional Neural Network ◽

Design Methodology ◽

Critical Path ◽

High Accuracy ◽

Path Delay ◽

Trade Off ◽

Critical Path Delay

In this paper, we present a case study on approximate multipliers for an MNIST Convolutional Neural Network (CNN). We apply approximate multipliers with different bit-widths to the convolution layer in the MNIST CNN, evaluate the accuracy of MNIST classification, and analyze the trade-off between the approximate multiplier's area, its critical path delay, and the accuracy. Based on the results of the evaluation and analysis, we propose a design methodology for approximate multipliers. The approximate multipliers consist of some partial products, which are carefully selected according to the CNN input. With this methodology, we further reduce the area and the delay of the multipliers while keeping high accuracy of the MNIST classification.

A Contraction-based Ratio-cut Partitioning Algorithm

VLSI Design ◽

10.1080/1065514021000012093 ◽

2002 ◽

Vol 15 (2) ◽

pp. 485-489

Author(s):

Youssef Saab

Keyword(s):

Linear Time ◽

Fundamental Problem ◽

Cluster Formation ◽

Vlsi Circuits ◽

Iterative Improvement ◽

Partitioning Algorithm ◽

Partitioning Algorithms ◽

Simple Ratio ◽

Iterative Partitioning

Partitioning is a fundamental problem in the design of VLSI circuits. In recent years, ratio-cut partitioning has received attention due to its tendency to partition circuits into their natural clusters. Node contraction has also been shown to enhance the performance of iterative partitioning algorithms. This paper describes a new simple ratio-cut partitioning algorithm using node contraction. This new algorithm combines iterative improvement with progressive cluster formation. Under suitably mild assumptions, the new algorithm runs in linear time. It is also shown that the new algorithm compares favorably with previous approaches.
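
For readers unfamiliar with the objective, the sketch below evaluates the commonly used ratio-cut measure, cut(A, B) / (|A| · |B|), on a plain graph. The abstract does not spell out the exact objective or the contraction procedure, so this only illustrates the metric itself; `ratio_cut` is a hypothetical helper name.

```python
def ratio_cut(edges, side_a, side_b):
    """Ratio-cut value: number of crossing edges divided by |A| * |B|."""
    a, b = set(side_a), set(side_b)
    if not a or not b:
        raise ValueError("both sides must be non-empty")
    cut = sum(1 for u, v in edges
              if (u in a and v in b) or (u in b and v in a))
    return cut / (len(a) * len(b))
```

On two triangles joined by a single edge, splitting between the triangles gives 1/9, while peeling off one node gives 2/5. The denominator penalizes lopsided splits without imposing a fixed balance constraint, which is why the measure tends to follow natural clusters.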

Novel Design of Low-Power High-Speed Hybrid Full Adder Design using Gate Diffusion Input (GDI) Technique

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.l7992.1091220 ◽

2020 ◽

Vol 9 (12) ◽

pp. 323-328

Keyword(s):

Power Consumption ◽

Low Power ◽

High Speed ◽

Critical Path ◽

Circuit Simulation ◽

Full Adder ◽

Cmos Process ◽

Path Delay ◽

Process Technology ◽

Xnor Gate

VLSI technology has become one of the most significant and in-demand technologies because of characteristics such as device portability, device size, large number of features, cost, reliability, and speed. Multipliers and adders play an important role in various digital systems such as computers, process controllers, and signal processors, where high speed and low power are required. Two-input XOR/XNOR gates and 2:1 multiplexer modules are used to design the hybrid full adders. The XOR/XNOR gate is the main consumer of power in the full adder cell. However, this circuit increases the delay, area, and critical path delay. Hence, an optimum design of the XOR/XNOR gate is required to reduce the power consumption of the full adder cell. Six new hybrid full adder circuits are therefore proposed, based on novel full-swing XOR/XNOR gates and a new Gate Diffusion Input (GDI) design of a full adder with high-swing outputs. Speed, power consumption, power-delay product, and driving capability are the merits of each proposed circuit. The circuit simulation was carried out using the Cadence Virtuoso EDA tool. The simulation results are based on a 90 nm CMOS process technology model.

Design of delay efficient Booth multiplier using pipelining

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.16.11423 ◽

2018 ◽

Vol 7 (2.16) ◽

pp. 94

Author(s):

Abhishek Choubey ◽

SPV Subbarao ◽

Shruti B. Choubey

Keyword(s):

Critical Path ◽

Arithmetic Operation ◽

Vlsi Design ◽

Digital Signal ◽

Path Delay ◽

Large Area ◽

Booth Multiplier ◽

Critical Path Delay ◽

Long Latency ◽

Comparison Results

Multiplication is one of the most essential arithmetic operations, used in numerous applications in digital signal processing and communications. These applications need transformations, convolutions, and dot products that involve an enormous number of multiplications of an operand with a constant. Typical examples include wavelets and digital filters such as FIR or IIR. However, multiplier structures have a relatively large area-delay product, long latency, and significantly higher power consumption compared to other arithmetic structures. Therefore, low power multiplier design has always been a significant part of DSP structures for VLSI design. The Booth multiplier is promising as the most efficient among multipliers, as it reduces complexity considerably more than the others. In this paper, we propose a Booth multiplier using seamless pipelining. Theoretical comparison results show that the proposed Booth multiplier has a lower critical path delay than the traditional Booth multiplier. ASIC simulation results show that the proposed radix-16 Booth multiplier has 13% less critical path delay for word width n=16 and 17% less critical path delay for word width n=32 compared to the best existing radix-16 Booth multiplier.
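
The arithmetic behind a radix-16 Booth multiplier can be sketched in software. The code below shows standard radix-2^k Booth recoding of a two's-complement multiplier into signed digits (k=4 gives radix 16, with digits in [-8, 8]) and a multiply built from those digits. It models only the number representation, not the pipelined hardware the abstract proposes; the function names are assumptions.

```python
def booth_digits(y, n_bits, k=4):
    """Recode an n_bits two's-complement integer into radix-2**k Booth digits (LSB first)."""
    if n_bits % k:
        raise ValueError("n_bits must be a multiple of k")
    if not -(1 << (n_bits - 1)) <= y < (1 << (n_bits - 1)):
        raise ValueError("y does not fit in n_bits")
    u = y & ((1 << n_bits) - 1)  # raw bit pattern
    digits, prev = [], 0  # prev is the top bit of the previous group
    for i in range(0, n_bits, k):
        group = (u >> i) & ((1 << k) - 1)
        top = group >> (k - 1)
        low = group & ((1 << (k - 1)) - 1)
        # The top bit counts negatively here and is carried into the next group.
        digits.append(low + prev - (top << (k - 1)))
        prev = top
    return digits


def booth_multiply(x, y, n_bits, k=4):
    """Multiply x by y by summing one shifted partial product per Booth digit."""
    return sum(d * x * (1 << (k * i))
               for i, d in enumerate(booth_digits(y, n_bits, k)))
```

With k=4 a 16-bit multiplier yields four partial products instead of sixteen, at the price of needing multiples of the multiplicand up to 8x; that trade-off is what makes the partial-product stage a natural target for critical-path work.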