scholarly journals A Universal, Dynamically Adaptable and Programmable Network Router for Parallel Computers

VLSI Design ◽  
2001 ◽  
Vol 12 (1) ◽  
pp. 25-52 ◽  
Author(s):  
Taras I. Golota ◽  
Sotirios G. Ziavras

Existing message-passing parallel computers employ routers designed for a specific interconnection network and deal with fixed data channel width. There are disadvantages to this approach, because the system design and development times are significant and these routers do not permit run time network reconfiguration. Changes in the topology of the network may be required for better performance or faulttolerance. In this paper, we introduce a class of high-performance universal (statically and dynamically adaptable) programmable routers (UPRs) for message-passing parallel computers. The universality of these routers is based on their capability to adapt at run and/or static times according to the characteristics of the systems and/or applications. More specifically, the number of bidirectional data channels, the channel size and the I/O port mappings (for the implementation of a particular topology) can change dynamically and statically. Our research focuses on system-level specification issues of the UPRs, their VLSI design and their simulation to estimate their performance. Our simulation of data transfers via UPR routers employs VHDL code in the Mentor Graphics environment. The results show that the performance of the routers depends mostly on their current configuration. Details of the simulation and synthesis are presented.

Author(s):  
E. A. Ashcroft ◽  
A. A. Faustini ◽  
R. Jaggannathan ◽  
W. W. Wadge

In Chapter 1, we saw how Lucid could be used to express solutions to standard problems such as sorting and matrix multiplication. One of the unique characteristics of Lucid is not only that it can be used as a programming language but it can also be used as a “composition” language. That is, instead of using Lucid to specify computations, it can be used to express how computation components (expressed in some other language) can be “glued” together to form a coherent application. By doing so, the resulting application can enjoy some of the practical benefits attributable to Lucid such as high performance through exploitation of implicit parallelism and robustness through software fault tolerance. In this chapter, we discuss one such use of Lucid—as part of a hybrid language to construct parallel applications to be executed on conventional parallel computers. A conventional parallel computer either consists of a number of processors each with local memory interconnected by a network (as in distributed memory architectures) or a number of processors that share memory possibly using an interconnection network (as in shared memory architectures). The past decade has seen the advent of conventional parallel computers starting with the Denelcor HEP evolving to the CM-2 and Intel Hypercube and further evolving to the CM-5, Intel Paragon, Cray T3D, and IBM SP-2. Even networks of workstations (or workstation clusters) are seen as low-cost (“poor man’s”) parallel computers. Programming of conventional parallel computers has proven to be far more challenging than had been expected. Part of the reason is the continued use of low-level, explicitly parallel programming models such as PVM [42], Linda [10]. Two factors have fueled the continuing use of such languages despite their limited success. 1. The need to reuse existing sequential code because the cost of rewriting legacy applications from scratch is considered prohibitive both in economic and technical terms. 2. The need to run on conventional parallel computers that view a “parallel program” at a low level—as consisting of sequential processes that frequently synchronize and communicate with each other using some form of message passing.


2009 ◽  
Vol 2009 ◽  
pp. 1-9
Author(s):  
Manuel Saldaña ◽  
Emanuel Ramalho ◽  
Paul Chow

High-performance reconfigurable computers (HPRCs) provide a mix of standard processors and FPGAs to collectively accelerate applications. This introduces new design challenges, such as the need for portable programming models across HPRCs and system-level verification tools. To address the need for cosimulating a complete heterogeneous application using both software and hardware in an HPRC, we have created a tool called the Message-passing Simulation Framework (MSF). We have used it to simulate and develop an interface enabling an MPI-based approach to exchange data between X86 processors and hardware engines inside FPGAs. The MSF can also be used as an application development tool that enables multiple FPGAs in simulation to exchange messages amongst themselves and with X86 processors. As an example, we simulate a LINPACK benchmark hardware core using an Intel-FSB-Xilinx-FPGA platform to quickly prototype the hardware, to test the communications. and to verify the benchmark results.


Author(s):  
SOTIRIOS G. ZIAVRAS ◽  
MICHALIS A. SIDERAS

The direct binary hypercube interconnection network has been very popular for the design of parallel computers, because it provides a low diameter and can emulate efficiently the majority of the topologies frequently employed in the development of algorithms. The last fifteen years have seen major efforts to develop image analysis algorithms for hypercube-based parallel computers. The results of these efforts have culminated in a large number of publications included in prestigious scholarly journals and conference proceedings. Nevertheless, the aforementioned powerful properties of the hypercube come at the cost of high VLSI complexity due to the increase in the number of communication ports and channels per PE (processing element) with an increase in the total number of PE’s. The high VLSI complexity of hypercube systems is undoubtedly their dominant drawback; it results in the construction of systems that contain either a large number of primitive PE’s or a small number of powerful PE’s. Therefore, low-dimensional k-ary n-cubes with lower VSLI complexity have recently drawn the attention of many designers of parallel computers. Alternative solutions reduce the hypercube’s VLSI complexity without jeopardizing its performance. Such an effort by Ziavras has resulted in the introduction of reduced hypercubes (RH’s). Taking advantage of existing high-performance routing techniques, such as wormhole routing, an RH is obtained by a uniform reduction in the number of edges for each hypercube node. An RH can also be viewed as several connected copies of the well-known cube-connected-cycles network. The objective here is to prove that parallel computers comprising RH interconnection networks are definitely good choices for all levels of image analysis. Since the exact requirements of high-level image analysis are difficult to identify, but it is believed that versatile interconnection networks, such as the hypercube, are suitable for relevant tasks, we investigate the problem of emulating hypercubes on RH’s. The ring (or linear array), the torus (or mesh), and the binary tree are the most frequently used topologies for the development of algorithms in low-level and intermediate-level image analysis. Thus, to prove the viability of the RH for the two lower levels of image analysis, we introduce techniques for embedding the aforementioned three topologies into RH’s. The results prove the suitability of RH’s for all levels of image analysis.


2020 ◽  
Vol 15 ◽  
Author(s):  
Weiwen Zhang ◽  
Long Wang ◽  
Theint Theint Aye ◽  
Juniarto Samsudin ◽  
Yongqing Zhu

Background: Genotype imputation as a service is developed to enable researchers to estimate genotypes on haplotyped data without performing whole genome sequencing. However, genotype imputation is computation intensive and thus it remains a challenge to satisfy the high performance requirement of genome wide association study (GWAS). Objective: In this paper, we propose a high performance computing solution for genotype imputation on supercomputers to enhance its execution performance. Method: We design and implement a multi-level parallelization that includes job level, process level and thread level parallelization, enabled by job scheduling management, message passing interface (MPI) and OpenMP, respectively. It involves job distribution, chunk partition and execution, parallelized iteration for imputation and data concatenation. Due to the design of multi-level parallelization, we can exploit the multi-machine/multi-core architecture to improve the performance of genotype imputation. Results: Experiment results show that our proposed method can outperform the Hadoop-based implementation of genotype imputation. Moreover, we conduct the experiments on supercomputers to evaluate the performance of the proposed method. The evaluation shows that it can significantly shorten the execution time, thus improving the performance for genotype imputation. Conclusion: The proposed multi-level parallelization, when deployed as an imputation as a service, will facilitate bioinformatics researchers in Singapore to conduct genotype imputation and enhance the association study.


Author(s):  
Nhan Phan-Thien ◽  
Sangtae Kim

This monograph describes various methods for solving deformation problems of particulate solids, taking the reader from analytical to computational methods. The book is the first to present the topic of linear elasticity in mathematical terms that will be familiar to anyone with a grounding in fluid mechanics. It incorporates the latest advances in computational algorithms for elliptic partial differential equations, and provides the groundwork for simulations on high performance parallel computers. Numerous exercises complement the theoretical discussions, and a related set of self-documented programs is available to readers with Internet access. The work will be of interest to advanced students and practicing researchers in mechanical engineering, chemical engineering, applied physics, computational methods, and developers of numerical modeling software.


1996 ◽  
Vol 22 (6) ◽  
pp. 789-828 ◽  
Author(s):  
William Gropp ◽  
Ewing Lusk ◽  
Nathan Doss ◽  
Anthony Skjellum

2013 ◽  
Vol 718-720 ◽  
pp. 1645-1650
Author(s):  
Gen Yin Cheng ◽  
Sheng Chen Yu ◽  
Zhi Yong Wei ◽  
Shao Jie Chen ◽  
You Cheng

Commonly used commercial simulation software SYSNOISE and ANSYS is run on a single machine (can not directly run on parallel machine) when use the finite element and boundary element to simulate muffler effect, and it will take more than ten days, sometimes even twenty days to work out an exact solution as the large amount of numerical simulation. Use a high performance parallel machine which was built by 32 commercial computers and transform the finite element and boundary element simulation software into a program that can running under the MPI (message passing interface) parallel environment in order to reduce the cost of numerical simulation. The relevant data worked out from the simulation experiment demonstrate that the result effect of the numerical simulation is well. And the computing speed of the high performance parallel machine is 25 ~ 30 times a microcomputer.


1998 ◽  
Vol 08 (03) ◽  
pp. 337-350
Author(s):  
Sook-Yeon Kim ◽  
Oh-Heum Kwon ◽  
Kyung-Yong Chwa

Hypermeshes have been given much attention as a versatile interconnection network of parallel computers. A hypermesh is obtained from a mesh by replacing each linear connection with a hyperedge. In this paper, we show how to embed a butterfly or multiple copies of a butterfly into a hypermesh. First, a butterfly B(s) of (s + 1)2s nodes is embedded into a 2s × X hypermesh where X = 2⌊ log 2 s ⌋+ 1. Second, the butterfly B(s) is embedded into a square hypermesh. Third, multiple copies of the butterfly B(s) are embedded into a hypermesh of variable aspect ratio. The efficiency of these embeddings is measured by alignment cost, congestion, and expansion. The alignment cost of all of these embeddings is optimal. The congestion of the first and third embedding is optimal. The expansion of the first and third embedding is one if s = 2k - 1 for some integer k, otherwise, less than two. The expansion of the second embedding is 2 + ∊ (s) where ∊(s) = (2 log (s + 1) + 2)/(s + 1).


Sign in / Sign up

Export Citation Format

Share Document