Managing the Execution of Large Scale MPI Applications on Computational Grids

Author(s):  
A. de P. Nascimento ◽  
A. da C. Sena ◽  
J.A. Da Silva ◽  
D.Q.C. Vianna ◽  
C. Boeres ◽  
...  
Author(s):  
Zahid Raza ◽  
Deo P. Vidyarthi

A computational grid, with its distributed load sharing, has evolved into a platform for large-scale problem solving. A grid is a collection of heterogeneous resources offering services of varying natures, in which jobs may be submitted to any of the participating nodes. Scheduling these jobs in such a complex and dynamic environment poses many challenges. Reliability analysis of the grid gains paramount importance because the grid involves a large number of resources that may fail at any time, making it unreliable. These failures waste both computational power and money on the scarce grid resources. It is normally desired that a job be scheduled in an environment that ensures maximum reliability for its execution. This work presents a reliability-based scheduling model for jobs on the computational grid. The model considers the failure rates of both the software and hardware grid constituents: the application demanding execution, the nodes executing the job, and the network links supporting data exchange between the nodes. Job allocation under the proposed scheme becomes trusted, as it schedules the job based on an a priori reliability computation.
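The reliability computation described above can be sketched minimally as follows, assuming an exponential failure model (each component with failure rate λ survives time t with probability e^(−λt)); the function names, rates, and candidate nodes are illustrative, not taken from the paper.

```python
import math

def job_reliability(node_rate, link_rates, sw_rate, exec_time):
    """Reliability of one allocation: independent exponential failures
    of the node, its network links, and the software multiply, so the
    survival probability is exp(-(sum of rates) * execution time)."""
    total_rate = node_rate + sum(link_rates) + sw_rate
    return math.exp(-total_rate * exec_time)

def most_reliable_node(candidates):
    """Pick the candidate maximizing a priori job reliability.
    candidates: list of (name, node_rate, link_rates, sw_rate, exec_time)."""
    return max(candidates, key=lambda c: job_reliability(*c[1:]))[0]

# Hypothetical candidates: a fast node with higher failure rates vs. a
# slower but more stable one.
nodes = [
    ("fast-but-flaky", 1e-3, [5e-4], 1e-4, 100.0),
    ("slow-but-stable", 1e-4, [1e-4], 1e-4, 300.0),
]
print(most_reliable_node(nodes))
```

Note that the longer execution time on the slower node is penalized by the exponential model, so the trade-off between speed and failure rate falls out of the same formula.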


2007 ◽  
Vol 18 (01) ◽  
pp. 45-61 ◽  
Author(s):  
LIMOR FIX ◽  
ORNA GRUMBERG ◽  
AMNON HEYMAN ◽  
TAMIR HEYMAN ◽  
ASSAF SCHUSTER

Recent advances in scheduling and networking have paved the way for efficient exploitation of large-scale distributed computing platforms such as computational grids and huge clusters. Such infrastructures hold great promise for the highly resource-demanding task of verifying and checking large models, provided that model checkers are designed with a high degree of scalability and flexibility in mind. In this paper we focus on the mechanisms required to execute a high-performance, distributed, symbolic model checker on top of a large-scale distributed environment. We develop a hybrid algorithm for slicing the state space and dynamically distributing the work among the worker processes. We show that the new approach is faster, more effective, and thus much more scalable than previous slicing algorithms. We then present a checkpoint-restart module that has very low overhead. This module can be used to combat failures, the likelihood of which increases with the size of the computing platform. However, checkpoint-restart is even handier for the scheduling system: it can be used to avoid reserving large numbers of workers, thus making the distributed computation work-efficient. Finally, we discuss for the first time the effect of reordering on the distributed model checker and show how the distributed system performs reordering more efficiently than the sequential one. We implemented our contributions on a network of 200 processors, using a distributed scalable scheme that employs a high-performance industrial model checker from Intel. Our results show that the system was able to verify real-life models much larger than was previously possible.
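As a toy illustration of the slicing idea, the sketch below partitions an explicit set of states among workers and crudely rebalances an overfull slice; the paper's slicer operates on symbolic (BDD) window functions with a hybrid static/dynamic strategy, so this is only an explicit-state analogue with invented names.

```python
def slice_states(states, num_workers):
    """Toy explicit-state slicing: each worker owns the states that
    hash into its slice. (The paper slices symbolic BDD state sets,
    not explicit state collections.)"""
    slices = [set() for _ in range(num_workers)]
    for s in states:
        slices[hash(s) % num_workers].add(s)
    return slices

def rebalance(slices, threshold):
    """Crude dynamic redistribution: shift states from the fullest
    slice to the emptiest one, mimicking dynamic work distribution."""
    big = max(slices, key=len)
    small = min(slices, key=len)
    while len(big) > threshold and len(big) - len(small) > 1:
        small.add(big.pop())
    return slices
```

The point of the illustration is the invariant a slicer must keep: the slices are disjoint and their union is the full state set, whatever the balancing policy does.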


Author(s):  
MALARVIZHI NANDAGOPAL ◽  
S. GAJALAKSHMI ◽  
V. RHYMEND UTHARIARAJ

Computational grids have the potential for solving large-scale scientific applications using heterogeneous and geographically distributed resources. In addition to the challenges of managing and scheduling these applications, reliability challenges arise because of the unreliable nature of the grid infrastructure. Two major problems that are critical to the effective utilization of computational resources are efficient scheduling of jobs and providing fault tolerance in a reliable manner. This paper addresses these problems by combining a checkpoint-replication-based fault tolerance mechanism with the minimum total time to release (MTTR) job scheduling algorithm. TTR includes the service time of the job, the waiting time in the queue, and the transfer of input and output data to and from the resource. The MTTR algorithm minimizes the response time by selecting a computational resource based on job requirements, job characteristics, and the hardware features of the resources. The fault tolerance mechanism used here sets the job checkpoints based on the resource failure rate. If a resource failure occurs, the job is restarted from its last successful state using a checkpoint file from another grid resource. The Globus Toolkit is used as the grid middleware to set up a grid environment and evaluate the performance of the proposed approach. The monitoring tools Ganglia and Network Weather Service are used to gather hardware and network details, respectively. The experimental results demonstrate that the proposed approach effectively schedules grid jobs in a fault-tolerant way, thereby reducing the TTR of the jobs submitted to the grid. It also increases the percentage of jobs completed within the specified deadline, making the grid trustworthy.
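The TTR definition above (service time + queue wait + data transfer) directly suggests the selection rule, sketched here with invented resource/job fields; the checkpoint-interval helper uses Young's first-order approximation, a common choice for failure-rate-based checkpoint placement, not necessarily the paper's exact rule.

```python
import math

def ttr(resource, job):
    """Total time to release on one resource:
    queue wait + service time + input/output transfer time."""
    service = job["length"] / resource["speed"]
    transfer = (job["in_bytes"] + job["out_bytes"]) / resource["bandwidth"]
    return resource["wait"] + service + transfer

def mttr_schedule(job, resources):
    """MTTR rule: pick the resource minimizing TTR for this job."""
    return min(resources, key=lambda r: ttr(r, job))["name"]

def checkpoint_interval(overhead, failure_rate):
    """Young's approximation: interval = sqrt(2 * checkpoint_cost / lambda).
    Illustrative only; the paper sets checkpoints from the failure rate
    by its own scheme."""
    return math.sqrt(2.0 * overhead / failure_rate)

# Hypothetical job and resources.
job = {"length": 1000.0, "in_bytes": 8e6, "out_bytes": 2e6}
resources = [
    {"name": "A", "speed": 10.0, "bandwidth": 1e6, "wait": 5.0},
    {"name": "B", "speed": 5.0, "bandwidth": 1e7, "wait": 0.0},
]
print(mttr_schedule(job, resources))
```

Resource A wins here despite its slower network, because its service time dominates the comparison; the rule weighs all three TTR components jointly rather than optimizing any one of them.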


2013 ◽  
Vol 28 (2) ◽  
pp. 211-231 ◽  
Author(s):  
Joshua Peraza ◽  
Ananta Tiwari ◽  
Michael Laurenzano ◽  
Laura Carrington ◽  
Allan Snavely
2013 ◽  
Vol 5 (2) ◽  
pp. 72-91 ◽  
Author(s):  
Ashiqur Md. Rahman ◽  
Rashedur M Rahman

Computational grids are a promising platform for executing large-scale, resource-intensive applications. This paper identifies challenges in managing resources in a grid computing environment and proposes computational economy as a metaphor for effective management of resources and application scheduling. It identifies distributed resource management challenges and the requirements of economy-based grid systems, and proposes an economy-based negotiation protocol for cooperative and competitive trading of resources. Dynamic pricing for services and a good level of Pareto optimality make auctions more attractive for resource allocation than other economic models. In a complex grid environment, the communication demand can become a bottleneck, since many messages must be exchanged to match suitable service providers with consumers. The fuzzy-trust-integrated hybrid Capital Asset Pricing Model (CAPM) shows higher user-centric satisfaction and provides the equilibrium relationship between the expected return and the risk on investments. This paper also presents an analysis of the communication requirements and the necessity of the CAPMAuction in the grid environment.
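The equilibrium relationship the abstract refers to is the standard CAPM formula, E[R_i] = R_f + β_i (E[R_m] − R_f), with β estimated as cov(R_i, R_m) / var(R_m). A minimal sketch of just this standard relation (the paper's fuzzy-trust integration and auction protocol are not modeled here):

```python
def beta(asset_returns, market_returns):
    """Estimate beta_i = cov(R_i, R_m) / var(R_m) from return samples."""
    n = len(asset_returns)
    ma = sum(asset_returns) / n
    mm = sum(market_returns) / n
    cov = sum((a - ma) * (m - mm)
              for a, m in zip(asset_returns, market_returns)) / n
    var = sum((m - mm) ** 2 for m in market_returns) / n
    return cov / var

def capm_expected_return(risk_free, beta_i, market_return):
    """CAPM equilibrium: E[R_i] = R_f + beta_i * (E[R_m] - R_f)."""
    return risk_free + beta_i * (market_return - risk_free)

# Illustrative numbers: 3% risk-free rate, beta 1.5, 8% expected market return.
print(capm_expected_return(0.03, 1.5, 0.08))
```

Under CAPM, a resource (asset) whose returns co-move strongly with the market demands a higher expected return, which is the risk/return equilibrium the proposed pricing scheme builds on.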


2007 ◽  
Vol 591 ◽  
pp. 183-213 ◽  
Author(s):  
M. LANDRINI ◽  
A. COLAGROSSI ◽  
M. GRECO ◽  
M. P. TULIN

The generation and evolution of two-dimensional bores in water of uniform depth and on sloping beaches are simulated through numerical solution of the Euler equations using the smoothed particle hydrodynamics (SPH) method, wherein particles are followed in Lagrangian fashion, avoiding the need for computational grids. In water of uniform depth, a piston wavemaker produces cyclically breaking bores in the Froude number range 1.37–1.82, which were shown to move at time-averaged speeds in very good agreement with the requirements of global mass and momentum conservation. A single Strouhal number for the breaking period was discovered. Complex repetitive splashing patterns are observed and described, involving forward jet formation, growth, impact, and ricochet, and similarly backward jet formation and impact. Observed consequences were the creation of vortical regions of both signs, dipole creation through pairing, large-scale transport of surface water downward, and high tangential scouring velocities on the bed, which are quantified. These bores are further allowed to rise on linear slopes to the shoreline, where they are seen to collapse into a tongue-like flow resembling dam-break evolution. This essentially inviscid calculation is able to reproduce the development of a highly vortical flow in excellent agreement with experimental observations and theoretical concepts. The turbulent flow behaviour is partially described by the numerical solution.
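The grid-free character of SPH comes from estimating field quantities as kernel-weighted sums over neighbouring particles, e.g. the density ρ_i = Σ_j m_j W(|x_i − x_j|, h). A minimal 2-D sketch with the standard cubic spline kernel (a common choice; the paper's exact kernel and corrections may differ):

```python
import math

def cubic_spline_w(r, h):
    """Standard 2-D cubic spline SPH kernel with support radius 2h;
    sigma = 10 / (7 * pi * h^2) normalizes it to unit integral in 2-D."""
    q = r / h
    sigma = 10.0 / (7.0 * math.pi * h * h)
    if q < 1.0:
        return sigma * (1.0 - 1.5 * q**2 + 0.75 * q**3)
    if q < 2.0:
        return sigma * 0.25 * (2.0 - q)**3
    return 0.0

def density(particles, masses, h):
    """SPH density estimate rho_i = sum_j m_j W(|x_i - x_j|, h).
    O(N^2) for clarity; production codes use neighbour lists."""
    rho = []
    for xi, yi in particles:
        s = 0.0
        for (xj, yj), mj in zip(particles, masses):
            r = math.hypot(xi - xj, yi - yj)
            s += mj * cubic_spline_w(r, h)
        rho.append(s)
    return rho
```

Because every quantity is carried by the moving particles themselves, the free surface and breaking jets need no special grid treatment, which is what makes SPH attractive for the bore-breaking flows studied here.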

