Application Checkpointing Technique for Self-Healing From Failures in Mobile Grid Computing

2019 ◽  
Vol 11 (2) ◽  
pp. 50-62
Author(s):  
Amit Sadanand Savyanavar ◽  
Vijay Ram Ghorpade

A mobile grid (MG) consists of interconnected mobile devices which are used for high performance computing. Fault tolerance is an important property of mobile computational grid systems for achieving superior system reliability and faster recovery from failures. Since resource failures can fatally disrupt task execution, a fault tolerance service is essential to meet QoS requirements in an MG. Faults that occur in an MG include link failure, node failure, task failure, and limited bandwidth. Detecting these failures can help in better utilisation of the resources and timely notification to the user in an MG environment. These failures result in loss of computational results and data. Many algorithms and techniques have been proposed for failure handling in traditional grids. The authors propose a checkpointing-based failure-handling technique that improves system reliability and failure recovery time for the MG network. Experimentation was conducted by creating a grid of ubiquitously available Android-based mobile phones.
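The checkpoint-restart idea behind such a technique can be illustrated with a minimal sketch (not the authors' implementation; the file name, state layout, and checkpoint interval are assumptions): a task periodically serializes its progress so that, after a node failure, a restarted task resumes from the last checkpoint rather than from scratch.

```python
import os
import pickle
import tempfile

# Hypothetical checkpoint location; a real MG would checkpoint to stable
# storage reachable after the device rejoins the grid.
CHECKPOINT = os.path.join(tempfile.gettempdir(), "task.ckpt")

def save_checkpoint(state, path=CHECKPOINT):
    # Write atomically: dump to a temp file, then rename over the old
    # checkpoint so a crash mid-write never corrupts the last good one.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path=CHECKPOINT):
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"i": 0, "acc": 0}          # no checkpoint: fresh start

def run_task(n, ckpt_every=100):
    state = load_checkpoint()          # resume after a (simulated) failure
    for i in range(state["i"], n):
        state["acc"] += i              # the actual computation
        state["i"] = i + 1
        if state["i"] % ckpt_every == 0:
            save_checkpoint(state)     # periodic checkpoint
    return state["acc"]
```

Recovery time is then bounded by the checkpoint interval: at most `ckpt_every` iterations of work are lost per failure.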

Author(s):  
Simon McIntosh–Smith ◽  
Rob Hunt ◽  
James Price ◽  
Alex Warwick Vesztrocy

High-performance computing systems continue to increase in size in the quest for ever higher performance. The resulting increase in electronic component count, coupled with the decrease in feature sizes of the silicon manufacturing processes used to build these components, may make future exascale systems more susceptible to soft errors caused by cosmic radiation than current high-performance computing systems are. Through the use of techniques such as hardware-based error-correcting codes and checkpoint-restart, many of these faults can be mitigated, at the cost of increased hardware overhead, run-time, and energy consumption that can be as much as 10–20%. Some predictions expect these overheads to continue to grow over time. For extreme-scale systems, these overheads will represent megawatts of power consumption and millions of dollars of additional hardware costs, which could potentially be avoided with more sophisticated fault-tolerance techniques. In this paper we present new software-based fault tolerance techniques that can be applied to one of the most important classes of software in high-performance computing: iterative sparse matrix solvers. Our new techniques enable us to exploit knowledge of the structure of sparse matrices in such a way as to improve the performance, energy efficiency, and fault tolerance of the overall solution.
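A well-known software technique in this space, shown here as a generic sketch (a classic checksum-based ABFT idea, not necessarily the authors' exact method), protects a matrix-vector product with a precomputed column-sum checksum: if the sum of the output vector drifts from the checksum's dot product with the input, a silent error occurred during the multiply.

```python
import numpy as np

def checked_spmv(A, x, tol=1e-8):
    """Matrix-vector product with checksum-based silent-error detection."""
    # Checksum vector c = 1^T A; in a real solver this is computed once
    # per matrix, not per multiply.
    c = A.sum(axis=0)
    y = A @ x
    # Invariant: sum(y) == c . x up to rounding. A violated invariant
    # signals silent data corruption, triggering recomputation/rollback.
    if abs(y.sum() - c @ x) > tol * (np.abs(y).sum() + 1.0):
        raise RuntimeError("SDC detected in SpMV; recompute or roll back")
    return y
```

For sparse matrices the check costs one extra dot product and one vector sum per multiply, which is cheap relative to the SpMV itself.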


Author(s):  
Marc Casas ◽  
Wilfried N Gansterer ◽  
Elias Wimmer

We investigate the usefulness of gossip-based reduction algorithms in a high-performance computing (HPC) context. We compare them to state-of-the-art deterministic parallel reduction algorithms in terms of fault tolerance and resilience against silent data corruption (SDC) as well as in terms of performance and scalability. New gossip-based reduction algorithms are proposed, which significantly improve the state-of-the-art in terms of resilience against SDC. Moreover, a new gossip-inspired reduction algorithm is proposed, which promises a much more competitive runtime performance in an HPC context than classical gossip-based algorithms, in particular for low accuracy requirements.
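The flavour of gossip-based reduction can be conveyed with a sketch of the classic push-sum averaging algorithm (a standard baseline in this literature, not the authors' new variant; the synchronous rounds and uniform random targets are simplifying assumptions): each node repeatedly pushes half of its running sum and weight to a random peer, and every local ratio converges to the global average with no central coordinator.

```python
import random

def push_sum_average(values, rounds=300, seed=0):
    """Gossip (push-sum) estimation of the mean; every node converges."""
    rng = random.Random(seed)
    n = len(values)
    s = [float(v) for v in values]     # per-node running sums
    w = [1.0] * n                      # per-node weights
    for _ in range(rounds):
        new_s = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            j = rng.randrange(n)       # random gossip target
            # keep half the mass, push the other half to the target
            new_s[i] += s[i] / 2; new_w[i] += w[i] / 2
            new_s[j] += s[i] / 2; new_w[j] += w[i] / 2
        s, w = new_s, new_w
    # Mass conservation guarantees each ratio approaches the true mean.
    return [si / wi for si, wi in zip(s, w)]
```

The redundancy that makes gossip resilient to lost messages and corrupted contributions is also what costs runtime relative to deterministic tree reductions, which is the trade-off the paper examines.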


2021 ◽  
Author(s):  
Pedro Henrique Di Francia Rosso ◽  
Emilio Francesquini

The Message Passing Interface (MPI) standard is largely used in High-Performance Computing (HPC) systems. Such systems employ a large number of computing nodes, so Fault Tolerance (FT) is a concern: a large number of nodes leads to more frequent failures. Two essential components of FT are Failure Detection (FD) and Failure Propagation (FP). This paper proposes improvements to existing FD and FP mechanisms to provide more portability, scalability, and lower overhead. Results show that the proposed methods achieve results comparable to or better than existing methods while providing portability to any MPI standard-compliant distribution.
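A common basis for failure detection of this kind is a heartbeat protocol, sketched below (an illustrative model, not the paper's mechanism; the class name and timestamp-injection parameters are assumptions): each rank periodically reports liveness, and a rank whose heartbeat has not arrived within a timeout is declared failed, after which the failure is propagated to the surviving ranks.

```python
import time

class HeartbeatDetector:
    """Declares a rank failed if no heartbeat arrives within `timeout` s.

    `now` parameters allow injecting timestamps for deterministic testing;
    by default the monotonic clock is used.
    """
    def __init__(self, ranks, timeout=1.0, now=None):
        now = time.monotonic() if now is None else now
        self.timeout = timeout
        self.last_seen = {r: now for r in ranks}

    def heartbeat(self, rank, now=None):
        self.last_seen[rank] = time.monotonic() if now is None else now

    def failed(self, now=None):
        # In a full FD/FP design, this list would next be broadcast
        # (failure propagation) so all ranks agree on the failed set.
        now = time.monotonic() if now is None else now
        return [r for r, t in self.last_seen.items() if now - t > self.timeout]
```

The timeout choice is the classic trade-off: short timeouts detect failures quickly but risk false positives on congested networks.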


Author(s):  
Mohammad Samadi Gharajeh

Grid systems and cloud servers are two distributed networks that deliver computing resources (e.g., file storage) to users' services via a large and often global network of computers. Virtualization technology can enhance the efficiency of these networks by dedicating the available resources to multiple execution environments. This chapter describes applications of virtualization technology in grid systems and cloud servers, presenting different aspects of virtualized networks from both systematic and pedagogical perspectives. Virtual machine abstraction virtualizes high-performance computing environments to increase service quality. Besides, grid virtualization engines and virtual clusters are used in grid systems to execute users' services efficiently in virtualized environments. The chapter also explains various virtualization technologies in cloud servers. The evaluation results analyze the performance of high-performance computing and virtualized grid systems in terms of bandwidth, latency, number of nodes, and throughput.


Author(s):  
ROBERT STEWART ◽  
PATRICK MAIER ◽  
PHIL TRINDER

Reliability is set to become a major concern on emergent large-scale architectures. While there are many parallel languages, and indeed many parallel functional languages, very few address reliability. The notable exception is the widely emulated Erlang distributed actor model that provides explicit supervision and recovery of actors with isolated state. We investigate scalable transparent fault tolerant functional computation with automatic supervision and recovery of tasks. We do so by developing HdpH-RS, a variant of the Haskell distributed parallel Haskell (HdpH) DSL with Reliable Scheduling. Extending the distributed work stealing protocol of HdpH for task supervision and recovery is challenging. To eliminate elusive concurrency bugs, we validate the HdpH-RS work stealing protocol using the SPIN model checker. HdpH-RS differs from the actor model in that its principal entities are tasks, i.e. independent stateless computations, rather than isolated stateful actors. Thanks to statelessness, fault recovery can be performed automatically and entirely hidden in the HdpH-RS runtime system. Statelessness is also key for proving a crucial property of the semantics of HdpH-RS: fault recovery does not change the result of the program, akin to deterministic parallelism. HdpH-RS provides a simple distributed fork/join-style programming model, with minimal exposure of fault tolerance at the language level, and a library of higher level abstractions such as algorithmic skeletons. In fact, the HdpH-RS DSL is exactly the same as the HdpH DSL, hence users can opt in or out of fault tolerant execution without any refactoring. Computations in HdpH-RS are always as reliable as the root node, no matter how many nodes and cores are actually used. We benchmark HdpH-RS on conventional clusters and a High Performance Computing platform: all benchmarks survive Chaos Monkey random fault injection; the system scales well, e.g. up to 1,400 cores on the High Performance Computing platform; reliability and recovery overheads are consistently low even at scale.
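The key insight that statelessness makes recovery transparent can be sketched in miniature (an illustrative model in Python, far simpler than HdpH-RS's distributed work stealing; the function names are assumptions): a supervisor re-runs any task whose execution fails, and because tasks are pure and deterministic, re-execution cannot change the program's result.

```python
from concurrent.futures import ThreadPoolExecutor

def supervised_map(task, inputs, workers=4, max_retries=3):
    """Run pure (stateless) tasks with automatic retry on failure.

    Since tasks are deterministic and side-effect free, recovery by
    re-execution is invisible to the caller, as in HdpH-RS semantics.
    """
    results = {}
    pending = {i: 0 for i in range(len(inputs))}    # index -> retry count
    with ThreadPoolExecutor(workers) as pool:
        while pending:
            futures = {i: pool.submit(task, inputs[i]) for i in pending}
            for i, fut in futures.items():
                try:
                    results[i] = fut.result()
                    del pending[i]
                except Exception:                    # simulated node failure
                    pending[i] += 1
                    if pending[i] > max_retries:
                        raise
    return [results[i] for i in range(len(inputs))]
```

A stateful actor could not be recovered this way without checkpointing its state; stateless tasks need only their (immutable) inputs.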


Author(s):  
B. Meroufel ◽  
G. Belalem

Fault tolerance is the ability of a system to perform its function correctly even in the presence of faults. Different fault tolerance techniques are therefore critical for improving the efficient utilization of expensive resources in high performance data grid systems. One of the most popular fault tolerance strategies is replication, which creates multiple copies of resources in the system and has been proved to be an effective way to achieve data availability and system reliability. In this paper the authors propose a new adaptive dynamic replication that combines replication based on availability with replication based on popularity. The authors' adaptive dynamic replication uses two types of replicas (primary and ordinary) and two types of placement nodes (best client and best responsible nodes) for the new replicas. In addition to replication, the authors use other strategies such as fault detection, fault prediction, dynamicity management, and self-stabilization. All these services are grouped in one fault tolerance box named Collaborative Services for Fault Tolerance (CSFT), which structures them in hierarchical services and organizes the relationships between them.
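The availability-plus-popularity combination can be sketched as a simple replication trigger (an illustrative model only; the thresholds and function names are assumptions, not the CSFT design): a file is replicated when it is frequently accessed or when the independent-failure availability of its current copies drops too low.

```python
def file_availability(node_avails):
    """Availability of a file replicated on independently failing nodes:
    1 minus the probability that every hosting node is down at once."""
    p_all_down = 1.0
    for a in node_avails:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down

def should_replicate(access_count, availability,
                     pop_threshold=10, avail_threshold=0.9):
    # Replicate for popularity (hot data) OR for availability (at-risk data);
    # combining both criteria mirrors the adaptive scheme described above.
    return access_count >= pop_threshold or availability < avail_threshold
```

For example, two copies on nodes that are each up 90% of the time already give 99% availability, so the popularity criterion, not availability, usually drives further replication of hot files.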


Author(s):  
David L Hart

TeraGrid has deployed a significant monitoring and accounting infrastructure in order to understand its operational success. In this paper, we present an analysis of the jobs reported by TeraGrid for 2008. We consider the workload from several perspectives: traditional high-performance computing (HPC) workload characteristics; grid-oriented workload characteristics; and finally user- and group-oriented characteristics. We use metrics reported in prior studies of HPC and grid systems in order to understand whether such metrics provide useful information for managing and studying resource federations. This study highlights the importance of distinguishing between analyses of job patterns and work patterns; it finds that small sets of users dominate the workload in terms of both job and work patterns, and that aggregate analyses across even loosely coupled federations, with incomplete information for individual systems, reflect patterns seen in more tightly coupled grids and in single HPC systems.


2013 ◽  
Vol 9 (3) ◽  
pp. 1091-1098 ◽  
Author(s):  
Sukalyan Goswami ◽  
Ajanta De Sarkar

Grid computing, or the computational grid, has become a vast research field in academia. It is a promising platform that provides resource sharing through multi-institutional virtual organizations for dynamic problem solving. Such platforms are much more cost-effective than traditional high performance computing systems. Due to the provision of scalability of resources, grid computing has these days become popular in industry as well. However, the computational grid has different constraints and requirements to those of traditional high performance computing systems. In order to fully exploit such grid systems, resource management and scheduling are key challenges, where issues of task allocation and load balancing represent a common problem for most grid systems, because the load scenarios of individual grid resources are dynamic in nature. The objective of this paper is to review different existing load balancing algorithms and techniques applicable in grid computing, and to propose a layered service-oriented framework for the computational grid to solve the prevailing problem of dynamic load balancing.
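The core task-allocation problem surveyed here can be illustrated with the simplest dynamic strategy (a greedy least-loaded baseline for comparison, not the paper's framework; the data layout is an assumption): each incoming task goes to the node with the smallest accumulated load, tracked with a min-heap.

```python
import heapq

def balance(tasks, nodes):
    """Greedy dynamic load balancing: assign each task, in arrival order,
    to the currently least-loaded node.

    tasks: list of (task_name, cost) pairs; nodes: list of node names.
    """
    heap = [(0.0, name) for name in nodes]   # (accumulated load, node)
    heapq.heapify(heap)
    assignment = {name: [] for name in nodes}
    for task, cost in tasks:
        load, name = heapq.heappop(heap)     # least-loaded node
        assignment[name].append(task)
        heapq.heappush(heap, (load + cost, name))
    return assignment
```

Real grid schedulers must additionally cope with stale load information and heterogeneous, dynamically joining resources, which is exactly why the layered, service-oriented approaches reviewed in the paper go beyond this greedy baseline.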

