Approaches for Parallel Applications Fault Tolerance

HADAB: Enabling Fault Tolerance in Parallel Applications Running in Distributed Environments

Parallel Processing and Applied Mathematics - Lecture Notes in Computer Science ◽

10.1007/978-3-642-31464-3_71 ◽

2012 ◽

pp. 700-709 ◽

Cited By ~ 14

Author(s):

Vania Boccia ◽

Luisa Carracciuolo ◽

Giuliano Laccetti ◽

Marco Lapegna ◽

Valeria Mele

Keyword(s):

Fault Tolerance ◽

Parallel Applications ◽

Distributed Environments

Providing fault tolerance through invasive computing

it - Information Technology ◽

10.1515/itit-2016-0022 ◽

2016 ◽

Vol 58 (6) ◽

Cited By ~ 2

Author(s):

Vahid Lari ◽

Andreas Weichslgartner ◽

Alexandru Tanase ◽

Michael Witterauf ◽

Faramarz Khosravi ◽

...

Keyword(s):

Fault Tolerance ◽

Fault Tolerant ◽

Parallel Applications ◽

Technology Scaling ◽

The Face ◽

Tightly Coupled ◽

On Chip ◽

Modular Redundancy ◽

Parallel Arrays ◽

Selection Of

Abstract: As a consequence of technology scaling, today's complex multi-processor systems have become increasingly susceptible to errors. To satisfy reliability requirements, such systems require methods to detect and tolerate errors. This entails two major challenges: (a) providing a comprehensive approach that ensures fault-tolerant execution of parallel applications across different types of resources, and (b) optimizing resource usage in the face of dynamic fault probabilities or varying fault tolerance needs of different applications. In this paper, we present a holistic and adaptive approach, based on invasive computing, that provides fault tolerance on a Multi-Processor System-on-a-Chip (MPSoC) on demand, according to application or environmental needs. We show how invasive computing may provide adaptive fault tolerance on a heterogeneous MPSoC including hardware accelerators and communication infrastructure such as a Network-on-Chip (NoC). In addition, we present (a) compile-time transformations to automatically adopt well-known redundancy schemes such as Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR) for fault-tolerant loop execution on a class of massively parallel processor arrays called Tightly Coupled Processor Arrays (TCPAs). Based on timing characteristics derived from our compilation flow, we further develop (b) a reliability analysis guiding the selection of a suitable degree of fault tolerance. Finally, we present (c) a methodology to detect and adaptively mitigate faults in invasive NoCs.
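The Triple Modular Redundancy scheme named in this abstract can be illustrated with a minimal sketch: each loop iteration is executed three times and a majority vote masks a single faulty result. This is only an illustration of the general TMR idea, not the compile-time transformation described in the paper; the names `tmr_vote`, `run_tmr`, `square`, and the `faulty_replica` parameter are hypothetical, and the fault is simulated in software.

```python
from collections import Counter


def tmr_vote(results):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        # All three replicas disagree: the fault cannot be masked.
        raise RuntimeError("no majority among replicas")
    return value


def run_tmr(kernel, x, faulty_replica=None):
    """Run `kernel(x)` three times and vote on the results.

    `faulty_replica` (hypothetical, for illustration only) selects one
    replica whose output is corrupted to simulate a transient fault.
    """
    results = []
    for replica in range(3):
        y = kernel(x)
        if replica == faulty_replica:
            y = y + 1  # simulated corruption of this replica's output
        results.append(y)
    return tmr_vote(results)


def square(x):
    return x * x


# A loop whose every iteration is protected by TMR; replica 1 is faulty.
outputs = [run_tmr(square, i, faulty_replica=1) for i in range(4)]
print(outputs)
```

With a single corrupted replica the vote still returns the correct value for every iteration. Dual Modular Redundancy would run two replicas instead and could only detect, not mask, a disagreement.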

Performance evaluation of fault tolerance for parallel applications in networked environments

Proceedings of the 1997 International Conference on Parallel Processing (Cat No 97TB100162) ICPP-97 ◽

10.1109/icpp.1997.622663 ◽

2002 ◽

Author(s):

P. Sens ◽

B. Folliot

Keyword(s):

Performance Evaluation ◽

Fault Tolerance ◽

Parallel Applications

Fault tolerance for parallel applications through replication

Proceedings of ICICS, 1997 International Conference on Information, Communications and Signal Processing. Theme: Trends in Information Systems Engineering and Wireless Multimedia Communications (Cat. No.97TH8237) ◽

10.1109/icics.1997.652234 ◽

2002 ◽

Author(s):

Kam Hong Shum

Keyword(s):

Fault Tolerance ◽

Parallel Applications

Fault Tolerance Techniques for Distributed, Parallel Applications

Innovative Research and Applications in Next-Generation High Performance Computing - Advances in Systems Analysis, Software Engineering, and High Performance Computing ◽

10.4018/978-1-5225-0287-6.ch009 ◽

2016 ◽

pp. 221-252

Author(s):

Camille Coti

Keyword(s):

Fault Tolerance ◽

High Performance Computing ◽

High Performance ◽

Fault Tolerant ◽

Distributed Applications ◽

Parallel Applications ◽

Rollback Recovery ◽

Tolerance Mechanisms ◽

Performance Computing

This chapter gives an overview of techniques used to tolerate failures in high-performance distributed applications. We describe basic replication techniques, automatic rollback recovery and application-based fault tolerance. We present the challenges raised specifically by distributed, high performance computing and the performance overhead the fault tolerance mechanisms are likely to cost. Last, we give an example of a fault-tolerant algorithm that exploits specific properties of a recent algorithm.

Providing fault tolerance in extreme scale parallel applications

Proceedings of the first annual workshop on High performance computing meets databases - HPCDB '11 ◽

10.1145/2125636.2125639 ◽

2011 ◽

Cited By ~ 1

Author(s):

Hubertus Johannes Jacobus van Dam ◽

Abhinav Vishnu ◽

Wibe A. de Jong

Keyword(s):

Fault Tolerance ◽

Parallel Applications ◽

Extreme Scale

H-RADIC: A Fault Tolerance Framework for Virtual Clusters on Multi-Cloud Environments

Journal of Computer Science and Technology ◽

10.24215/16666038.18.e24 ◽

2018 ◽

Vol 18 (03) ◽

pp. e24

Author(s):

Ambrosio Royo ◽

Jorge Villamayor ◽

Marcela Castro-León ◽

Dolores Rexachs ◽

Emilio Luque

Keyword(s):

Fault Tolerance ◽

Critical Path ◽

Parallel Applications ◽

Cloud Platform ◽

Parallel Application ◽

Virtual Clusters ◽

Cloud Environments ◽

A Site ◽

Multi Cloud

Even though the cloud platform promises to be reliable, several availability incidents prove that it is not. How can we be sure that a parallel application finishes its execution even if a site is affected by a failure? This paper presents H-RADIC, an approach based on the RADIC architecture, that executes parallel applications protected by RADIC in at least 3 different virtual clusters or sites. The execution state of each site is saved periodically in another site and is recovered in case of failure. The paper details the configuration of the architecture and the experimental results using 3 clusters running NAS parallel applications protected with DMTCP, a well-known distributed multi-threaded checkpointing tool. Our experiments show that by adding a cluster protector it will be possible to implement the next level in the hierarchy, where the first level in the RADIC hierarchy works as an observer at the site level. In addition, the experiments showed that the protection implementation is outside the critical path of the application and that it depends on the resources used.
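The mechanism summarized in this abstract, saving the execution state periodically at another site and restoring it after a failure, can be sketched as follows. This is a toy in-memory model under stated assumptions: it does not use DMTCP or the RADIC protocol, and the `Site` class, `run_with_checkpoints` function, and `fail_at` parameter are hypothetical names introduced only for illustration.

```python
import copy


class Site:
    """Hypothetical stand-in for a remote site that stores checkpoints."""

    def __init__(self, name):
        self.name = name
        self.stored = {}  # checkpoints held on behalf of other sites


def run_with_checkpoints(steps, interval, protector, fail_at=None):
    """Sum 0..steps-1, checkpointing to `protector` every `interval` steps.

    If `fail_at` is given, a single failure is simulated just before that
    step, and the computation resumes from the last stored checkpoint.
    """
    state = {"step": 0, "total": 0}
    protector.stored["app"] = copy.deepcopy(state)  # initial checkpoint
    failed = False
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at and not failed:
            failed = True
            # Simulated failure: local state is lost, roll back to the
            # last checkpoint held by the protector site.
            state = copy.deepcopy(protector.stored["app"])
            continue
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            protector.stored["app"] = copy.deepcopy(state)
    return state["total"]


protector_site = Site("site-B")
result = run_with_checkpoints(10, 3, protector_site, fail_at=7)
print(result)
```

The work done between the last checkpoint and the failure is recomputed, so the final result matches a failure-free run; a shorter checkpoint interval reduces the recomputed work at the cost of more frequent state transfers.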

A Fault-Tolerance Protocol for Parallel Applications with Communication Imbalance

2015 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) ◽

10.1109/sbac-pad.2015.25 ◽

2015 ◽

Author(s):

Esteban Meneses ◽

Laxmikant V. Kale

Keyword(s):

Fault Tolerance ◽

Parallel Applications

A framework backbone for software fault tolerance in embedded parallel applications

Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99 ◽

10.1109/empdp.1999.746666 ◽

1999 ◽

Cited By ~ 3

Author(s):

G. Deconinck ◽

M. Truyens ◽

V. De Florio ◽

W. Rosseel ◽

R. Lauwereins ◽

...

Keyword(s):

Fault Tolerance ◽

Parallel Applications ◽

Software Fault Tolerance ◽

Software Fault

A Review on Load Balancing Model Using Best Partition Technique

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse.v7i8.69 ◽

2017 ◽

Vol 7 (8) ◽

pp. 284

Author(s):

M. Chaitanya ◽

K. Durga Charan

Keyword(s):

Cloud Computing ◽

Fault Tolerance ◽

Load Balancing ◽

Load Balance ◽

Large Impact ◽

Cloud Environment ◽

Public Cloud ◽

The Public ◽

Partition Technique ◽

Textual Content

Load balancing makes cloud computing more efficient and can increase user satisfaction. At present, cloud computing is one of the leading platforms, offering data storage at very low cost and available at all times over the internet. However, it still faces critical problems such as security, load management, and fault tolerance. Load balancing in the cloud computing environment has a large impact on performance. The algorithm applies game theory to the load balancing strategy to improve efficiency in the public cloud environment. This article presents an improved load balancing model for the public cloud, based on the cloud partitioning concept, with a switch mechanism to choose different strategies for different situations.
