Approaches for Parallel Applications Fault Tolerance

Author(s):  
Richard L. Graham
2016 ◽  
Vol 58 (6) ◽  
Author(s):  
Vahid Lari ◽  
Andreas Weichslgartner ◽  
Alexandru Tanase ◽  
Michael Witterauf ◽  
Faramarz Khosravi ◽  
...  

AbstractAs a consequence of technology scaling, today's complex multi-processor systems have become more and more susceptible to errors. In order to satisfy reliability requirements, such systems require methods to detect and tolerate errors. This entails two major challenges: (a) providing a comprehensive approach that ensures fault-tolerant execution of parallel applications across different types of resources, and (b) optimizing resource usage in the face of dynamic fault probabilities or with varying fault tolerance needs of different applications. In this paper, we present a holistic and adaptive approach to provide fault tolerance on Multi-Processor System-on-a-Chip (MPSoC) on demand of an application or environmental needs based on invasive computing. We show how invasive computing may provide adaptive fault tolerance on a heterogeneous MPSoC including hardware accelerators and communication infrastructure such as a Network-on-Chip (NoC). In addition, we present (a) compile-time transformations to automatically adopt well-known redundancy schemes such as Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR) for fault-tolerant loop execution on a class of massively parallel arrays of processors called as Tightly Coupled Processor Arrays (). Based on timing characteristics derived from our compilation flow, we further develop (b) a reliability analysis guiding the selection of a suitable degree of fault tolerance. Finally, we present (c) a methodology to detect and adaptively mitigate faults in invasive NoCs.


Author(s):  
Camille Coti

This chapter gives an overview of techniques used to tolerate failures in high-performance distributed applications. We describe basic replication techniques, automatic rollback recovery and application-based fault tolerance. We present the challenges raised specifically by distributed, high performance computing and the performance overhead the fault tolerance mechanisms are likely to cost. Last, we give an example of a fault-tolerant algorithm that exploits specific properties of a recent algorithm.


2018 ◽  
Vol 18 (03) ◽  
pp. e24
Author(s):  
Ambrosio Royo ◽  
Jorge Villamayor ◽  
Marcela Castro-León ◽  
Dolores Rexachs ◽  
Emilio Luque

Even though the cloud platform promises to be reliable, several availability incidents prove that it is not. How can we be sure that a parallel application finishes it´s execution even if a site is affected by a failure? This paper presents H-RADIC, an approach based on RADIC architecture, that executes parallel applications protected by RADIC in at least 3 different virtual clusters or sites. The execution state of each site is saved periodically in another site and it is recovered in case of failure. The paper details the configuration of the architecture and the experiment´s results using 3 clusters running NAS parallel applications protected with DMTCP, a very well-known distributed multi-threaded checkpoint tool. Our experiments show that by adding a cluster protector it will be possible to implement the next level in the hierarchy, where the first level in the RADIC hierarchy works as an observer at a site level. In adition, the experiments showed that the protection implementation is out of the critical path of the application and it depends on the utilized resources.


Author(s):  
M. Chaitanya ◽  
K. Durga Charan

Load balancing makes cloud computing greater knowledgeable and could increase client pleasure. At reward cloud computing is among the all most systems which offer garage of expertise in very lowers charge and available all the time over the net. However, it has extra vital hassle like security, load administration and fault tolerance. Load balancing inside the cloud computing surroundings has a large impact at the presentation. The set of regulations relates the sport idea to the load balancing manner to amplify the abilties in the public cloud environment. This textual content pronounces an extended load balance mannequin for the majority cloud concentrated on the cloud segregating proposal with a swap mechanism to select specific strategies for great occasions.


Sign in / Sign up

Export Citation Format

Share Document