JENERGY: A Fault Tolerant Stateless Architecture for High Performance Computing

Algorithm Based Fault Tolerant and Check Pointing for High Performance Computing Systems

Journal of Applied Sciences ◽

10.3923/jas.2009.3947.3956 ◽

2009 ◽

Vol 9 (22) ◽

pp. 3947-3956 ◽

Cited By ~ 8

Author(s):

Hodjatollah Hamidi ◽

A. Vafaei ◽

A.H. Monadjemi

Keyword(s):

High Performance Computing ◽

High Performance ◽

Fault Tolerant ◽

Computing Systems ◽

Performance Computing

FT-PBLAS: PBLAS-Based Fault-Tolerant Linear Algebra Computation on High-performance Computing Systems

IEEE Access ◽

10.1109/access.2020.2975832 ◽

2020 ◽

Vol 8 ◽

pp. 42674-42688

Author(s):

Yanchao Zhu ◽

Yi Liu ◽

Guozhen Zhang

Keyword(s):

High Performance Computing ◽

Linear Algebra ◽

High Performance ◽

Fault Tolerant ◽

Computing Systems ◽

Performance Computing ◽

Algebra Computation

Transparent fault tolerance for scalable functional computation

Journal of Functional Programming ◽

10.1017/s095679681600006x ◽

2016 ◽

Vol 26 ◽

Cited By ~ 2

Author(s):

ROBERT STEWART ◽

PATRICK MAIER ◽

PHIL TRINDER

Keyword(s):

Fault Tolerance ◽

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Programming Model ◽

Fault Tolerant ◽

Fault Recovery ◽

Actor Model ◽

Work Stealing ◽

Performance Computing

Reliability is set to become a major concern on emergent large-scale architectures. While there are many parallel languages, and indeed many parallel functional languages, very few address reliability. The notable exception is the widely emulated Erlang distributed actor model, which provides explicit supervision and recovery of actors with isolated state. We investigate scalable, transparent fault-tolerant functional computation with automatic supervision and recovery of tasks. We do so by developing HdpH-RS, a variant of the Haskell distributed parallel Haskell (HdpH) DSL with Reliable Scheduling. Extending the distributed work stealing protocol of HdpH for task supervision and recovery is challenging. To eliminate elusive concurrency bugs, we validate the HdpH-RS work stealing protocol using the SPIN model checker. HdpH-RS differs from the actor model in that its principal entities are tasks, i.e. independent stateless computations, rather than isolated stateful actors. Thanks to statelessness, fault recovery can be performed automatically and entirely hidden in the HdpH-RS runtime system. Statelessness is also key for proving a crucial property of the semantics of HdpH-RS: fault recovery does not change the result of the program, akin to deterministic parallelism. HdpH-RS provides a simple distributed fork/join-style programming model, with minimal exposure of fault tolerance at the language level, and a library of higher level abstractions such as algorithmic skeletons. In fact, the HdpH-RS DSL is exactly the same as the HdpH DSL, hence users can opt in or out of fault tolerant execution without any refactoring. Computations in HdpH-RS are always as reliable as the root node, no matter how many nodes and cores are actually used. We benchmark HdpH-RS on conventional clusters and a High Performance Computing platform: all benchmarks survive Chaos Monkey random fault injection; the system scales well, e.g. up to 1,400 cores on the High Performance Computing platform; reliability and recovery overheads are consistently low even at scale.
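
The role statelessness plays in this abstract can be illustrated with a small sketch. The Python below is not HdpH-RS and does not use its API; it is a sequential simulation under stated assumptions, in which the names `NodeFailure`, `run_on_flaky_node`, and `supervised_map` are hypothetical and node loss is modelled as a random exception. It shows only the core idea: because a task is a pure function of its input, a supervisor can replay it after a failure without changing the result.

```python
import random


class NodeFailure(Exception):
    """Simulated loss of the node that was running a task (hypothetical)."""


def run_on_flaky_node(task, arg, rng, failure_rate):
    """Run a stateless task on a simulated node that may die before finishing."""
    if rng.random() < failure_rate:
        raise NodeFailure()
    return task(arg)


def supervised_map(task, args, failure_rate=0.3, seed=0, max_attempts=1000):
    """Apply task to each arg, re-executing any task whose node failed.

    Returns the list of results and the number of replays that were needed.
    """
    rng = random.Random(seed)
    results = []
    retries = 0
    for arg in args:
        for _ in range(max_attempts):
            try:
                results.append(run_on_flaky_node(task, arg, rng, failure_rate))
                break
            except NodeFailure:
                retries += 1  # safe to replay: the task holds no state
        else:
            raise RuntimeError("task failed on every attempt")
    return results, retries


def square(x):
    return x * x


# Same computation with and without injected failures.
reliable, _ = supervised_map(square, range(10), failure_rate=0.0)
faulty, retries = supervised_map(square, range(10), failure_rate=0.5, seed=42)
```

Both runs yield the same list of squares; only the number of replays differs. A stateful actor could not be replayed this way without first restoring its state, which is the contrast the abstract draws with the actor model.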

Optimizing Checkpoint Restart with Data Deduplication

Scientific Programming ◽

10.1155/2016/9315493 ◽

2016 ◽

Vol 2016 ◽

pp. 1-11 ◽

Cited By ~ 6

Author(s):

Zhengyu Chen ◽

Jianhua Sun ◽

Hao Chen

Keyword(s):

Detailed Analysis ◽

High Performance Computing ◽

High Performance ◽

Fault Tolerant ◽

Data Deduplication ◽

Software Faults ◽

Distributed Programs ◽

Computing Systems ◽

Redundancy Elimination ◽

Performance Computing

The increasing scale of computer systems, in both size and complexity, brings more frequent hardware and software faults; fault-tolerant techniques have therefore become an essential component of high-performance computing systems. Checkpoint restart is a typical and widely used method for tolerating runtime faults. However, the exploding size of the checkpoint files that must be saved to external storage poses a major scalability challenge, necessitating efficient approaches to reducing the amount of checkpointing data. In this paper, we first motivate the need for redundancy elimination with a detailed analysis of checkpoint data from real scenarios. Based on the analysis, we apply inline data deduplication to reduce checkpoint size. We use DMTCP, an open-source checkpoint restart package, to validate our method. Our experiment shows that our method reduces checkpoint file size by 20% for single-computer programs and by 47% for distributed programs.
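
As a rough illustration of how deduplication shrinks checkpoint data, the sketch below splits a checkpoint image into fixed-size chunks and stores each distinct chunk once, keyed by its SHA-256 digest. The abstract does not describe the paper's chunking scheme or its integration with DMTCP, so the fixed 4096-byte chunks, the in-memory `store` dictionary, and the function names are all assumptions made for this example.

```python
import hashlib


def dedup_checkpoint(data: bytes, store: dict, chunk_size: int = 4096) -> list:
    """Split data into fixed-size chunks and keep one copy of each distinct chunk.

    Returns a "recipe": the ordered list of chunk digests needed to rebuild data.
    """
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # store the chunk only if it is new
        recipe.append(digest)
    return recipe


def restore_checkpoint(recipe: list, store: dict) -> bytes:
    """Rebuild a checkpoint image from its recipe and the shared chunk store."""
    return b"".join(store[digest] for digest in recipe)


# Two synthetic checkpoints: three identical zero pages plus one distinct page each.
store = {}
ckpt1 = bytes(4096) * 3 + b"a" * 4096
ckpt2 = bytes(4096) * 3 + b"b" * 4096
recipe1 = dedup_checkpoint(ckpt1, store)
recipe2 = dedup_checkpoint(ckpt2, store)

raw_bytes = len(ckpt1) + len(ckpt2)
stored_bytes = sum(len(chunk) for chunk in store.values())
```

In this synthetic case the eight chunks collapse to three stored chunks, because the zero pages repeat both within and across the two checkpoints. Real savings depend on how much redundancy the checkpoint data actually contains.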

nD-RAPID: a multidimensional scalable fault-tolerant optoelectronic interconnection for high-performance computing systems

Journal of Optical Networking ◽

10.1364/jon.6.000465 ◽

2007 ◽

Vol 6 (5) ◽

pp. 465 ◽

Cited By ~ 5

Author(s):

Chander Kochar ◽

Avinash Kodi ◽

Ahmed Louri

Keyword(s):

High Performance Computing ◽

High Performance ◽

Fault Tolerant ◽

Computing Systems ◽

Performance Computing

Fault Tolerance Techniques for Distributed, Parallel Applications

Innovative Research and Applications in Next-Generation High Performance Computing - Advances in Systems Analysis, Software Engineering, and High Performance Computing ◽

10.4018/978-1-5225-0287-6.ch009 ◽

2016 ◽

pp. 221-252

Author(s):

Camille Coti

Keyword(s):

Fault Tolerance ◽

High Performance Computing ◽

High Performance ◽

Fault Tolerant ◽

Distributed Applications ◽

Parallel Applications ◽

Rollback Recovery ◽

Tolerance Mechanisms ◽

Performance Computing

This chapter gives an overview of techniques used to tolerate failures in high-performance distributed applications. We describe basic replication techniques, automatic rollback recovery and application-based fault tolerance. We present the challenges raised specifically by distributed, high performance computing and the performance overhead that fault tolerance mechanisms are likely to incur. Finally, we give an example of a fault-tolerant algorithm that exploits specific properties of a recent algorithm.

Fault tolerant high performance computing by a coding approach

Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '05 ◽

10.1145/1065944.1065973 ◽

2005 ◽

Cited By ~ 38

Author(s):

Zizhong Chen ◽

Graham E. Fagg ◽

Edgar Gabriel ◽

Julien Langou ◽

Thara Angskun ◽

...

Keyword(s):

High Performance Computing ◽

High Performance ◽

Fault Tolerant ◽

Performance Computing

High Performance Computing Systems with Various Checkpointing Schemes

International Journal of Computers Communications & Control ◽

10.15837/ijccc.2009.4.2455 ◽

2009 ◽

Vol 4 (4) ◽

pp. 386 ◽

Cited By ~ 10

Author(s):

Nichamon Naksinehaboon ◽

Mihaela Păun ◽

Raja Nassar ◽

Chokchai Box Leangsuksun ◽

Stephen Scott

Keyword(s):

High Performance Computing ◽

Systems Analysis ◽

Completion Time ◽

High Performance ◽

Fault Tolerant ◽

Computing Time ◽

Recovery Period ◽

Computing Systems ◽

Additional Costs ◽

Performance Computing

Finding the failure rate of a system is a crucial step in high performance computing systems analysis. To deal with this problem, a fault tolerant mechanism called the checkpoint/restart technique was introduced. However, this mechanism incurs additional costs. We therefore propose two models for different schemes (full and incremental checkpoint schemes). The models, which are based on the reliability of the system, are used to determine the checkpoint placements. Both proposed models balance checkpoint overhead against re-computing time. Because each incremental checkpoint adds extra cost during the recovery period, we give a method for finding the number of incremental checkpoints between two consecutive full checkpoints. Our simulation suggests that in most cases the incremental checkpoint model reduces waste time more than the full checkpoint model does. The waste times produced by both models range from 2% to 28% of the application completion time, depending on the checkpoint overheads.
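
The balance between checkpoint overhead and re-computing time can be made concrete with a classical first-order model. The sketch below uses Young's approximation for the checkpoint interval, which is a swapped-in textbook technique and not either of the models proposed in the paper. It assumes a constant checkpoint cost, failures arriving at a fixed mean time between failures (MTBF), and that a failure loses half an interval of work on average; the function names and the example figures are illustrative only.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimal interval between checkpoints: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def waste_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order fraction of time lost to checkpointing and re-computation.

    checkpoint_cost / interval is the overhead of writing checkpoints;
    interval / (2 * mtbf) is the expected work redone after failures.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


# Illustrative figures (assumptions): 60 s per checkpoint, one failure per day.
cost = 60.0
mtbf = 24 * 3600.0

best = young_interval(cost, mtbf)
waste_at_best = waste_fraction(best, cost, mtbf)
waste_too_often = waste_fraction(best / 2, cost, mtbf)
waste_too_rarely = waste_fraction(best * 2, cost, mtbf)
```

Checkpointing more often than `best` spends more time writing checkpoints, while checkpointing less often loses more work per failure, so both alternatives give a higher waste fraction under this model.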

Towards Fault-Tolerant Energy-Efficient High Performance Computing in the Cloud

2012 IEEE International Conference on Cluster Computing ◽

10.1109/cluster.2012.74 ◽

2012 ◽

Cited By ~ 7

Author(s):

Kurt L. Keville ◽

Rohan Garg ◽

David J. Yates ◽

Kapil Arya ◽

Gene Cooperman

Keyword(s):

High Performance Computing ◽

Energy Efficient ◽

High Performance ◽

Fault Tolerant ◽

Performance Computing

Fault Tolerant Communication in Embedded Parallel High Performance Computing

Parallel Computational Fluid Dynamics 1998 ◽

10.1016/b978-044482850-7/50110-8 ◽

1999 ◽

pp. 405-414

Author(s):

G. Efthivoulidis ◽

E. Verentziotis ◽

A. Meliones ◽

T. Varvarigou ◽

A. Kontizas ◽

...

Keyword(s):

High Performance Computing ◽

High Performance ◽

Fault Tolerant ◽

Performance Computing
