Analysis and Evaluation of a New Algorithm Based Fault Tolerance for Computing Systems

2012 ◽  
Vol 4 (1) ◽  
pp. 37-51 ◽  
Author(s):  
Hodjat Hamidi ◽  
Abbas Vafaei ◽  
Seyed Amir Hassan Monadjemi

In this paper, the authors present a new approach to algorithm-based fault tolerance (ABFT) for high-performance computing systems. The ABFT approach transforms a system that does not tolerate a specific type of fault, called the fault-intolerant system, into a system that provides a specific level of fault tolerance, namely recovery. ABFT techniques that detect errors rely on comparing parity values computed in two ways: the parallel processing of input parity values produces output parity values that are comparable with parity values regenerated from the original processed outputs. Convolutional codes can supply the redundancy. This method is a new approach to concurrent error correction in fault-tolerant computing systems. The paper proposes a novel computing paradigm to provide fault tolerance for numerical algorithms. The authors also present, implement, and evaluate early detection in ABFT.

Author(s):  
Hodjatollah Hamidi

The Algorithm-Based Fault Tolerance (ABFT) approach transforms a system that does not tolerate a specific type of fault, called the fault-intolerant system, into a system that provides a specific level of fault tolerance, namely recovery. The ABFT philosophy leads directly to a model from which error correction can be developed. By employing an ABFT scheme with an effective convolutional code, the design allows high throughput as well as high fault coverage. ABFT techniques that detect errors rely on comparing parity values computed in two ways: the parallel processing of input parity values produces output parity values that are comparable with parity values regenerated from the original processed outputs. Convolutional codes can supply the redundancy. This method is a new approach to concurrent error correction in fault-tolerant computing systems. This chapter proposes a novel computing paradigm to provide fault tolerance for numerical algorithms. The authors also present, implement, and evaluate early detection in ABFT.
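
The two-way parity comparison described above can be illustrated with a minimal sketch. It does not use the convolutional codes the authors propose; it substitutes a plain checksum (sum) code on a matrix-vector product, which is an assumption made only to keep the example short. The function names `matvec`, `column_checksum`, and `abft_matvec`, and the fault-injection parameters, are hypothetical.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(a: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product y = A x (the unprotected computation)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def column_checksum(a: Matrix) -> Vector:
    """Input parity: one extra row holding the sum of each column of A."""
    return [sum(col) for col in zip(*a)]


def abft_matvec(
    a: Matrix,
    x: Vector,
    inject_fault_at: Optional[int] = None,  # hypothetical test hook
    fault_value: float = 1.0,
    tol: float = 1e-9,
) -> Tuple[Vector, bool]:
    """Return (y, error_detected) using a sum-checksum comparison."""
    y = matvec(a, x)
    if inject_fault_at is not None:
        # Simulate a transient fault corrupting one output element.
        y[inject_fault_at] += fault_value

    # Way 1: process the input parity through the same operation.
    parity_from_inputs = sum(c * x_j for c, x_j in zip(column_checksum(a), x))
    # Way 2: regenerate the parity from the computed outputs.
    parity_from_outputs = sum(y)

    # A mismatch beyond rounding tolerance signals an error.
    scale = max(1.0, abs(parity_from_inputs))
    error_detected = abs(parity_from_inputs - parity_from_outputs) > tol * scale
    return y, error_detected
```

With `a = [[1.0, 2.0], [3.0, 4.0]]` and `x = [1.0, 1.0]`, both parities equal 10 and no error is flagged; corrupting either output element makes the two parities disagree. A single sum checksum only detects the error, whereas locating and correcting it requires additional redundancy of the kind the abstract attributes to convolutional codes.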


Author(s):  
K. I. Volovich ◽  
S. A. Denisov ◽  
S. I. Malkovsky

The article addresses the problem of solving scientific problems on high-performance computing systems. One approach to certain problems in materials science is mathematical modeling implemented by specialized modeling systems. A modeling system is most efficient when deployed on hybrid high-performance computing systems (HHPC), which offer high performance and allow problems to be solved in an acceptable time with sufficient accuracy. However, several limitations affect how a research team works with modeling systems in the HHPC computing environment: the need to access graphics accelerators while developing and debugging algorithms in the modeling system, the need to use several modeling systems to obtain the optimal solution, and the need to change modeling-system settings dynamically while solving problems. These limitations are addressed by an individual modeling environment that runs in the HHPC computing environment. The optimal technology for creating an individual modeling environment is virtual containerization. An algorithm is proposed for forming an individual modeling environment in a hybrid high-performance computing complex based on the Docker virtual containerization system. An individual modeling environment is created by installing the necessary software in the base container, setting environment variables, and installing custom software and licenses. A feature of the algorithm is the ability to form a library image from a base container with a customized individual modeling environment. In conclusion, directions for further research are indicated. The algorithm presented in the article is independent of the implementation of the job management system and can be used for any high-performance computing system.
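
The environment-formation steps listed above (base container, required software, environment variables, custom software and licenses) can be sketched as a small generator of Dockerfile text. This is an illustration under stated assumptions, not the authors' algorithm: the function `render_dockerfile`, its parameters, and the use of `apt-get` (which presumes a Debian-based base image) are all hypothetical.

```python
from typing import Dict, List


def render_dockerfile(
    base_image: str,
    packages: List[str],
    env: Dict[str, str],
    custom_commands: List[str],
    license_files: Dict[str, str],
) -> str:
    """Render Dockerfile text for an individual modeling environment.

    All argument names are illustrative; none come from the article.
    """
    lines = [f"FROM {base_image}"]  # step 1: start from the base container

    if packages:
        # Step 2: install the necessary software.
        # Assumption: a Debian-based image, hence apt-get.
        lines.append(
            "RUN apt-get update && apt-get install -y " + " ".join(packages)
        )

    for name, value in env.items():
        # Step 3: set environment variables.
        lines.append(f'ENV {name}="{value}"')

    for command in custom_commands:
        # Step 4: install custom software.
        lines.append(f"RUN {command}")

    for source, destination in license_files.items():
        # Step 5: copy license files into the image.
        lines.append(f"COPY {source} {destination}")

    return "\n".join(lines) + "\n"
```

Writing the returned text to a file named `Dockerfile` and running `docker build -t <tag> .` in that directory would produce a reusable image, which is one plausible reading of the "library image" mentioned in the abstract.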


SPIN ◽  
2013 ◽  
Vol 03 (04) ◽  
pp. 1340012 ◽  
Author(s):  
HAO MENG ◽  
GUCHANG HAN

High-performance computing system design based on complementary metal oxide semiconductor (CMOS) technology faces growing challenges due to volatility, increased leakage current, and interconnection delay. Computation using magnetic logic devices has attracted considerable interest as a potential alternative because of its nonvolatility, reconfigurability, unlimited endurance, and low power consumption. Instead of using electron charges, a magnetic logic device stores and processes data by controlling spins, i.e., the magnetization states in the device. The emerging technologies related to magnetic logic comprise three main design schemes: magnetoresistive logic, magnetic quantum cellular automata, and magnetic domain wall logic. This paper illustrates the principles and reviews the recent developments of these magnetic logic devices. Challenges and prospects for future development are also discussed.


Author(s):  
Apolinar Velarde Martinez

Increasingly complex algorithms for modeling and solving the problems currently facing humanity have created new data processing requirements and, consequently, a need for high-performance computing systems. Because of the high cost of this type of equipment, which an educational institution cannot acquire, it is necessary to develop and implement computing architectures that are economical and scalable to build, such as heterogeneous distributed computing systems composed of several clusters of multicore processing elements with shared and distributed memory systems. This paper presents the analysis, design, and implementation of a high-performance computing system called Liebres InTELigentes, whose purpose is the design and execution of intrinsically parallel algorithms that require large amounts of storage and long processing times. The proposed system is built from conventional computing equipment (desktop computers, laptops, and servers) linked by a high-speed network. The main objective of this research is to build technology for scientific and educational research.


2019 ◽  
Author(s):  
Weiming Hu ◽  
Guido Cervone ◽  
Vivek Balasubramanian ◽  
Matteo Turilli ◽  
Shantenu Jha

Author(s):  
Simon McIntosh-Smith ◽  
Rob Hunt ◽  
James Price ◽  
Alex Warwick Vesztrocy

High-performance computing systems continue to increase in size in the quest for ever higher performance. The resulting increase in electronic component count, coupled with the decrease in feature sizes of the silicon manufacturing processes used to build these components, may make future exascale systems more susceptible to soft errors caused by cosmic radiation than current high-performance computing systems. Through the use of techniques such as hardware-based error-correcting codes and checkpoint-restart, many of these faults can be mitigated at the cost of increased hardware overhead, run-time, and energy consumption that can be as much as 10–20%. Some predictions expect these overheads to continue to grow over time. For extreme-scale systems, these overheads will represent megawatts of power consumption and millions of dollars of additional hardware costs, which could potentially be avoided with more sophisticated fault-tolerance techniques. In this paper we present new software-based fault tolerance techniques that can be applied to one of the most important classes of software in high-performance computing: iterative sparse matrix solvers. Our new techniques enable us to exploit knowledge of the structure of sparse matrices in such a way as to improve the performance, energy efficiency, and fault tolerance of the overall solution.


2017 ◽  
Vol 33 (2) ◽  
pp. 119-130
Author(s):  
Vinh Van Le ◽  
Hoai Van Tran ◽  
Hieu Ngoc Duong ◽  
Giang Xuan Bui ◽  
Lang Van Tran

Metagenomics is a powerful approach to studying environmental samples that does not require the isolation and cultivation of individual organisms. One of the essential tasks in a metagenomic project is to identify the origin of reads, referred to as taxonomic assignment. Because each metagenomic project has to analyze large-scale datasets, taxonomic assignment is highly computation-intensive. This study proposes a parallel algorithm for the taxonomic assignment problem, called SeMetaPL, which aims to address this computational challenge. The proposed algorithm is evaluated with both simulated and real datasets on a high-performance computing system. Experimental results demonstrate that the algorithm achieves good performance and uses the system's resources efficiently. The software implementing the algorithm and all test datasets can be downloaded at http://it.hcmute.edu.vn/bioinfo/metapro/SeMetaPL.html.

