Adaptive Erasure Coded Fault Tolerant Linear System Solver

Xuejiao Kang; David F. Gleich; Ahmed Sameh; Ananth Grama

doi:10.1145/3490557

ACM Transactions on Parallel Computing ◽

10.1145/3490557 ◽

2021 ◽

Vol 8 (4) ◽

pp. 1-19

Author(s):

Xuejiao Kang ◽

David F. Gleich ◽

Ahmed Sameh ◽

Ananth Grama

Keyword(s):

Fault Tolerance ◽

Linear System ◽

Fault Tolerant ◽

Problem Instance ◽

Minimal Amount ◽

True Solution ◽

Good Convergence ◽

Redundant Data ◽

Augmentation Techniques ◽

Parallel Formulation

As parallel and distributed systems scale, fault tolerance is an increasingly important problem—particularly on systems with limited I/O capacity and bandwidth. Erasure coded computations address this problem by augmenting a given problem instance with redundant data and then solving the augmented problem in a fault oblivious manner in a faulty parallel environment. In the event of faults, a computationally inexpensive procedure is used to compute the true solution from a potentially fault-prone solution. These techniques are significantly more efficient than conventional solutions to the fault tolerance problem. In this article, we show how we can minimize, to optimality, the overhead associated with our problem augmentation techniques for linear system solvers. Specifically, we present a technique that adaptively augments the problem only when faults are detected. At any point in execution, we only solve a system whose size is identical to the original input system. This has several advantages in terms of maintaining the size and conditioning of the system, as well as in only adding the minimal amount of computation needed to tolerate observed faults. We present, in detail, the augmentation process, the parallel formulation, and evaluation of performance of our technique. Specifically, we show that the proposed adaptive fault tolerance mechanism has minimal overhead in terms of FLOP counts with respect to the original solver executing in a non-faulty environment, has good convergence properties, and yields excellent parallel performance. We also demonstrate that our approach significantly outperforms an optimized application-level checkpointing scheme that only checkpoints needed data structures.
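The augmentation idea in this abstract can be illustrated with a small worked example. This is a minimal sketch of the general erasure-coding idea only, not the authors' adaptive scheme: the encoding matrix `E`, the helper names, and the dense exact solve are all assumptions made for illustration. The solution is written as x = y + E z; when a component of y is lost to a fault it is held at zero, and the redundant unknowns z absorb the missing information.

```python
from fractions import Fraction


def solve(M, rhs):
    """Solve M u = rhs by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * c for a, c in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def recover_with_erasure(A, b, E, failed):
    """Recover x with A x = b when the components of y listed in `failed` are erased.

    Hypothetical helper: x is represented as y + E z, with y[i] = 0 for i in failed.
    Requires len(failed) == number of columns of E, so the reduced system is square.
    """
    n, k = len(A), len(E[0])
    assert len(failed) == k, "this sketch handles exactly k erasures"
    alive = [i for i in range(n) if i not in failed]
    # Columns of B: unit vectors for surviving components of y, then the columns of E.
    B = [[1 if i == j else 0 for j in alive] + list(E[i]) for i in range(n)]
    # Reduced system (A B) u = b, where u stacks the surviving y entries and z.
    M = [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)] for i in range(n)]
    u = solve(M, b)
    # Map back to the true solution x = B u = y + E z.
    return [sum(B[i][j] * u[j] for j in range(n)) for i in range(n)]


# Example: 4x4 tridiagonal system whose true solution is [1, 2, 3, 4].
A = [[4, 1, 0, 0], [1, 4, 1, 0], [0, 1, 4, 1], [0, 0, 1, 4]]
b = [6, 12, 18, 19]
E = [[1], [1], [1], [1]]  # one redundant column (assumed encoding)
x = recover_with_erasure(A, b, E, failed={2})
```

In this example component 2 of y is erased, yet `x` still equals the true solution because the single redundant unknown carries the lost value. The recovery step is cheap compared with the solve, which is the property the abstract emphasizes.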

Fault-Tolerant Protocols Using Compilers and Translators

Application-Layer Fault-Tolerance Protocols ◽

10.4018/978-1-60566-182-7.ch004 ◽

2009 ◽

pp. 133-160

Author(s):

Vincenzo De Florio

Keyword(s):

Fault Tolerance ◽

Programming Languages ◽

Data Structures ◽

Error Detection ◽

Fault Tolerant ◽

Error Recovery ◽

Application Layer ◽

Redundant Data ◽

New Research ◽

The University

In this chapter our survey of methods and structures for application-level fault-tolerance continues, getting closer to the programming language: indeed, tools such as compilers and translators work at the level of the language (they parse, interpret, compile or transform our programs), so they are interesting candidates for managing dependability aspects in the application layer. An important property of this family of methods is the fact that fault-tolerance complexity is extracted from the program and turned into architectural complexity in the compiler or the translator. Apart from continuing with our survey, this chapter also aims at providing the reader with two practical examples:

• Reflective and refractive variables, that is, a syntactical structure to express adaptive feedback loops in the application layer. This is useful to resilient computing because a feedback loop can attach error recovery strategies to error detection events.

• Redundant variables, that is, a tool that allows designers to make use of adaptively redundant data structures with commodity programming languages such as C or Java. Designers using such tools can define redundant data structures in which the degree of redundancy is not fixed once and for all at design time, but rather changes dynamically with respect to the disturbances experienced at run time.

Both tools are new research activities that are currently being carried out by the author of this book at the PATS research group of the University of Antwerp. It is shown how through a simple translation approach it is possible to provide sophisticated features such as adaptive fault-tolerance to programs written in any language, even plain old C.
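The notion of a redundant variable whose degree of redundancy changes at run time can be sketched as follows. The chapter targets C and Java through a translator; Python is used here purely for illustration, and the class name `RedundantVar`, the majority-vote read, and the grow-by-two policy are hypothetical choices rather than the author's design.

```python
from collections import Counter


class RedundantVar:
    """Hypothetical adaptively redundant variable (values must be hashable)."""

    def __init__(self, value, replicas=3, max_replicas=9):
        self._copies = [value] * replicas
        self._max = max_replicas

    @property
    def redundancy(self):
        """Current number of replicas."""
        return len(self._copies)

    def write(self, value):
        # A write refreshes every replica.
        self._copies = [value] * len(self._copies)

    def read(self):
        # Majority vote masks a minority of corrupted replicas.
        value, votes = Counter(self._copies).most_common(1)[0]
        if votes < len(self._copies):
            # Disagreement detected: repair the replicas and raise redundancy
            # (assumed policy: add two replicas, up to a fixed ceiling).
            size = min(len(self._copies) + 2, self._max)
            self._copies = [value] * size
        return value

    def corrupt(self, index, value):
        # Fault-injection hook, used only to demonstrate the behaviour.
        self._copies[index] = value


v = RedundantVar(42)
v.corrupt(0, -1)       # simulate a memory fault in one replica
recovered = v.read()   # vote returns the majority value and triggers adaptation
```

The read path is where error detection (disagreement among replicas) is tied to error recovery (repair plus increased redundancy), which is the feedback loop the abstract describes.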

Fault Tolerant Guidance of Under-Actuated Satellite Formation Flying Using Inter-Vehicle Coulomb Force

10.30699/ijrrs.2.1.6 ◽

2019 ◽

Vol 2 (1) ◽

pp. 43-52

Author(s):

Alireza Alikhani ◽

Safa Dehghan M ◽

Iman Shafieenejad

Keyword(s):

Fault Tolerance ◽

Formation Flying ◽

Fault Tolerant ◽

Coulomb Force ◽

Control Law ◽

Triangular Form ◽

Satellite Formation Flying ◽

Satellite Formation ◽

Guidance Law ◽

Tolerance Approach

This study investigates satellite formation flying guidance in the presence of under-actuation using inter-vehicle Coulomb forces. The Coulomb forces are used to stabilize the formation flying mission. For this purpose, the charge of each satellite is determined so as to create appropriate attraction and repulsion and to maintain the distance between satellites. The equations of a static Coulomb formation of three satellites in triangular form are developed, and the charge value of the Coulomb propulsion system required for such a formation is obtained. Considering under-actuation of one of the formation satellites, a fault-tolerance approach is proposed for achieving the mission goals. Following this approach, a fault-tolerant guidance law is designed in the first step, and the results obtained show a stationary formation. In the next step, a fault-tolerant control law is designed to maintain the formation shape and dimension.

Fault Tolerant Reliable Protocol (FTRP) Performance Evaluation in Wireless Sensor Networks: An Extensitive Study

Journal of Electronics and Sensors ◽

10.31829/2689-6958/jes2019-2(1)-107 ◽

2019 ◽

pp. 1-36

Keyword(s):

Wireless Sensor Networks ◽

Fault Tolerance ◽

Sensor Networks ◽

Fault Tolerant ◽

Cluster Head ◽

Biological Data ◽

Wireless Sensor ◽

Delivery Ratio ◽

Good Performer ◽

Wide Range

Fault Tolerant Reliable Protocol (FTRP) is proposed as a novel routing protocol designed for Wireless Sensor Networks (WSNs). FTRP offers fault-tolerant, reliable packet exchange and support for dynamic network changes. The key concept is logical clustering of nodes. The protocol delegates routing ownership to the cluster heads, where the fault tolerance functionality is implemented. FTRP uses cluster head nodes along with cluster head groups to store packets in transit. In addition, FTRP uses broadcast, which reduces message overhead compared to classical flooding mechanisms. FTRP manipulates Time to Live values for the various routing messages to control message broadcast, and it applies jitter to message transmission to reduce the effect of synchronized node states, which in turn reduces collisions. FTRP performance has been extensively evaluated through simulations against Ad-hoc On-demand Distance Vector (AODV) and Optimized Link State (OLSR) routing protocols. Packet Delivery Ratio (PDR), aggregate throughput, and end-to-end delay (E-2-E) were used as performance metrics. In terms of PDR and aggregate throughput, FTRP is an excellent performer in all mobility scenarios, whether the network is sparse or dense. In stationary scenarios, FTRP performed well in sparse networks; in dense networks its performance degraded but remained within an acceptable range. This degradation is attributed to synchronized node states. Reliable message delivery comes at a cost in terms of E-2-E delay. Results show that FTRP is a good performer in all mobility scenarios where the network is sparse. In sparse stationary scenarios FTRP is also a good performer; however, in dense stationary scenarios its E-2-E delay is not acceptable. There are times when receiving a network message is more important than other costs such as energy or delay. That makes FTRP suitable for a wide range of WSN applications, such as military applications that monitor soldiers' biological data and supplies on the battlefield and perform battle damage assessment. FTRP can also be used in health applications, as well as in a wide range of geo-fencing, environmental monitoring, resource monitoring, production line monitoring, agriculture, and animal tracking applications. FTRP should be avoided in dense stationary deployments such as, but not limited to, scenarios where fast application response is critical and life is endangered, such as biohazard detection or intensive care units.
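Two of the mechanisms named in this abstract, Time to Live scoping of broadcasts and the Packet Delivery Ratio metric, can be sketched generically. This is not FTRP's implementation: the function names, the adjacency-dictionary topology, and the hop-count interpretation of Time to Live are assumptions for illustration.

```python
from collections import deque


def broadcast_reach(adjacency, source, ttl):
    """Return the set of nodes a broadcast from `source` reaches within `ttl` hops.

    `adjacency` maps each node to an iterable of its neighbours (assumed format).
    """
    reached = {source}
    frontier = deque([(source, ttl)])
    while frontier:
        node, remaining = frontier.popleft()
        if remaining == 0:
            continue  # Time to Live exhausted: this node does not rebroadcast
        for neighbour in adjacency.get(node, ()):
            if neighbour not in reached:
                reached.add(neighbour)
                frontier.append((neighbour, remaining - 1))
    return reached


def packet_delivery_ratio(delivered, sent):
    """PDR: fraction of sent packets that were delivered (0.0 if nothing was sent)."""
    return delivered / sent if sent else 0.0


# Example: a four-node chain a - b - c - d.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
two_hop = broadcast_reach(chain, "a", ttl=2)
```

Lowering the Time to Live shrinks the set of nodes that ever see a routing message, which is one way a protocol can keep broadcast overhead below that of unrestricted flooding.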

Fault Analysis and Non-Redundant Fault Tolerance in 3-Level Double Conversion UPS Systems Using Finite-Control-Set Model Predictive Control

Energies ◽

10.3390/en14082210 ◽

2021 ◽

Vol 14 (8) ◽

pp. 2210

Author(s):

Luís Caseiro ◽

André Mendes

Keyword(s):

Fault Tolerance ◽

Model Predictive Control ◽

Predictive Control ◽

Fault Tolerant ◽

Finite Control ◽

Control Set ◽

Corrective Actions ◽

Uninterruptible Power ◽

Hardware Reconfiguration ◽

Redundant Fault

Fault-tolerance is critical in power electronics, especially in Uninterruptible Power Supplies, given their role in protecting critical loads. Hence, it is crucial to develop fault-tolerant techniques to improve the resilience of these systems. This paper proposes a non-redundant fault-tolerant double conversion uninterruptible power supply based on 3-level converters. The proposed solution can correct open-circuit faults in all semiconductors (IGBTs and diodes) of all converters of the system (including the DC-DC converter), ensuring full-rated post-fault operation. This technique leverages the versatility of Finite-Control-Set Model Predictive Control to implement highly specific fault correction. This type of control enables a conditional exclusion of the switching states affected by each fault, allowing the converter to avoid these states when the fault compromises their output but still use them in all other conditions. Three main types of corrective actions are used: predictive controller adaptations, hardware reconfiguration, and DC bus voltage adjustment. However, highly differentiated corrective actions are taken depending on the fault type and location, maximizing post-fault performance in each case. Faults can be corrected simultaneously in all converters, as well as some combinations of multiple faults in the same converter. Experimental results are presented demonstrating the performance of the proposed solution.
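The conditional exclusion of switching states described in this abstract can be sketched with a toy finite-control-set predictive controller. Everything below is an assumption for illustration: a single converter leg with three output levels, a resistive-inductive load model discretized with forward Euler, the parameter values, and the rule that decides when a state is compromised. It is not the controller proposed in the paper.

```python
def predict_current(i_now, level, v_dc=400.0, r=0.5, l=5e-3, ts=1e-4, e=0.0):
    """One-step forward-Euler prediction of the load current for an output level.

    Model (assumed): L di/dt = v_out - R i - e, with v_out = level * v_dc / 2.
    """
    v_out = level * v_dc / 2
    return i_now + ts / l * (v_out - r * i_now - e)


def choose_state(i_now, i_ref, compromised=lambda level, i: False):
    """Pick the output level whose predicted current is closest to the reference.

    `compromised(level, i)` is a hypothetical predicate: it returns True when a
    fault makes that level unusable under the present operating condition.
    """
    candidates = [lv for lv in (-1, 0, 1) if not compromised(lv, i_now)]
    return min(candidates, key=lambda lv: abs(i_ref - predict_current(i_now, lv)))


def upper_switch_open(level, i):
    # Assumed fault rule: the positive level is lost only while current is non-negative.
    return level == 1 and i >= 0


healthy = choose_state(0.0, 5.0)
faulty_blocked = choose_state(0.0, 5.0, compromised=upper_switch_open)
faulty_allowed = choose_state(-1.0, 5.0, compromised=upper_switch_open)
```

Because the exclusion is evaluated at every control step, a state is avoided only while the fault actually compromises it and remains available otherwise, which is the behaviour the abstract attributes to this type of control.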

A Fault tolerant architecture of nine-level inverter with single and multiple switch fault-tolerance capabilities

2020 IEEE International Conference on Power Electronics, Drives and Energy Systems (PEDES) ◽

10.1109/pedes49360.2020.9379860 ◽

2020 ◽

Author(s):

Chiranjeevi Sadanala ◽

Swapnajit Pattnaik ◽

Vinay Pratap Singh

Keyword(s):

Fault Tolerance ◽

Fault Tolerant

An FTC Design via Multiple SOGIs with Suppression of Harmonic Disturbances for Five-Phase PMSG-Based Tidal Current Applications

Journal of Marine Science and Engineering ◽

10.3390/jmse9060574 ◽

2021 ◽

Vol 9 (6) ◽

pp. 574

Author(s):

Zhuo Liu ◽

Tianhao Tang ◽

Azeddine Houari ◽

Mohamed Machmoum ◽

Mohamed Fouad Benkhoris

Keyword(s):

Fault Tolerance ◽

Energy Conversion ◽

Tidal Current ◽

Fault Tolerant ◽

Tidal Current Energy ◽

Electromagnetic Forces ◽

Model Free ◽

Energy Conversion Systems ◽

Power Scale ◽

Current Energy

This paper first adopts a fault accommodation structure, a five-phase permanent magnet synchronous generator (PMSG) with trapezoidal back-electromagnetic forces, in order to enhance the fault tolerance of tidal current energy conversion systems. Meanwhile, a fault-tolerant control (FTC) method is proposed using multiple second-order generalized integrators (multiple SOGIs) to further improve the system's fault tolerance. Then, additional harmonic disturbances from phase current or back-electromagnetic forces in the original and Park's frames are characterized under a single-phase open condition. Relying on a classical field-oriented vector control scheme, fault-tolerant composite controllers are then reconfigured using multiple SOGIs by compensating q-axis control commands. Finally, a real power-scale simulation setup with a gearless back-to-back tidal current energy conversion chain and a small power-scale laboratory prototype on the machine side are established to comprehensively validate the feasibility and fault tolerance of the proposed method. Simulation results show that the proposed method is able to suppress the main harmonic disturbances and maintain satisfactory fault tolerance when the third harmonic flux varies. Experimental results reveal that the proposed model-free fault-tolerant design is simple to implement and contributes to better fault-tolerant behavior, higher power quality, and lower copper losses. The main advantage of the multiple SOGIs lies in convenient online implementation and efficient multi-harmonic extraction, without considering the system's model parameters. The proposed FTC design provides a model-free fault-tolerant solution for the energy harvesting process of actual tidal current energy conversion systems under different working conditions.

A Design for Fault-Tolerant Communication Middleware Based on Time-Triggered

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.548-549.1326 ◽

2014 ◽

Vol 548-549 ◽

pp. 1326-1329

Author(s):

Juan Jin ◽

Qing Fan Gu

Keyword(s):

Fault Tolerance ◽

Real Time ◽

Fault Location ◽

Fault Tolerant ◽

Application Software ◽

Tolerance Mechanisms ◽

The Real ◽

Communication Middleware ◽

Fault Mechanism ◽

Health Diagnosis

To address the persistent problems of health diagnosis, fault location, and fault tolerance mechanisms in current avionics applications, this paper proposes a fault-tolerant communication middleware based on a time-triggered mechanism. The middleware is designed to provide a support platform for real-time applications built on communication middleware. Working at the communication middleware level, and combining the time-triggered mechanism with a fault-tolerant strategy, it first diagnoses general faults and then routes them to the appropriate fault-handling mechanism for processing. The middleware thus completely separates fault-tolerant processing from the application software functions.

Efficient Fault Tolerance on Cloud Environments

International Journal of Cloud Applications and Computing ◽

10.4018/ijcac.2018070102 ◽

2018 ◽

Vol 8 (3) ◽

pp. 20-31 ◽

Cited By ~ 3

Author(s):

Sam Goundar ◽

Akashdeep Bhardwaj

Keyword(s):

Fault Tolerance ◽

Web Applications ◽

Fault Tolerant ◽

Cloud Services ◽

Level Of Service ◽

Cloud Environments ◽

Computing Environments ◽

Service Assurance ◽

Mission Critical ◽

The Impact

With mission-critical web applications and resources hosted in cloud environments, and cloud services growing fast, the need for a greater level of service assurance regarding fault tolerance, for both availability and reliability, has increased. The high priority now is ensuring a fault-tolerant environment that keeps systems up and running. To minimize the impact of downtime or accessibility failures caused by systems, network devices, or hardware, such failures are expected to be anticipated and handled proactively in a fast, intelligent way. This article discusses a fault tolerance system for cloud computing environments and analyzes whether it is effective for cloud environments.

A Fault-Tolerant Data Center Network Structure

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.734-737.3048 ◽

2013 ◽

Vol 734-737 ◽

pp. 3048-3052

Author(s):

Peng Wang ◽

Yan Lv ◽

Yu Tan

Keyword(s):

Fault Tolerance ◽

Network Structure ◽

Data Center ◽

Fault Tolerant ◽

Data Center Network ◽

Advantages And Disadvantages

Based on the advantages and disadvantages of the traditional data center network structure, this paper proposes a new data center network structure based on BCube and DCell. The new structure is mainly improved in terms of scalability, fault tolerance, the throughput

Issues of organizing computations in multicomputer systems with the software-controlled failure- and fault-tolerance. Part 1

Engineering Journal Science and Innovation ◽

10.18698/2308-6033-2021-6-2088 ◽

2021 ◽

Author(s):

I.V. Asharina

Keyword(s):

Fault Tolerance ◽

Critical Level ◽

Fault Tolerant ◽

Communication Channels ◽

State Standards ◽

Modern Concept ◽

Multiple Faults ◽

Point To Point ◽

Definition Of ◽

Test Diagnostics

This three-part paper analyzes existing approaches and methods of organizing failure- and fault-tolerant computing in distributed multicomputer systems (DMCS), identifies and provides rationale for a list of issues to be solved. We present the concept of fault tolerance proposed by A. Avizienis, explicate its dissimilarity from the modern concept and the reason for its inapplicability with regard to modern distributed multicomputer systems. We justify the necessity to refine the definition of fault tolerance approved by the State Standards, as well as the necessity to specify three input parameters to be taken into account in the DMCS design methods: permitted fault models, permitted multiplicity of faults, permitted fault sequence capabilities. We formulate the questions that must be answered in order to design a truly reliable, fault-tolerant system and consider the application areas of the failure- and fault-tolerant control systems for complex network and distributed objects. System, functional, and test diagnostics serve as the basis for building unattended failure- and fault-tolerant systems. The concept of self-managed degradation (with the DMCS eventually proceeding to a safe shutdown at a critical level of degradation) is a means to increase the DMCS active life. We consider the issues related to the diagnosis of multiple faults and present the main differences in ensuring fault tolerance between systems with broadcast communication channels and systems with point-to-point communication channels. The first part of the work mainly deals with the analysis of existing approaches and methods of organizing failure- and fault-tolerant computing in DMCS and the definition of the concept of fault-tolerance.