Adaptive Erasure Coded Fault Tolerant Linear System Solver

2021 ◽  
Vol 8 (4) ◽  
pp. 1-19
Author(s):  
Xuejiao Kang ◽  
David F. Gleich ◽  
Ahmed Sameh ◽  
Ananth Grama

As parallel and distributed systems scale, fault tolerance is an increasingly important problem—particularly on systems with limited I/O capacity and bandwidth. Erasure coded computations address this problem by augmenting a given problem instance with redundant data and then solving the augmented problem in a fault oblivious manner in a faulty parallel environment. In the event of faults, a computationally inexpensive procedure is used to compute the true solution from a potentially fault-prone solution. These techniques are significantly more efficient than conventional solutions to the fault tolerance problem. In this article, we show how we can minimize, to optimality, the overhead associated with our problem augmentation techniques for linear system solvers. Specifically, we present a technique that adaptively augments the problem only when faults are detected. At any point in execution, we only solve a system whose size is identical to the original input system. This has several advantages in terms of maintaining the size and conditioning of the system, as well as in only adding the minimal amount of computation needed to tolerate observed faults. We present, in detail, the augmentation process, the parallel formulation, and evaluation of performance of our technique. Specifically, we show that the proposed adaptive fault tolerance mechanism has minimal overhead in terms of FLOP counts with respect to the original solver executing in a non-faulty environment, has good convergence properties, and yields excellent parallel performance. We also demonstrate that our approach significantly outperforms an optimized application-level checkpointing scheme that only checkpoints needed data structures.

Author(s):  
Vincenzo De Florio

In this chapter our survey of methods and structures for application-level fault-tolerance continues, getting closer to the programming language. Indeed, tools such as compilers and translators work at the level of the language: they parse, interpret, compile or transform our programs, so they are interesting candidates for managing dependability aspects in the application layer. An important property of this family of methods is the fact that fault-tolerance complexity is extracted from the program and turned into architectural complexity in the compiler or the translator. Apart from continuing with our survey, this chapter also aims at providing the reader with two practical examples:

• Reflective and refractive variables, that is, a syntactical structure to express adaptive feedback loops in the application layer. This is useful to resilient computing because a feedback loop can attach error recovery strategies to error detection events.
• Redundant variables, that is, a tool that allows designers to make use of adaptively redundant data structures with commodity programming languages such as C or Java. Designers using such tools can define redundant data structures in which the degree of redundancy is not fixed once and for all at design time, but rather changes dynamically with respect to the disturbances experienced at run time.

Both tools are new research activities that are currently being carried out by the author of this book at the PATS research group of the University of Antwerp. It is shown how, through a simple translation approach, it is possible to provide sophisticated features such as adaptive fault-tolerance to programs written in any language, even plain old C.
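The idea of a redundant variable whose degree of redundancy follows the observed disturbances can be sketched as below. This is only an illustration under stated assumptions, not the chapter's tool: the class name `RedundantVariable`, the majority-voting read, the grow-by-two policy, and the `corrupt` test hook are all hypothetical choices made for this sketch.

```python
from collections import Counter


class RedundantVariable:
    """Hypothetical adaptively redundant variable (illustrative sketch only).

    Stores several replicas of one hashable value, reads by majority vote,
    and adds two replicas whenever a read detects a disagreement.
    """

    def __init__(self, value, replicas=3, max_replicas=9):
        if replicas < 3 or replicas % 2 == 0:
            raise ValueError("replicas must be an odd number >= 3")
        self.max_replicas = max_replicas
        self._copies = [value] * replicas

    @property
    def redundancy(self):
        """Current number of replicas."""
        return len(self._copies)

    def write(self, value):
        """Overwrite every replica with the new value."""
        self._copies = [value] * len(self._copies)

    def corrupt(self, index, value):
        """Test hook: simulate a fault that damages a single replica."""
        self._copies[index] = value

    def read(self):
        """Return the majority value, repair replicas, adapt redundancy."""
        winner, votes = Counter(self._copies).most_common(1)[0]
        if votes <= len(self._copies) // 2:
            raise RuntimeError("no majority among replicas")
        size = len(self._copies)
        # A disagreement is treated as an observed disturbance:
        # raise the redundancy (assumed policy: +2, capped at max_replicas).
        if votes < size and size + 2 <= self.max_replicas:
            size += 2
        self._copies = [winner] * size
        return winner
```

In this sketch the redundancy only grows; a fuller policy would presumably also shrink it again after a run of fault-free reads.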


2019 ◽  
Vol 2 (1) ◽  
pp. 43-52
Author(s):  
Alireza Alikhani ◽  
Safa Dehghan M ◽  
Iman Shafieenejad

In this study, satellite formation flying guidance in the presence of underactuation using inter-vehicle Coulomb force is investigated. The Coulomb forces are used to stabilize the formation flying mission. For this purpose, the charge of the satellites is determined so as to create appropriate attraction and repulsion and to maintain the distance between satellites. The equations of a static Coulomb formation of three satellites in triangular form were developed. Furthermore, the charge value of the Coulomb propulsion system required for such a formation was obtained. Considering underactuation of one of the formation satellites, a fault-tolerance approach is proposed for achieving the mission goals. Following this approach, in the first step a fault-tolerant guidance law is designed. The obtained results show a stationary formation. In the next step, to maintain the formation shape and dimension, a fault-tolerant control law is designed.


Fault Tolerant Reliable Protocol (FTRP) is proposed as a novel routing protocol designed for Wireless Sensor Networks (WSNs). FTRP offers fault tolerance, reliability for packet exchange, and support for dynamic network changes. The key concept is logical clustering of nodes. The protocol delegates routing ownership to the cluster heads, where fault tolerance functionality is implemented. FTRP utilizes cluster head nodes along with cluster head groups to store packets in transit. In addition, FTRP utilizes broadcast, which reduces the message overhead as compared to classical flooding mechanisms. FTRP manipulates Time to Live values for the various routing messages to control message broadcast. FTRP utilizes jitter in message transmission to reduce the effect of synchronized node states, which in turn reduces collisions. FTRP performance has been extensively evaluated through simulations against the Ad-hoc On-demand Distance Vector (AODV) and Optimized Link State Routing (OLSR) protocols. Packet Delivery Ratio (PDR), Aggregate Throughput and End-to-End delay (E-2-E) were used as performance metrics. In terms of PDR and aggregate throughput, FTRP is an excellent performer in all mobility scenarios, whether the network is sparse or dense. In stationary scenarios, FTRP performed well in sparse networks; in dense networks its performance degraded but remained within an acceptable range. This degradation is attributed to synchronized node states. Reliably delivering a message comes at a cost in terms of E-2-E delay. Results show that FTRP is a good performer in all mobility scenarios where the network is sparse. In sparse stationary scenarios, FTRP is also a good performer; however, in dense stationary scenarios FTRP’s E-2-E delay is not acceptable. There are times when receiving a network message is more important than other costs such as energy or delay. That makes FTRP suitable for a wide range of WSN applications, such as military applications that monitor soldiers’ biological data and supplies in the battlefield and perform battle damage assessment. FTRP can also be used in health applications, in addition to a wide range of geo-fencing, environmental monitoring, resource monitoring, production line monitoring, agriculture and animal tracking applications. FTRP should be avoided in dense stationary deployments, such as, but not limited to, scenarios where fast application response is critical and lives are at risk, for example biohazard detection or intensive care units.
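The two broadcast controls mentioned in the abstract, Time to Live values and transmission jitter, can be illustrated with a small simulation. This is a generic sketch and not FTRP itself: the function `ttl_broadcast`, the topology dictionary, the uniform jitter distribution, and the zero propagation delay are all assumptions made for the example.

```python
import heapq
import random


def ttl_broadcast(neighbors, source, ttl, max_jitter=0.01, seed=0):
    """Simulate a TTL-limited broadcast with random transmission jitter.

    neighbors  -- dict mapping each node to its neighbor nodes (hypothetical topology)
    ttl        -- number of hops the message is allowed to travel
    max_jitter -- upper bound of the random delay before each (re)transmission
    Returns a dict {node: time of first reception}.
    """
    rng = random.Random(seed)
    received = {source: 0.0}
    # Each queue entry is (transmit_time, node, remaining_ttl).
    queue = [(rng.uniform(0, max_jitter), source, ttl)]
    while queue:
        now, node, remaining = heapq.heappop(queue)
        if remaining <= 0:
            continue  # TTL exhausted: the message is not forwarded further
        for nxt in neighbors.get(node, ()):
            if nxt in received:
                continue  # duplicate copies are ignored
            received[nxt] = now
            # Jitter de-synchronizes rebroadcasts of nodes that heard the
            # same transmission, so they do not all send at the same instant.
            delay = rng.uniform(0, max_jitter)
            heapq.heappush(queue, (now + delay, nxt, remaining - 1))
    return received
```

On a four-node chain `a-b-c-d`, a broadcast from `a` with `ttl=2` reaches `b` and `c` but not `d`. In denser topologies the first copy to arrive may have travelled a longer path and carry a smaller remaining TTL, which this simple sketch does not compensate for.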


Energies ◽  
2021 ◽  
Vol 14 (8) ◽  
pp. 2210
Author(s):  
Luís Caseiro ◽  
André Mendes

Fault-tolerance is critical in power electronics, especially in Uninterruptible Power Supplies, given their role in protecting critical loads. Hence, it is crucial to develop fault-tolerant techniques to improve the resilience of these systems. This paper proposes a non-redundant fault-tolerant double conversion uninterruptible power supply based on 3-level converters. The proposed solution can correct open-circuit faults in all semiconductors (IGBTs and diodes) of all converters of the system (including the DC-DC converter), ensuring full-rated post-fault operation. This technique leverages the versatility of Finite-Control-Set Model Predictive Control to implement highly specific fault correction. This type of control enables a conditional exclusion of the switching states affected by each fault, allowing the converter to avoid these states when the fault compromises their output but still use them in all other conditions. Three main types of corrective actions are used: predictive controller adaptations, hardware reconfiguration, and DC bus voltage adjustment. However, highly differentiated corrective actions are taken depending on the fault type and location, maximizing post-fault performance in each case. Faults can be corrected simultaneously in all converters, as well as some combinations of multiple faults in the same converter. Experimental results are presented demonstrating the performance of the proposed solution.


2021 ◽  
Vol 9 (6) ◽  
pp. 574
Author(s):  
Zhuo Liu ◽  
Tianhao Tang ◽  
Azeddine Houari ◽  
Mohamed Machmoum ◽  
Mohamed Fouad Benkhoris

This paper first adopts a fault accommodation structure, a five-phase permanent magnet synchronous generator (PMSG) with trapezoidal back-electromotive forces, in order to enhance the fault tolerance of tidal current energy conversion systems. A fault-tolerant control (FTC) method is then proposed using multiple second-order generalized integrators (multiple SOGIs) to further improve the fault tolerance of the system. Next, the additional harmonic disturbances from the phase currents or back-electromotive forces in the original and Park’s frames are characterized under a single-phase open condition. Relying on a classical field-oriented vector control scheme, fault-tolerant composite controllers are reconfigured using multiple SOGIs by compensating the q-axis control commands. Finally, a real power-scale simulation setup with a gearless back-to-back tidal current energy conversion chain and a small power-scale laboratory prototype on the machine side are established to comprehensively validate the feasibility and fault tolerance of the proposed method. Simulation results show that the proposed method is able to suppress the main harmonic disturbances and maintain a satisfactory fault tolerance when the third harmonic flux varies. Experimental results reveal that the proposed model-free fault-tolerant design is simple to implement and contributes to better fault-tolerant behavior, higher power quality and lower copper losses. The main advantage of the multiple SOGIs lies in convenient online implementation and efficient multi-harmonic extraction, without considering the system’s model parameters. The proposed FTC design provides a model-free fault-tolerant solution for the energy harvesting process of actual tidal current energy conversion systems under different working conditions.


2014 ◽  
Vol 548-549 ◽  
pp. 1326-1329
Author(s):  
Juan Jin ◽  
Qing Fan Gu

To address the unsustainable problems of health diagnosis, fault location and fault tolerance mechanisms that exist in current avionics applications, this paper proposes a fault-tolerant communication middleware based on a time-triggered mechanism. The middleware is designed to provide a support platform for real-time applications built on communication middleware. Working at the communication middleware level and combining the time-triggered mechanism with a fault-tolerant strategy, it first diagnoses general faults and then routes them to the appropriate fault-handling mechanism for processing. The middleware thus completely separates fault-tolerant processing from the application software functions.


2018 ◽  
Vol 8 (3) ◽  
pp. 20-31
Author(s):  
Sam Goundar ◽  
Akashdeep Bhardwaj

With mission-critical web applications and resources being hosted on cloud environments, and cloud services growing fast, the need for a greater level of service assurance regarding fault tolerance for availability and reliability has increased. The high priority now is ensuring a fault-tolerant environment that can keep the systems up and running. To minimize the impact of downtime or accessibility failure due to systems, network devices or hardware, such failures are expected to be anticipated and handled proactively in a fast, intelligent way. This article discusses the fault tolerance system for cloud computing environments and analyzes whether it is effective for cloud environments.


2013 ◽  
Vol 734-737 ◽  
pp. 3048-3052
Author(s):  
Peng Wang ◽  
Yan Lv ◽  
Yu Tan

Based on the advantages and disadvantages of the traditional data center network structure, this paper proposes a new data center network structure based on BCube and DCell. The new structure is mainly improved in terms of scalability, fault tolerance, the throughput


Author(s):  
I.V. Asharina

This three-part paper analyzes existing approaches and methods of organizing failure- and fault-tolerant computing in distributed multicomputer systems (DMCS), identifies and provides rationale for a list of issues to be solved. We present the concept of fault tolerance proposed by A. Avizienis, explicate its dissimilarity from the modern concept and the reason for its inapplicability with regard to modern distributed multicomputer systems. We justify the necessity to refine the definition of fault tolerance approved by the State Standards, as well as the necessity to specify three input parameters to be taken into account in the DMCS design methods: permitted fault models, permitted multiplicity of faults, permitted fault sequence capabilities. We formulate the questions that must be answered in order to design a truly reliable, fault-tolerant system and consider the application areas of the failure- and fault-tolerant control systems for complex network and distributed objects. System, functional, and test diagnostics serve as the basis for building unattended failure- and fault-tolerant systems. The concept of self-managed degradation (with the DMCS eventually proceeding to a safe shutdown at a critical level of degradation) is a means to increase the DMCS active life. We consider the issues related to the diagnosis of multiple faults and present the main differences in ensuring fault tolerance between systems with broadcast communication channels and systems with point-to-point communication channels. The first part of the work mainly deals with the analysis of existing approaches and methods of organizing failure- and fault-tolerant computing in DMCS and the definition of the concept of fault-tolerance.

