Providing fault tolerance through invasive computing

2016 ◽  
Vol 58 (6) ◽  
Author(s):  
Vahid Lari ◽  
Andreas Weichslgartner ◽  
Alexandru Tanase ◽  
Michael Witterauf ◽  
Faramarz Khosravi ◽  
...  

As a consequence of technology scaling, today's complex multi-processor systems have become more and more susceptible to errors. In order to satisfy reliability requirements, such systems require methods to detect and tolerate errors. This entails two major challenges: (a) providing a comprehensive approach that ensures fault-tolerant execution of parallel applications across different types of resources, and (b) optimizing resource usage in the face of dynamic fault probabilities or varying fault tolerance needs of different applications. In this paper, we present a holistic and adaptive approach, based on invasive computing, that provides fault tolerance on a Multi-Processor System-on-a-Chip (MPSoC) on demand, according to application or environmental needs. We show how invasive computing may provide adaptive fault tolerance on a heterogeneous MPSoC including hardware accelerators and communication infrastructure such as a Network-on-Chip (NoC). In addition, we present (a) compile-time transformations to automatically adopt well-known redundancy schemes such as Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR) for fault-tolerant loop execution on a class of massively parallel arrays of processors called Tightly Coupled Processor Arrays (TCPAs). Based on timing characteristics derived from our compilation flow, we further develop (b) a reliability analysis guiding the selection of a suitable degree of fault tolerance. Finally, we present (c) a methodology to detect and adaptively mitigate faults in invasive NoCs.
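
As a rough illustration of the two redundancy schemes named in the abstract above, the following Python sketch shows how DMR can detect a mismatch between two replicas and how TMR can mask one faulty replica by majority vote. The function names `dmr_check` and `tmr_vote` are hypothetical, and the code operates on plain values; it is not the paper's compile-time loop transformation.

```python
from collections import Counter


def dmr_check(a, b):
    """Dual Modular Redundancy: two replicas can detect, but not correct, a mismatch."""
    return a == b


def tmr_vote(a, b, c):
    """Triple Modular Redundancy: return the majority value of three replicas.

    Raises ValueError if all three replicas disagree, since no majority exists.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority among replicas")
    return value
```

With DMR a mismatch can only trigger a re-execution or an error signal, whereas TMR can continue with the voted value as long as at most one replica is wrong.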

2021 ◽  
Author(s):  
Gopalakrishnan Sundararajan

This chapter presents a solution for fault tolerance in Multi-Valued Logic (MVL) circuits composed of Carbon Nano-Tube Field Effect Transistors (CNTFETs). It reviews the basic primitives of MVL and describes ternary implementations of CNTFET circuits. Finally, it describes a method for error correction called Restorative Feedback (RFB). The RFB method is a variant of Triple-Modular Redundancy (TMR) that utilizes the fault-masking capabilities of the Muller C element to provide added protection against noisy transient faults. The fault-tolerant properties of the Muller C element are discussed, and the error correction capability of the RFB method is demonstrated in detail.
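
The fault-masking behaviour of a Muller C element can be sketched in a few lines of Python. This is a binary, two-input simplification with a hypothetical class name `MullerC`; the chapter itself deals with ternary CNTFET circuits, and the sketch does not model the RFB circuit.

```python
class MullerC:
    """Two-input Muller C element: the output follows the inputs only when they agree."""

    def __init__(self, initial=0):
        self.state = initial

    def step(self, a, b):
        # When both inputs agree, the output switches to that value;
        # otherwise the element holds its previous output.
        if a == b:
            self.state = a
        return self.state
```

Because the output changes only when both inputs agree, a transient disturbance on a single input leaves the stored value untouched.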


Author(s):  
Dimitar Nikolov ◽  
Mikael Väyrynen ◽  
Urban Ingelsson ◽  
Virendra Singh ◽  
Erik Larsson

While the rapid development in semiconductor technologies makes it possible to manufacture integrated circuits (ICs) with multiple processors, so-called Multi-Processor Systems-on-Chip (MPSoCs), ICs manufactured in recent semiconductor technologies are becoming increasingly susceptible to transient faults, which makes fault tolerance necessary. Work on fault tolerance has mainly focused on safety-critical applications; however, the development of semiconductor technologies means that fault tolerance is also needed for general-purpose systems. Unlike safety-critical systems, where meeting hard deadlines is the main requirement, for general-purpose systems it is more important to minimize the average execution time (AET). The contribution of this chapter is two-fold. First, the authors present a mathematical framework for the analysis of AET. Their analysis of AET is performed for voting, rollback recovery with checkpointing (RRC), and the combination of RRC and voting (CRV). For a given job and soft (transient) error probability, the authors define mathematical formulas for each of the fault-tolerant techniques with the objective of minimizing AET while taking bus communication overhead into account. In addition, for a given number of processors and jobs, the authors define integer linear programming models that minimize AET including communication overhead. Second, as the error probability is not known at design time and can change during operation, they present two techniques, periodic probability estimation (PPE) and aperiodic probability estimation (APE), to estimate the error probability and adjust the fault-tolerant scheme while the IC is in operation.
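
To give a feel for the trade-off behind minimizing AET with checkpointing, here is a minimal Python sketch under a deliberately simplified model. It is not the authors' formula. The assumptions are: errors are independent, an error is detected at the checkpoint that ends a segment, a failed segment is re-executed in full (including its checkpoint overhead), and rollback itself costs nothing. The names `expected_time_rrc`, `best_checkpoint_count`, `T`, `P`, `n`, and `tau` are hypothetical.

```python
def expected_time_rrc(T, P, n, tau):
    """Expected execution time with n equal segments under the simplified model.

    T   : fault-free job length
    P   : probability that an error hits the job during time T (0 <= P < 1)
    n   : number of equal-length segments, each ended by a checkpoint
    tau : time overhead per checkpoint
    """
    if not 0 <= P < 1:
        raise ValueError("P must be in [0, 1)")
    p_ok = (1.0 - P) ** (1.0 / n)  # probability that one segment is error-free
    attempts = 1.0 / p_ok          # mean number of tries (geometric distribution)
    return n * (T / n + tau) * attempts


def best_checkpoint_count(T, P, tau, n_max=100):
    """Segment count in 1..n_max that minimizes the expected time (assumed search bound)."""
    return min(range(1, n_max + 1), key=lambda n: expected_time_rrc(T, P, n, tau))
```

More checkpoints shorten the work lost per error but add overhead, so the expected time has a minimum at some intermediate segment count. When `P` is zero, a single segment is best under this model.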


2016 ◽  
Vol 25 (06) ◽  
pp. 1650065 ◽  
Author(s):  
Saleh Fakhrali ◽  
Hamid R. Zarandi

Reliability is one of the main concerns in the design of networks-on-chip (NoCs) due to the use of deep submicron technologies in the fabrication of such products. This paper presents a new fault-tolerant routing algorithm for NoCs called double stairs. The double stairs routing algorithm is a low-overhead scheme that can tolerate faults. It makes a redundant copy of each packet at the source node and routes the original and redundant packets using a new partially adaptive routing algorithm. The method is evaluated for various packet injection rates and fault rates. Experimental results show that the proposed routing algorithm offers the best trade-off between performance and fault tolerance compared to other routing algorithms, namely flooding, XYX and probabilistic flooding.


1999 ◽  
Vol 121 (3) ◽  
pp. 504-508 ◽  
Author(s):  
E. H. Maslen ◽  
C. K. Sortore ◽  
G. T. Gillies ◽  
R. D. Williams ◽  
S. J. Fedigan ◽  
...  

A fault tolerant magnetic bearing system was developed and demonstrated on a large flexible-rotor test rig. The bearing system comprises a high speed, fault tolerant digital controller, three high capacity radial magnetic bearings, one thrust bearing, conventional variable reluctance position sensors, and an array of commercial switching amplifiers. Controller fault tolerance is achieved through a very high speed voting mechanism which implements triple modular redundancy with a powered spare CPU, thereby permitting failure of up to three CPU modules without system failure. Amplifier/cabling/coil fault tolerance is achieved by using a separate power amplifier for each bearing coil and permitting amplifier reconfiguration by the controller upon detection of faults. This allows hot replacement of failed amplifiers without any system degradation and without providing any excess amplifier kVA capacity over the nominal system requirement. Implemented on a large (2440 mm in length) flexible rotor, the system shows excellent rejection of faults including the failure of three CPUs as well as failure of two adjacent amplifiers (or cabling) controlling an entire stator quadrant.


Electronics ◽  
2020 ◽  
Vol 9 (11) ◽  
pp. 1783 ◽  
Author(s):  
Ayaz Hussain ◽  
Muhammad Irfan ◽  
Naveed Khan Baloch ◽  
Umar Draz ◽  
Tariq Ali ◽  
...  

The router plays an important role in communication among different processing cores in on-chip networks. On one hand, technology scaling has enabled designers to integrate multiple processing components on a single chip; on the other hand, it has become a source of faults. A generic router consists of buffers and pipeline stages. A single fault may degrade performance or cause the whole chip to stop working. Therefore, it is necessary to provide permanent fault tolerance to all the components of the router. In this paper, we propose a mechanism that can tolerate permanent faults that occur in the router. We exploit the fault-tolerant techniques of resource sharing and pairing between components for the input port unit and routing computation (RC) unit, resource borrowing for the virtual channel allocator (VA), and multiple paths for the switch allocator (SA) and crossbar (XB). The experimental results and analysis show that the proposed mechanism enhances the reliability of the router architecture against permanent faults at the cost of 29% area overhead. The proposed router architecture achieves a Silicon Protection Factor (SPF) of 24.8, the highest among the state-of-the-art fault-tolerant architectures compared. It incurs only a minimal increase in latency for SPLASH2 and PARSEC benchmark traffic compared to the baseline router.


Author(s):  
Aleksandr Gruzlikov ◽  
Nikolai Kolesov ◽  
Dmitri Kostygov ◽  
Marina Tolmacheva

Introduction: The majority of real complex systems are designed with respect to fault tolerance requirements. However, all the known approaches are intended only to increase reliability. Purpose: An approach for designing fault-tolerant systems on a chip, aimed not only at increasing the reliability, but also at reducing the energy consumed by the system. Results: A two-stage approach to the design of fault-tolerant multicore systems-on-chip (MCSoCs) is proposed. At the first stage, an energy-efficient architecture of the designed system is formed. For each core used in the system, the optimal number of additional cores is determined within the framework of the imposed restrictions. The optimality criterion is the minimum power consumed by the system. The algorithm proposed for the formation of an energy-efficient architecture is based on the dependence of the power consumed in the system on the values of the supply voltage and the clock frequency. At the second stage, a procedure for diagnosing and repairing the system is developed which uses the principles of system-level diagnosis, involving mutual checks between the system cores. This procedure allows you to decentralize the process of diagnosing and restoring the system after a failure. Additionally, the article examines the organization of the communication subsystem based on shared memory. The study is based on a simulation conducted in order to estimate the time for making a decision about a failure in systems such as a lattice, torus and hypercube. Practical relevance: The proposed approach allows a system to provide the necessary values for its two most important characteristics: fault tolerance and energy efficiency. At the same time, decentralization is ensured when making decisions about a failure and restoration. As a result, the system becomes more reliable.


Author(s):  
Camille Coti

This chapter gives an overview of techniques used to tolerate failures in high-performance distributed applications. We describe basic replication techniques, automatic rollback recovery and application-based fault tolerance. We present the challenges raised specifically by distributed, high performance computing and the performance overhead the fault tolerance mechanisms are likely to cost. Last, we give an example of a fault-tolerant algorithm that exploits specific properties of a recent algorithm.


2012 ◽  
Vol 21 (01) ◽  
pp. 1250004 ◽  
Author(s):  
LINJIE ZHU ◽  
TONGQUAN WEI ◽  
XIAODAO CHEN ◽  
YONGHE GUO ◽  
SHIYAN HU

Fault tolerance and energy have become important design issues in multiprocessor systems-on-chip (SoCs) with technology scaling and the proliferation of battery-powered multiprocessor SoCs. This paper proposes an energy-efficient fault-tolerant task allocation scheme for multiprocessor SoCs in real-time energy harvesting systems. The proposed fault-tolerance scheme is based on the principle of primary/backup task scheduling and can tolerate at most one transient fault. Extensive simulation experiments show that the proposed scheme can save up to 30% of the energy consumption and reduce the miss ratio to about 8% in the presence of faults.


2019 ◽  
pp. 258-264
Author(s):  
Sergey F. Tyurin

So-called Fault-Tolerant Systems (FTS) use structural, temporal, functional, or information redundancy to achieve high reliability. For example, Radiation Hardened by Design (RHBD) systems are Fault-Tolerant Systems. A passive FTS masks faults by means of a very large structural redundancy (modular redundancy). The Triple Modular Redundancy (TMR) method has more than 300% redundancy. The Quad Redundancy (QR) method has more than 400% redundancy. QR of CMOS transistors (transistor-level redundancy) is the most effective form of QR. In this case, no voting element is needed. However, this significantly increases the time delay. In addition, it is necessary to ensure compliance with the Mead-Conway restrictions. QR, in contrast to TMR, raises the problem of checking the redundant structure. The author proposes a QR checking method based on a selection of substrates of the CMOS transistors. The power lines of the transistor substrates are separated, which ensures the disconnection of part of the reserve. A simulation confirms the feasibility of the proposed method.
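
As a side note on what the structural redundancy of TMR buys, the standard textbook reliability expression for TMR can be checked with a short Python sketch. It assumes three identical modules that fail independently, each with reliability `r`, and an ideal voter that never fails; the function name `tmr_reliability` is hypothetical, and the voter-free transistor-level QR scheme discussed in the abstract is not modelled.

```python
def tmr_reliability(r):
    """Probability that at least 2 of 3 independent modules work, given an ideal voter.

    P(exactly 3 work) + P(exactly 2 work) = r**3 + 3 * r**2 * (1 - r)
                                          = 3 * r**2 - 2 * r**3
    """
    return 3 * r**2 - 2 * r**3
```

Under these assumptions TMR improves on a single module only when `r` is above 0.5; below that threshold the redundant arrangement is less reliable than one module alone.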


Author(s):  
B. Naresh Kumar Reddy ◽  
Vasantha M.H ◽  
Nithin Kumar Y.B.

A Network on Chip (NoC) is a communication subsystem that contains the logic for sending and receiving data between different sources in a single IC, and it uses VLSI technology to be as compact as possible. However, the increasing probability of failures in NoCs, caused by the large-scale integration of components, has been raising concern among researchers. In particular, the issues of fault tolerance and the increasing length of the global wires of the NoC have to be addressed for on-chip and multi-core architectures. This survey presents a perspective on existing NoC fault-tolerant algorithms and a corresponding distributed fault analysis strategy that helps in observing the fault status of individual NoC components and their adjacent communication links. The analysis of fault-tolerant networks subjected to dynamic workloads for large-scale applications is equally important. This paper focuses mainly on fault-tolerant NoC strategies, summarizing over thirty research papers.

