Non-intrusive System Level Fault-Tolerance

Abstract: Power-constrained fault tolerance has emerged as a key challenge in deep sub-micron technology. Multi-/many-core chips can support different hardening modes based on variants of redundant multithreading (RMT). In dark silicon chips, the maximum number of cores that can be powered on simultaneously (at the full performance level) is constrained by the thermal design power (TDP). The remaining cores have to be power-gated (i.e., stay “dark”) or operate at a lower performance level. It has been predicted that about 25–50% of a many-core chip can potentially be “dark.” In this chapter, a system-level power–reliability management technique is presented. The technique jointly considers multiple hardening modes at the software and hardware levels, each offering distinct power, reliability, and performance properties. A framework for system-level optimization is also introduced; it addresses different power–reliability–performance management problems for many-core processors, depending on the target system and user constraints.
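
The trade-off described in this abstract can be illustrated with a small sketch. It is not the chapter's technique: the mode names, power figures, and reliability values below are hypothetical, and an exhaustive search stands in for whatever optimization the authors use. It only shows the shape of the problem, which is to pick one hardening mode per task so that joint reliability is maximized while total power stays within the TDP.

```python
from itertools import product
from math import prod

# Hypothetical hardening modes (assumed values, not taken from the chapter):
# (name, power in watts, probability of correct execution)
MODES = [
    ("unprotected", 2.0, 0.90),
    ("software_rmt", 3.0, 0.97),
    ("hardware_dmr", 4.5, 0.999),
]

def best_assignment(num_tasks, tdp_watts):
    """Pick one mode per task, maximizing joint reliability under a power cap.

    Returns (mode names, joint reliability, total power), or None when even
    the cheapest assignment exceeds the budget.
    """
    best = None
    for combo in product(MODES, repeat=num_tasks):
        power = sum(mode[1] for mode in combo)
        if power > tdp_watts:
            continue  # violates the TDP constraint
        reliability = prod(mode[2] for mode in combo)
        if best is None or reliability > best[1]:
            best = ([mode[0] for mode in combo], reliability, power)
    return best

if __name__ == "__main__":
    print(best_assignment(2, 7.0))
```

Exhaustive search grows exponentially with the number of tasks, so it is only suitable for illustrating the constraint; a realistic many-core setting would need a heuristic or a solver.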

Incorporating Fault-Tolerance Awareness into System-Level Modeling and Simulation

10.1109/ftxs54580.2021.00008 ◽

2021 ◽

Author(s):

Trokon Johnson ◽

Herman Lam

Keyword(s):

Fault Tolerance ◽

Modeling And Simulation ◽

System Level ◽

System Level Modeling

Non-Intrusive System-Level Fault Tolerance for an Electronic Throttle Controller

International Conference on Networking, International Conference on Systems and International Conference on Mobile Communications and Learning Technologies (ICNICONSMCL'06) ◽

10.1109/icniconsmcl.2006.156 ◽

2006 ◽

Cited By ~ 1

Author(s):

Y. Boussemart ◽

S. Gorelov ◽

M. Ouimet ◽

K. Lundqvist

Keyword(s):

Fault Tolerance ◽

System Level ◽

Electronic Throttle

SuperGlue: IDL-Based, System-Level Fault Tolerance for Embedded Systems

2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) ◽

10.1109/dsn.2016.29 ◽

2016 ◽

Cited By ~ 2

Author(s):

Jiguo Song ◽

Gedare Bloom ◽

Gabriel Parmer

Keyword(s):

Fault Tolerance ◽

Embedded Systems ◽

System Level

System level energy aware fault tolerance approach for real time system

TENCON 2008 - 2008 IEEE Region 10 Conference ◽

10.1109/tencon.2008.4766854 ◽

2008 ◽

Cited By ~ 1

Author(s):

Smriti Agrawal ◽

Rama Shankar Yadav ◽

Ranvijay

Keyword(s):

Fault Tolerance ◽

Real Time ◽

System Level ◽

Level Energy ◽

Energy Aware ◽

Time System ◽

Real Time System ◽

Tolerance Approach

A novel approach to system-level fault tolerance in hypercube multiprocessors

Proceedings of the third conference on Hypercube concurrent computers and applications: Architecture, software, computer systems, and general issues ◽

10.1145/62297.62330 ◽

1988 ◽

Cited By ~ 5

Author(s):

P. Banerjee ◽

C. B. Stunkel

Keyword(s):

Fault Tolerance ◽

System Level ◽

Novel Approach

Fault-tolerant and energy-efficient MCSoC for information processing and control

Information and Control Systems ◽

10.31799/1684-8853-2019-4-9-18 ◽

2019 ◽

pp. 9-18

Author(s):

Aleksandr Gruzlikov ◽

Nikolai Kolesov ◽

Dmitri Kostygov ◽

Marina Tolmacheva

Keyword(s):

Fault Tolerance ◽

Energy Efficient ◽

Fault Tolerant ◽

Optimal Number ◽

System Level ◽

Clock Frequency ◽

Systems On Chip ◽

Energy Efficient Architecture ◽

On Chip ◽

And Control

Introduction: The majority of real complex systems are designed with respect to fault tolerance requirements. However, all the known approaches are intended only to increase reliability. Purpose: To develop an approach for designing fault-tolerant systems on a chip, aimed not only at increasing reliability but also at reducing the energy consumed by the system. Results: A two-stage approach to the design of fault-tolerant multicore systems-on-chip (MCSoCs) is proposed. At the first stage, an energy-efficient architecture of the designed system is formed. For each core used in the system, the optimal number of additional cores is determined within the framework of the imposed restrictions. The optimality criterion is the minimum power consumed by the system. The algorithm proposed for the formation of an energy-efficient architecture is based on the dependence of the power consumed in the system on the values of the supply voltage and the clock frequency. At the second stage, a procedure for diagnosing and repairing the system is developed which uses the principles of system-level diagnosis, involving mutual checks between the system cores. This procedure decentralizes the process of diagnosing and restoring the system after a failure. Additionally, the article examines the organization of the communication subsystem based on shared memory. The study is based on a simulation conducted in order to estimate the time for making a decision about a failure in systems such as a lattice, torus and hypercube. Practical relevance: The proposed approach allows a system to provide the necessary values for its two most important characteristics: fault tolerance and energy efficiency. At the same time, decentralization is ensured when making decisions about a failure and restoration. As a result, the system becomes more reliable.
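
The mutual-check idea in the second stage can be sketched as follows. This is not the article's procedure: the majority-vote rule, the function names, and the behavior of a faulty tester are all assumptions made for illustration. Each core is tested by its neighbors in a hypercube, and a core is declared faulty when a strict majority of its testers report a failed check.

```python
def hypercube(dim):
    """Neighbor map of a dim-dimensional hypercube: flip one address bit per link."""
    return {node: [node ^ (1 << bit) for bit in range(dim)]
            for node in range(1 << dim)}

def diagnose(neighbors, test):
    """Return the cores that a strict majority of their neighbors report as failed.

    test(tester, target) is a hypothetical callback returning True when the
    tester reports that the target passed its check.
    """
    faulty = set()
    for core, testers in neighbors.items():
        fail_votes = sum(1 for tester in testers if not test(tester, core))
        if 2 * fail_votes > len(testers):
            faulty.add(core)
    return faulty

if __name__ == "__main__":
    truly_faulty = {5}  # assumed ground truth for the demo

    def test(tester, target):
        if tester in truly_faulty:
            return True  # assumption: a faulty tester reports "pass" for everyone
        return target not in truly_faulty

    print(diagnose(hypercube(3), test))
```

Because every verdict depends only on a core's immediate neighbors, no central arbiter is needed, which matches the decentralization the abstract emphasizes. A majority vote is only reliable while fewer than half of any core's testers are themselves faulty.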

Incorporating Fault-Tolerance Awareness into System-Level Modeling and Simulation

10.1109/cluster48925.2021.00080 ◽

2021 ◽

Author(s):

Trokon Johnson ◽

Herman Lam

Keyword(s):

Fault Tolerance ◽

Modeling And Simulation ◽

System Level ◽

System Level Modeling
