scholarly journals Non-intrusive System Level Fault-Tolerance

Author(s):  
Kristina Lundqvist ◽  
Jayakanth Srinivasan ◽  
Sébastien Gorelov
Keyword(s):  
Author(s):  
Mohammad Salehi ◽  
Florian Kriebel ◽  
Semeen Rehman ◽  
Muhammad Shafique

AbstractPower-constrained fault-tolerance has emerged as a key challenge in the deep sub-micron technology. Multi-/many-core chips can support different hardening modes considering variants of redundant multithreading (RMT). In dark silicon chips, the maximum number of cores that can simultaneously be powered-on (at the full performance level) is constrained by the thermal design power (TDP). The rest of the cores have to be power-gated (i.e., stay “dark”), or the cores have to operate at a lower performance level. It has been predicted that about 25–50% of a many-core chip can potentially be “dark.” In this chapter, a system-level power–reliability management technique is presented. The technique jointly considers multiple hardening modes at the software and hardware levels, each offering distinct power, reliability, and performance properties. Also, a framework for the system-level optimization is introduced which considers different power–reliability–performance management problems for many-core processors depending upon the target system and user constraints.


Author(s):  
Aleksandr Gruzlikov ◽  
Nikolai Kolesov ◽  
Dmitri Kostygov ◽  
Marina Tolmacheva

Introduction: The majority of real complex systems are designed with respect to fault tolerance requirements. However, all theknown approaches are intended only to increase reliability. Purpose: An approach for designing fault-tolerant systems on a chip, aimednot only at increasing the reliability, but also at reducing the energy consumed by the system. Results: A two-stage approach to thedesign of fault-tolerant multicore systems-on-chip (MCSoCs) is proposed. At the first stage, an energy-efficient architecture of thedesigned system is formed. For each core used in the system, the optimal number of additional cores is determined within the frameworkof the imposed restrictions. The optimality criterion is the minimum power consumed by the system. The algorithm proposed for theformation of an energy-efficient architecture is based on the dependence of the power consumed in the system on the values of the supplyvoltage and the clock frequency. At the second stage, a procedure for diagnosing and repairing the system is developed which uses theprinciples of system-level diagnosis, involving mutual checks between the system cores. This procedure allows you to decentralize theprocess of diagnosing and restoring the system after a failure. Additionally, the article examines the organization of the communicationsubsystem based on shared memory. The study is based on a simulation conducted in order to estimate the time for making a decisionabout a failure in systems such as a lattice, torus and hypercube. Practical relevance: The proposed approach allows a system to providethe necessary values for its two most important characteristics: fault tolerance and energy efficiency. At the same time, decentralizationis ensured when making decisions about a failure and restoration. As a result, the system becomes more reliable.


Sign in / Sign up

Export Citation Format

Share Document