The goal of a fault-tolerant computer is to provide safety and liveness, despite the possibility of
faults.
- safe: never produce an incorrect user-visible result (hide the effects of a fault if one occurs)
- live: continue to make progress even in the presence of faults (safety alone does not guarantee the computer does anything useful)
- if liveness cannot be provided in every fault scenario, safety is still important
- e.g., for a faulty ATM, the bank would rather the machine shut down than dispense an incorrect amount of cash
1.1 Goals of this Book
Goals:
- explain the key ideas in fault-tolerant computer architecture
- present the current state of the art in academia and industry
Fault-tolerant computer architecture is not a new field.
- medical equipment, avionics, and car electronics
- triple modular redundancy (TMR), and the more general N-modular redundancy (NMR), were first proposed by von Neumann (a minimal voting sketch follows this list)
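A minimal sketch of the TMR idea, assuming the three redundant modules are modeled as three integer results and the voter as a bitwise majority function; real TMR voters are hardware circuits, and the values below are made up:

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    # Bitwise majority: each output bit is 1 iff at least two inputs have a 1 there.
    return (a & b) | (a & c) | (b & c)

# A single faulty copy is out-voted by the two correct copies.
correct = 0b1011
faulty = correct ^ 0b0100                        # one module suffers a bit flip
print(bin(tmr_vote(correct, faulty, correct)))   # 0b1011
```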

1.2 Faults, Errors, and Failures
Fault: a physical flaw, such as a broken wire or a transistor with a gate oxide that has broken down.
- A fault can manifest itself as an error, such as a bit that is a zero instead of a one, or the effect of the fault can be masked and not manifest itself as any error.
- Similarly, an error can be masked or it can result in a user-visible incorrect behavior called a failure.
- Failures include incorrect computations and system hangs.
1.2.1 Masking
Masking can occur at several levels, so that faults do not become errors and errors do not become failures.
- Logical masking. The effect of an error may be logically masked. For example, if a two-input AND gate has an error on one input and a zero on its other input, the error cannot propagate and cause a failure (see the sketch after this list).
- e.g., an error in the branch predictor only causes a misprediction, which the normal misprediction recovery logic already corrects
- Architectural masking. The effect of an error may never propagate to architectural state and thus never become a user-visible failure. For example, an error in the destination register specifier of a NOP instruction will have no architectural impact.
- The architectural vulnerability factor (AVF) [23] is a metric for quantifying what fraction of errors in a given component are architecturally masked (see Section 1.5.6).
- Application masking. Even if an error does impact architectural state and thus becomes a user-visible failure, the failure might never be observed by the application software running on the processor. For example, an error that changes the value at a location in memory is user-visible; however, if the application never accesses that location or writes over the erroneous value before reading it again, then the failure is masked.
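A tiny sketch of logical masking, using a two-input AND gate modeled as a Python function (the function and values are illustrative, not from the book):

```python
# Logical masking: an error on input `a` cannot propagate when `b` is 0,
# because 0 AND anything is 0. When b is 1, the same error does propagate.

def and_gate(a: int, b: int) -> int:
    return a & b

print(and_gate(1, 0) == and_gate(1 ^ 1, 0))   # True: error masked (b = 0)
print(and_gate(1, 1) == and_gate(1 ^ 1, 1))   # False: error propagates (b = 1)
```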
1.2.2 Duration of Faults and Errors
Faults and errors can be transient, permanent, or intermittent in nature.
- Transient. A transient fault occurs once and then does not persist. An error due to a transient fault is often referred to as a soft error or single event upset.
- Permanent. A permanent fault, which is often called a hard fault, occurs at some point in time, perhaps even introduced during chip fabrication, and persists from that time onward.
- Intermittent. An intermittent fault occurs repeatedly but not continuously in the same place in the processor.
Classifying faults and errors by duration is useful because each duration class calls for different fault-tolerance techniques.
1.2.3 Underlying Physical Phenomena
Transient phenomena:
- cosmic radiation
- high-energy particles that are produced when cosmic rays impact the atmosphere
- alpha particles
- produced by the natural decay of radioactive isotopes, from metal in the chip packaging itself
- high-energy particles or alpha particles can dislodge a significant amount of charge (electrons and holes) within the semiconductor material.
- If this charge exceeds the critical charge, often denoted Qcrit, of an SRAM or DRAM cell or p–n junction, it can flip the value
- electromagnetic interference (EMI)
- from outside sources, or in chip itself (cross-talk)
- supply voltage droops due to large, quick changes in current draw
- dI/dt problem
Permanent phenomena:
- Physical wear-out
- wires: electromigration
- transistors: gate oxide breakdown
- thermal cycling
- mechanical stress
- all of the above can be modeled by the RAMP model of Srinivasan et al.
- lifetime reliability management
- a processor can use the RAMP model to estimate its expected lifetime and adjust itself accordingly
- for example, by reducing its voltage and frequency, to either extend its lifetime (at the expense of performance) or improve its performance (at the expense of lifetime reliability); see the sketch at the end of this subsection
- Fabrication defects
- inherent defects during fabrication
- compared with physical wear-out, fabrication defects occur at time zero and are much more likely to occur "simultaneously"
- Design bugs
- e.g., the floating-point division (FDIV) bug in the Intel Pentium processor
Intermittent phenomena:
- loose connection
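A hedged sketch of lifetime reliability management in the spirit of RAMP, assuming a made-up wear model (wear rate doubling per 10 °C above nominal) rather than the actual RAMP equations:

```python
# Lifetime reliability management sketch: estimate how much lifetime has been
# "consumed" at recent operating temperatures and throttle when wear is
# accruing faster than real time. All constants are illustrative assumptions.

NOMINAL_TEMP_C = 70.0

def relative_wear_rate(temp_c: float) -> float:
    # Assumption: wear accrues twice as fast for every 10 C above nominal.
    return 2.0 ** ((temp_c - NOMINAL_TEMP_C) / 10.0)

def consumed_lifetime(history):
    """Sum wear over (years, temp_c) intervals, in 'nominal years' of wear."""
    return sum(years * relative_wear_rate(temp) for years, temp in history)

def policy(history) -> str:
    elapsed = sum(years for years, _ in history)
    if consumed_lifetime(history) > elapsed:
        return "reduce voltage/frequency"   # trade performance for lifetime
    return "run at full speed"              # trade lifetime headroom for performance

# One year at nominal temperature plus one hot year consumes ~5 nominal years.
history = [(1.0, 70.0), (1.0, 90.0)]
print(policy(history))   # reduce voltage/frequency
```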
1.3 Trends Leading to Increased Fault Rates
1.3.1 Smaller Devices and Hotter Chips
The dimensions of transistors and wires directly affect the likelihood of faults, both transient and
permanent. Furthermore, device dimensions impact chip temperature, and temperature has a strong impact on the likelihood of permanent faults.
- transient fault
- smaller devices have smaller critical charges, Qcrit
- increasing the probability that a high-energy particle disrupts the charge stored on a device
- permanent fault
- dimensions of fabricated devices and wires may stray from their expected values
- given smaller dimensions and greater process variability, there is an increasing likelihood of wires that are too small to support the required current density and transistor gate oxides that are too thin to withstand the voltages applied across them.
- temperature
- increase in power consumption per unit area translates into greater temperatures, and increasing temperatures greatly exacerbate several physical phenomena that cause permanent faults.
- as the temperature increases, the leakage current increases, which further raises power and temperature (a positive feedback loop)
1.3.2 More Devices per Processor
With more transistors, as well as more wires connecting them, there are more opportunities for
faults both in the field and during fabrication.
1.3.3 More Complicated Designs
The result of increased processor complexity is a greater likelihood of design bugs eluding the
validation process and escaping into the field.
1.4 Error Models
An error model is a simple, tractable tool for analyzing a system’s fault tolerance. An example of an error model is the well-known “stuck-at” model, which models the impact of faults that cause a circuit value to be stuck at either 0 or 1.
- There are many underlying physical phenomena that can be represented with the stuck-at model, including some short and open circuits.
- the benefit of using error models, instead of reasoning about every possible physical phenomenon, is that architects can design systems to tolerate all errors within a given set of error models
1.4.1 Error Type
The stuck-at model, described above, is a low-level error model (a fault-injection sketch using it appears at the end of this subsection).
One low-level error model, similar to stuck-at errors, is bridging errors (also known as coupling errors).
- a given circuit value is bridged or coupled to another circuit value.
A higher-level error model is the fail-stop error model.
- a component, such as a processor core or network switch, ceases to perform any function.
- For example, chipkill memory is designed to tolerate fail-stop errors in DRAM chips regardless of the underlying physical fault that leads to the fail-stop behavior.
A relatively new error model is the delay error model
- a circuit or component produces the correct value but at a time that is later than expected
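A minimal sketch of applying an error model for fault injection: a tiny combinational circuit modeled as a Python function, with single stuck-at errors injected on its inputs (the circuit and the approach are illustrative, not from the book):

```python
from itertools import product

def circuit(a: int, b: int, c: int) -> int:
    # Example circuit: (a AND b) OR c
    return (a & b) | c

def with_stuck_at(inputs, position, stuck_value):
    """Return a copy of the inputs with one signal stuck at 0 or 1."""
    inputs = list(inputs)
    inputs[position] = stuck_value
    return tuple(inputs)

# For every input vector and every single stuck-at error, check whether the
# error propagates to the output or is logically masked.
for vector in product((0, 1), repeat=3):
    for pos, val in product(range(3), (0, 1)):
        golden = circuit(*vector)
        faulty = circuit(*with_stuck_at(vector, pos, val))
        if faulty != golden:
            print(f"inputs={vector}: stuck-at-{val} on input {pos} propagates")
```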
1.4.2 Error Duration
Error models also specify the duration of errors: transient, intermittent, or permanent.
1.4.3 Number of Simultaneous Errors
Most error models consider only a single error at a time.
Multiple-error scenarios are not only rare, but they are also far more difficult to reason about. Often, error models that permit multiple errors force architects to consider “offsetting errors”
- the effects of one error are hidden from the error detection mechanism by another error
- e.g., two bit flips in a word protected by parity offset each other and go undetected (see the sketch below)
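A sketch of offsetting errors under a single-error detection scheme, assuming a simple even-parity code over one data word (the word and bit positions are made up):

```python
def parity_bit(word: int) -> int:
    return bin(word).count("1") % 2           # even parity over the data bits

def check(word: int, stored_parity: int) -> bool:
    return parity_bit(word) == stored_parity  # True means "no error detected"

data = 0b1010_1100
p = parity_bit(data)

one_flip = data ^ 0b0000_0100    # single bit flip
two_flips = data ^ 0b0011_0000   # two bit flips in the same word

print(check(one_flip, p))    # False: the single error is detected
print(check(two_flips, p))   # True:  the two errors offset each other
```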
There are three reasons to consider error models with multiple simultaneous errors.
- First, for mission-critical computers, even a vanishingly small probability of a multiple error must be considered. It is not acceptable for these computers to fail in the presence of even a highly unlikely event.
- Second, there are trends leading to an increasing number of faults. At some fault rate, the probability of multiple errors becomes nonnegligible and worth expending resources to tolerate, even for non-mission-critical computers.
- Third, the possibility of latent errors, errors that occur but are undetected and linger in the system, can lead to subsequent multiple-error scenarios.
1.5 Fault Tolerance Metrics
1.5.1 Availability
The availability of a system at time t is the probability that the system is operating correctly at time
t.
1.5.2 Reliability
The reliability of a system at time t is the probability that the system has been operating correctly
from time zero until time t.
1.5.3 Mean Time to Failure
Mean time to failure (MTTF) is often an appropriate and useful metric.
- MTTF is a mean, and mean values do not fully represent probability distributions
- e.g., for processors PA and PB with MTTFs of 10 and 12 years, respectively, PB might still suffer more failures in the first 3 years than PA (see the sketch below)
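A sketch of why MTTF alone can mislead, assuming made-up failure-time distributions for PA and PB: PA is exponential with MTTF 10 years, while PB is Weibull with shape 0.5 (heavy infant mortality) and MTTF 12 years.

```python
import math

def reliability_exponential(t, mttf):
    return math.exp(-t / mttf)

def reliability_weibull(t, shape, scale):
    return math.exp(-((t / scale) ** shape))

# Weibull mean = scale * Gamma(1 + 1/shape); with shape = 0.5 that is 2*scale,
# so scale = 6 gives an MTTF of 12 years.
t = 3.0
print(f"PA survives {t} yr with prob {reliability_exponential(t, 10.0):.2f}")  # ~0.74
print(f"PB survives {t} yr with prob {reliability_weibull(t, 0.5, 6.0):.2f}")  # ~0.49
```

Despite its larger MTTF, PB is more likely to have failed by year 3 under these assumed distributions.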

1.5.4 Mean Time Between Failures
Mean time between failures (MTBF) is similar to MTTF, but it also considers the time to repair.
MTBF is the MTTF plus the mean time to repair (MTTR).

1.5.5 Failures in Time
The failures in time (FIT) rate of a component or a system is the number of failures it incurs over
one billion (10^9) hours, and it is inversely proportional to MTTF.
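A quick numerical sketch of the relationships among MTTF, MTTR, MTBF, and FIT; the example values are made up:

```python
# FIT = failures per 10^9 device-hours, so FIT = 1e9 / MTTF when MTTF is in hours.
HOURS_PER_YEAR = 24 * 365

mttf_hours = 30 * HOURS_PER_YEAR      # assume a 30-year MTTF
mttr_hours = 2.0                      # assume 2 hours mean time to repair

mtbf_hours = mttf_hours + mttr_hours  # MTBF = MTTF + MTTR
fit = 1e9 / mttf_hours                # failures per billion hours

print(f"MTBF = {mtbf_hours:.1f} hours")
print(f"FIT  = {fit:.1f} failures per 10^9 hours")   # ~3805 FIT
```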
1.5.6 Architectural Vulnerability Factor (AVF)
The idea behind AVF is to classify microprocessor state as either required for architecturally correct execution (ACE state) or not (un-ACE state).
- For example, the program counter (PC) is almost always ACE state because a corruption of the PC almost always causes a deviation from architecturally correct execution.
- The state of the branch predictor is always un-ACE because an error there can at worst cause a misprediction, whose effects are squashed and never become architecturally visible.
- there are many structures that have state that is ACE some fraction of the time
The AVF of a structure is computed as the average number of ACE bits in the structure
in a given cycle divided by the total number of bits in the structure. Thus, if many ACE bits reside
in a structure for a long time, that structure is highly vulnerable.
AVF can be used to scale a raw FIT rate into an effective FIT rate: the effective FIT rate of a component is its raw FIT rate multiplied by its AVF (see the sketch below).
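A minimal sketch of computing AVF and an effective FIT rate, assuming a per-cycle count of ACE bits is available for a structure (e.g., from ACE analysis in a simulator); all numbers are made up for illustration:

```python
ace_bits_per_cycle = [40, 52, 36, 48, 44]   # ACE bits observed each cycle
total_bits = 128                             # size of the structure in bits

# AVF = average ACE bits per cycle / total bits in the structure.
avf = sum(ace_bits_per_cycle) / (len(ace_bits_per_cycle) * total_bits)

raw_fit = 500.0                  # assumed raw FIT rate of the structure
effective_fit = raw_fit * avf    # effective FIT = raw FIT x AVF

print(f"AVF = {avf:.2f}")                      # ~0.34
print(f"effective FIT = {effective_fit:.1f}")
```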
1.6 The Rest of This Book
Fault tolerance consists of four aspects:
- Error detection (Chapter 2): A processor cannot tolerate a fault if it is unaware of it. Thus, error detection is the most important aspect of fault tolerance, and we devote the largest fraction of the book to this topic. Error detection can be performed at various granularities. For example, a localized error detection mechanism might check the correctness of an adder’s output, whereas a global or end-to-end error detection mechanism might check the correctness of an entire core.
- Error recovery (Chapter 3): When an error is detected, the processor must take action to mask its effects from the software. A key to error recovery is not making any state visible to the software until this state has been checked by the error detection mechanisms. A common approach to error recovery is for a processor to take periodic checkpoints of its architectural state and, upon error detection, reload into the processor’s state a checkpoint taken before the error occurred (a minimal checkpoint/rollback sketch follows this list).
- Fault diagnosis (Chapter 4): Diagnosis is the process of identifying the fault that caused an error. For transient faults, diagnosis is generally unnecessary because the processor is not going to take any action to repair the fault. However, for permanent faults, it is often desirable to determine that the fault is permanent and then to determine its location. Knowing the location of a permanent fault enables a self-repair scheme to deconfigure the faulty component. If an error detection mechanism is localized, then it also provides diagnosis, but an end-to-end error detection mechanism provides little insight into what caused the error. If diagnosis is desired in a processor that uses an end-to-end error detection mechanism, then the architect must add a diagnosis mechanism.
- Self-repair (Chapter 5): If a processor diagnoses a permanent fault, it is desirable to repair or reconfigure the processor. Self-repair may involve avoiding further use of the faulty component or reconfiguring the processor to use a spare component.
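A hedged sketch of checkpoint-based error recovery: periodically snapshot the architectural state, and on a detected error roll back to the most recent checkpoint taken before the error. The state representation and the error-detection hook are illustrative assumptions, not a real design.

```python
import copy

class Core:
    def __init__(self):
        self.arch_state = {"pc": 0, "regs": [0] * 4}
        self.checkpoints = []

    def take_checkpoint(self):
        # Save a copy of the architectural state (registers, PC, ...).
        self.checkpoints.append(copy.deepcopy(self.arch_state))

    def execute_chunk(self, updates):
        # Apply a chunk of work; in a real design this state would only become
        # visible to software after the error detector has checked it.
        self.arch_state.update(updates)

    def recover(self):
        # On a detected error, restore the most recent checkpoint and re-execute.
        self.arch_state = self.checkpoints.pop()

core = Core()
core.take_checkpoint()
core.execute_chunk({"pc": 100, "regs": [1, 2, 3, 4]})
# ... error detector flags an error in this chunk ...
core.recover()
print(core.arch_state)   # back to the pre-error checkpoint: pc=0, regs=[0,0,0,0]
```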
Related topics that are not covered in this book include:
- Mechanisms for reducing vulnerability to faults
- Schemes for tolerating CMOS process variability
- Design validation and verification
- Fault-tolerant I/O, including disks and network controllers (this book focuses on the processor and memory)
- Approaches for tolerating software bugs