The goal of a fault-tolerant computer is to provide safety and liveness, despite the possibility of
faults.
- safe: never produce an incorrect user-visible result (hide the effects of a fault if one occurs)
- live: continue to make progress even in the presence of faults (safety alone does not guarantee the computer does anything useful)
- if liveness cannot be provided in every fault scenario, safety is still important
- e.g., for a faulty ATM, the bank would rather the machine shut down than dispense an incorrect amount of cash
1.1 Goals of this Book
Goals:
- explain the key ideas in fault-tolerant computer architecture
- present the current state of the art in academia and industry
Fault-tolerant computer architecture is not a new field.
- medical equipment, avionics, and car electronics
- triple modular redundancy (TMR), and the more general N-modular redundancy (NMR), were first proposed by von Neumann (a minimal voting sketch follows this list)
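A minimal sketch of the TMR idea, assuming the three redundant modules are modeled as three integer results and the voter as a bitwise majority function; real TMR voters are hardware circuits, and the values below are made up:

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    # Bitwise majority: each output bit is 1 iff at least two inputs have a 1 there.
    return (a & b) | (a & c) | (b & c)

# A single faulty copy is out-voted by the two correct copies.
correct = 0b1011
faulty = correct ^ 0b0100                        # one module suffers a bit flip
print(bin(tmr_vote(correct, faulty, correct)))   # 0b1011
```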

1.2 Faults, Errors, and Failures
Fault: a physical flaw, such as a broken wire or a transistor with a gate oxide that has broken down.
- A fault can manifest itself as an error, such as a bit that is a zero instead of a one, or the effect of the fault can be masked and not manifest itself as any error.
- Similarly, an error can be masked or it can result in a user-visible incorrect behavior called a failure.
- Failures include incorrect computations and system hangs.
1.2.1 Masking
Masking can occur at several levels, so that faults do not become errors and errors do not become failures.
- Logical masking. The effect of an error may be logically masked. For example, if a two-input AND gate has an error on one input and a zero on its other input, the error cannot propagate and cause a failure (see the sketch after this list).
- e.g., an error in the branch predictor only causes a misprediction, which the normal misprediction recovery logic already corrects
- Architectural masking. The effect of an error may never propagate to architectural state and thus never become a user-visible failure. For example, an error in the destination register specifier of a NOP instruction will have no architectural impact.
- The architectural vulnerability factor (AVF) [23] is a metric for quantifying what fraction of errors in a given component are architecturally masked (see Section 1.5.6).
- Application masking. Even if an error does impact architectural state and thus becomes a user-visible failure, the failure might never be observed by the application software running on the processor. For example, an error that changes the value at a location in memory is user-visible; however, if the application never accesses that location or writes over the erroneous value before reading it again, then the failure is masked.
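A tiny sketch of logical masking, using a two-input AND gate modeled as a Python function (the function and values are illustrative, not from the book):

```python
# Logical masking: an error on input `a` cannot propagate when `b` is 0,
# because 0 AND anything is 0. When b is 1, the same error does propagate.

def and_gate(a: int, b: int) -> int:
    return a & b

print(and_gate(1, 0) == and_gate(1 ^ 1, 0))   # True: error masked (b = 0)
print(and_gate(1, 1) == and_gate(1 ^ 1, 1))   # False: error propagates (b = 1)
```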
1.2.2 Duration of Faults and Errors
Faults and errors can be transient, permanent, or intermittent in nature.
- Transient. A transient fault occurs once and then does not persist. An error due to a transient fault is often referred to as a soft error or single event upset.
- Permanent. A permanent fault, which is often called a hard fault, occurs at some point in time, perhaps even introduced during chip fabrication, and persists from that time onward.
- Intermittent. An intermittent fault occurs repeatedly but not continuously in the same place in the processor.
Classifying faults and errors by duration is useful because each duration class calls for different fault-tolerance techniques.
1.2.3 Underlying Physical Phenomena
Transient phenomena:
- cosmic radiation
- high-energy particles that are produced when cosmic rays impact the atmosphere
- alpha particles
- produced by the natural decay of radioactive isotopes, from metal in the chip packaging itself
- high-energy particles or alpha particles can dislodge a significant amount of charge (electrons and holes) within the semiconductor material.
- If this charge exceeds the critical charge, often denoted Qcrit, of an SRAM or DRAM cell or p–n junction, it can flip the value
- electromagnetic interference (EMI)
- from outside sources, or in chip itself (cross-talk)
- supply voltage droops due to large, quick changes in current draw
- dI/dt problem
Permanent phenomena:
- Physical wear-out
- wires: electromigration
- transistors: gate oxide breakdown
- thermal cycling
- mechanical stress
- all of the above can be modeled by the RAMP model of Srinivasan et al.
- lifetime reliability management
- a processor can use the RAMP model to estimate its expected lifetime and adjust itself accordingly
- for example, by reducing its voltage and frequency, to either extend its lifetime (at the expense of performance) or improve its performance (at the expense of lifetime reliability); see the sketch at the end of this subsection
- Fabrication defects
- inherent defects during fabrication
- compared with physical wear-out, fabrication defects occur at time zero and are much more likely to occur "simultaneously"
- Design bugs
- e.g., the floating-point division (FDIV) bug in the Intel Pentium processor
Intermittent phenomena:
- loose connection
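A hedged sketch of lifetime reliability management in the spirit of RAMP, assuming a made-up wear model (wear rate doubling per 10 °C above nominal) rather than the actual RAMP equations:

```python
# Lifetime reliability management sketch: estimate how much lifetime has been
# "consumed" at recent operating temperatures and throttle when wear is
# accruing faster than real time. All constants are illustrative assumptions.

NOMINAL_TEMP_C = 70.0

def relative_wear_rate(temp_c: float) -> float:
    # Assumption: wear accrues twice as fast for every 10 C above nominal.
    return 2.0 ** ((temp_c - NOMINAL_TEMP_C) / 10.0)

def consumed_lifetime(history):
    """Sum wear over (years, temp_c) intervals, in 'nominal years' of wear."""
    return sum(years * relative_wear_rate(temp) for years, temp in history)

def policy(history) -> str:
    elapsed = sum(years for years, _ in history)
    if consumed_lifetime(history) > elapsed:
        return "reduce voltage/frequency"   # trade performance for lifetime
    return "run at full speed"              # trade lifetime headroom for performance

# One year at nominal temperature plus one hot year consumes ~5 nominal years.
history = [(1.0, 70.0), (1.0, 90.0)]
print(policy(history))   # reduce voltage/frequency
```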
1.3 Trends Leading to Increased Fault Rates
1.3.1 Smaller Devices and Hotter Chips
The dimensions of transistors and wires directly affect the likelihood of faults, both transient and
permanent. Furthermore, device dimensions impact chip temperature, and temperature has a strong impact on the likelihood of permanent faults.
- transient fault
- smaller devices have smaller critical charges, Qcrit
- increasing the probability that a high-energy particle disrupts the charge stored on a device
- permanent fault
- dimensions of fabricated devices and wires may stray from their expected values
- given smaller dimensions and greater process variability, there is an increasing likelihood of wires that are too small to support the required current density and transistor gate oxides that are too thin to withstand the voltages applied across them.
- temperature
- increase in power consumption per unit area translates into greater temperatures, and increasing temperatures greatly exacerbate several physical phenomena that cause permanent faults.
- as the temperature increases, the leakage current increases, which further raises power and temperature (a positive feedback loop)
1.3.2 More Devices per Processor
With more transistors, as well as more wires connecting them, there are more opportunities for
faults both in the field and during fabrication.
1.3.3 More Complicated Designs
The result of increased processor complexity is a greater likelihood of design bugs eluding the
validation process and escaping into the field.
1.4 Error Models
An error model is a simple, tractable tool for analyzing a system’s fault tolerance. An example of an error model is the well-known “stuck-at” model, which models the impact of faults that cause a circuit value to be stuck at either 0 or 1.
- There are many underlying physical phenomena that can be represented with the stuck-at model, including some short and open circuits.
- the benefit of using error models, instead of reasoning about every possible physical phenomenon, is that architects can design systems to tolerate all errors within a given set of error models
1.4.1 Error Type
The stuck-at model, described above, is a low-level error model (a fault-injection sketch using it appears at the end of this subsection).
One low-level error model, similar to stuck-at errors, is bridging errors (also known as coupling errors).
- a given circuit value is bridged or coupled to another circuit value.
A higher-level error model is the fail-stop error model.
- a component, such as a processor core or network switch, ceases to perform any function.
- For example, chipkill memory is designed to tolerate fail-stop errors in DRAM chips regardless of the underlying physical fault that leads to the fail-stop behavior.
A relatively new error model is the delay error model
- a circuit or component produces the correct value but at a time that is later than expected
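A minimal sketch of applying an error model for fault injection: a tiny combinational circuit modeled as a Python function, with single stuck-at errors injected on its inputs (the circuit and the approach are illustrative, not from the book):

```python
from itertools import product

def circuit(a: int, b: int, c: int) -> int:
    # Example circuit: (a AND b) OR c
    return (a & b) | c

def with_stuck_at(inputs, position, stuck_value):
    """Return a copy of the inputs with one signal stuck at 0 or 1."""
    inputs = list(inputs)
    inputs[position] = stuck_value
    return tuple(inputs)

# For every input vector and every single stuck-at error, check whether the
# error propagates to the output or is logically masked.
for vector in product((0, 1), repeat=3):
    for pos, val in product(range(3), (0, 1)):
        golden = circuit(*vector)
        faulty = circuit(*with_stuck_at(vector, pos, val))
        if faulty != golden:
            print(f"inputs={vector}: stuck-at-{val} on input {pos} propagates")
```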
1.4.2 Error Duration
Error models also specify the duration of errors: transient, intermittent, or permanent.
1.4.3 Number of Simultaneous Errors
Most error models consider only a single error at a time.
Multiple-error scenarios are not only rare, but they are also far more difficult to reason about. Often, error models that permit multiple errors force architects to consider “offsetting errors”
- the effects of one error are hidden from the error detection mechanism by another error
- e.g., two bit flips in a word protected by parity offset each other and go undetected (see the sketch below)
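A sketch of offsetting errors under a single-error detection scheme, assuming a simple even-parity code over one data word (the word and bit positions are made up):

```python
def parity_bit(word: int) -> int:
    return bin(word).count("1") % 2           # even parity over the data bits

def check(word: int, stored_parity: int) -> bool:
    return parity_bit(word) == stored_parity  # True means "no error detected"

data = 0b1010_1100
p = parity_bit(data)

one_flip = data ^ 0b0000_0100    # single bit flip
two_flips = data ^ 0b0011_0000   # two bit flips in the same word

print(check(one_flip, p))    # False: the single error is detected
print(check(two_flips, p))   # True:  the two errors offset each other
```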
There are three reasons to consider error models with multiple simultaneous errors.
- First, for mission-critical computers, even a vanishingly small probability of a multiple error must be considered. It is not acceptable for these computers to fail in the presence of even a highly unlikely event.
- Second, there are trends leading to an increasing number of faults. At some fault rate, the probability of multiple errors becomes nonnegligible and worth expending resources to tolerate, even for non-mission-critical computers.
- Third, the possibility of latent errors, errors that occur but are undetected and linger in the system, can lead to subsequent multiple-error scenarios.
1.5 Fault Tolerance Metrics
1.5.1 Availability
The availability of a system at time t is the probability that the system is operating correctly at time
t.
1.5.2 Reliability
The reliability of a system at time t is the probability that the system has been operating correctly
from time zero until time t.
1.5.3 Mean Time to Failure
Mean time to failure (MTTF) is often an appropriate and useful metric.
- MTTF is a mean, and mean values do not fully represent probability distributions
- e.g., for processors PA and PB with MTTFs of 10 and 12 years, respectively, PB might still suffer more failures in the first 3 years than PA (see the sketch below)
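A sketch of why MTTF alone can mislead, assuming made-up failure-time distributions for PA and PB: PA is exponential with MTTF 10 years, while PB is Weibull with shape 0.5 (heavy infant mortality) and MTTF 12 years.

```python
import math

def reliability_exponential(t, mttf):
    return math.exp(-t / mttf)

def reliability_weibull(t, shape, scale):
    return math.exp(-((t / scale) ** shape))

# Weibull mean = scale * Gamma(1 + 1/shape); with shape = 0.5 that is 2*scale,
# so scale = 6 gives an MTTF of 12 years.
t = 3.0
print(f"PA survives {t} yr with prob {reliability_exponential(t, 10.0):.2f}")  # ~0.74
print(f"PB survives {t} yr with prob {reliability_weibull(t, 0.5, 6.0):.2f}")  # ~0.49
```

Despite its larger MTTF, PB is more likely to have failed by year 3 under these assumed distributions.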

1.5.4 Mean Time Between Failures
Mean time between failures (MTBF) is similar to MTTF, but it also considers the time to repair.
MTBF is the MTTF plus the mean time to repair (MTTR).

1.5.5 Failures in Time
The failures in time (FIT) rate of a component or a system is the number of failures it incurs over
one billion (10^9) hours, and it is inversely proportional to MTTF.
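A quick numerical sketch of the relationships among MTTF, MTTR, MTBF, and FIT; the example values are made up:

```python
# FIT = failures per 10^9 device-hours, so FIT = 1e9 / MTTF when MTTF is in hours.
HOURS_PER_YEAR = 24 * 365

mttf_hours = 30 * HOURS_PER_YEAR      # assume a 30-year MTTF
mttr_hours = 2.0                      # assume 2 hours mean time to repair

mtbf_hours = mttf_hours + mttr_hours  # MTBF = MTTF + MTTR
fit = 1e9 / mttf_hours                # failures per billion hours

print(f"MTBF = {mtbf_hours:.1f} hours")
print(f"FIT  = {fit:.1f} failures per 10^9 hours")   # ~3805 FIT
```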
1.5.6 Architectural Vulnerability Factor (AVF)
The idea behind AVF is to classify microprocessor state as either required for architecturally correct execution (ACE state) or not (un-ACE state).
- For example, the program counter (PC) is almost always ACE state because a corruption of the PC almost always causes a deviation from architecturally correct execution.
- The state of the branch predictor is always un-ACE because an error there can at worst cause a misprediction, whose effects are squashed and never become architecturally visible.
- there are many structures that have state that is ACE some fraction of the time
The AVF of a structure is computed as the average number of ACE bits in the structure
in a given cycle divided by the total number of bits in the structure. Thus, if many ACE bits reside
in a structure for a long time, that structure is highly vulnerable.
AVF can be used to scale a raw FIT rate into an effective FIT rate: the effective FIT rate of a component is its raw FIT rate multiplied by its AVF (see the sketch below).
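A minimal sketch of computing AVF and an effective FIT rate, assuming a per-cycle count of ACE bits is available for a structure (e.g., from ACE analysis in a simulator); all numbers are made up for illustration:

```python
ace_bits_per_cycle = [40, 52, 36, 48, 44]   # ACE bits observed each cycle
total_bits = 128                             # size of the structure in bits

# AVF = average ACE bits per cycle / total bits in the structure.
avf = sum(ace_bits_per_cycle) / (len(ace_bits_per_cycle) * total_bits)

raw_fit = 500.0                  # assumed raw FIT rate of the structure
effective_fit = raw_fit * avf    # effective FIT = raw FIT x AVF

print(f"AVF = {avf:.2f}")                      # ~0.34
print(f"effective FIT = {effective_fit:.1f}")
```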
1.6 The Rest of This Book
Fault tolerance consists of four aspects:
- Error detection (Chapter 2): A processor cannot tolerate a fault if it is unaware of it. Thus, error detection is the most important aspect of fault tolerance, and we devote the largest fraction of the book to this topic. Error detection can be performed at various granularities. For example, a localized error detection mechanism might check the correctness of an adder’s output, whereas a global or end-to-end error detection mechanism might check the correctness of an entire core.
- Error recovery (Chapter 3): When an error is detected, the processor must take action to mask its effects from the software. A key to error recovery is not making any state visible to the software until this state has been checked by the error detection mechanisms. A common approach to error recovery is for a processor to take periodic checkpoints of its architectural state and, upon error detection, reload into the processor’s state a checkpoint taken before the error occurred (a minimal checkpoint/rollback sketch follows this list).
- Fault diagnosis (Chapter 4): Diagnosis is the process of identifying the fault that caused an error. For transient faults, diagnosis is generally unnecessary because the processor is not going to take any action to repair the fault. However, for permanent faults, it is often desirable to determine that the fault is permanent and then to determine its location. Knowing the location of a permanent fault enables a self-repair scheme to deconfigure the faulty component. If an error detection mechanism is localized, then it also provides diagnosis, but an end-to-end error detection mechanism provides little insight into what caused the error. If diagnosis is desired in a processor that uses an end-to-end error detection mechanism, then the architect must add a diagnosis mechanism.
- Self-repair (Chapter 5): If a processor diagnoses a permanent fault, it is desirable to repair or reconfigure the processor. Self-repair may involve avoiding further use of the faulty component or reconfiguring the processor to use a spare component.
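A hedged sketch of checkpoint-based error recovery: periodically snapshot the architectural state, and on a detected error roll back to the most recent checkpoint taken before the error. The state representation and the error-detection hook are illustrative assumptions, not a real design.

```python
import copy

class Core:
    def __init__(self):
        self.arch_state = {"pc": 0, "regs": [0] * 4}
        self.checkpoints = []

    def take_checkpoint(self):
        # Save a copy of the architectural state (registers, PC, ...).
        self.checkpoints.append(copy.deepcopy(self.arch_state))

    def execute_chunk(self, updates):
        # Apply a chunk of work; in a real design this state would only become
        # visible to software after the error detector has checked it.
        self.arch_state.update(updates)

    def recover(self):
        # On a detected error, restore the most recent checkpoint and re-execute.
        self.arch_state = self.checkpoints.pop()

core = Core()
core.take_checkpoint()
core.execute_chunk({"pc": 100, "regs": [1, 2, 3, 4]})
# ... error detector flags an error in this chunk ...
core.recover()
print(core.arch_state)   # back to the pre-error checkpoint: pc=0, regs=[0,0,0,0]
```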
Related topics that are not covered in this book include:
- Mechanisms for reducing vulnerability to faults
- Schemes for tolerating CMOS process variability
- Design validation and verification
- Fault-tolerant I/O, including disks and network controllers (this book focuses on the processor and memory)
- Approaches for tolerating software bugs