| |
| |
Foreword | |
| |
| |
Preface | |
| |
| |
Acknowledgements | |
| |
| |
About the Authors | |
| |
| |
| |
Preliminaries | |
| |
| |
| |
Fault Classification | |
| |
| |
| |
Types of Redundancy | |
| |
| |
| |
Basic Measures of Fault Tolerance | |
| |
| |
| |
Traditional Measures | |
| |
| |
| |
Network Measures | |
| |
| |
| |
Outline of This Book | |
| |
| |
| |
Further Reading | |
| |
| |
References | |
| |
| |
| |
Hardware Fault Tolerance | |
| |
| |
| |
The Rate of Hardware Failures | |
| |
| |
| |
Failure Rate, Reliability, and Mean Time to Failure | |
| |
| |
| |
Canonical and Resilient Structures | |
| |
| |
| |
Series and Parallel Systems | |
| |
| |
| |
Non-Series/Parallel Systems | |
| |
| |
| |
M-of-N Systems | |
| |
| |
| |
Voters | |
| |
| |
| |
Variations on N-Modular Redundancy | |
| |
| |
| |
Duplex Systems | |
| |
| |
| |
Other Reliability Evaluation Techniques | |
| |
| |
| |
Poisson Processes | |
| |
| |
| |
Markov Models | |
| |
| |
| |
Fault-Tolerance Processor-Level Techniques | |
| |
| |
| |
Watchdog Processor | |
| |
| |
| |
Simultaneous Multithreading for Fault Tolerance | |
| |
| |
| |
Byzantine Failures | |
| |
| |
| |
Byzantine Agreement with Message Authentication | |
| |
| |
| |
Further Reading | |
| |
| |
| |
Exercises | |
| |
| |
References | |
| |
| |
| |
Information Redundancy | |
| |
| |
| |
Coding | |
| |
| |
| |
Parity Codes | |
| |
| |
| |
Checksum | |
| |
| |
| |
M-of-N Codes | |
| |
| |
| |
Berger Code | |
| |
| |
| |
Cyclic Codes | |
| |
| |
| |
Arithmetic Codes | |
| |
| |
| |
Resilient Disk Systems | |
| |
| |
| |
Raid Level 1 | |
| |
| |
| |
Raid Level 2 | |
| |
| |
| |
Raid Level 3 | |
| |
| |
| |
Raid Level 4 | |
| |
| |
| |
Raid Level 5 | |
| |
| |
| |
Modeling Correlated Failures | |
| |
| |
| |
Data Replication | |
| |
| |
| |
Voting: Non-Hierarchical Organization | |
| |
| |
| |
Voting: Hierarchical Organization | |
| |
| |
| |
Primary-Backup Approach | |
| |
| |
| |
Algorithm-Based Fault Tolerance | |
| |
| |
| |
Further Reading | |
| |
| |
| |
Exercises | |
| |
| |
References | |
| |
| |
| |
Fault-Tolerant Networks | |
| |
| |
| |
Measures of Resilience | |
| |
| |
| |
Graph-Theoretical Measures | |
| |
| |
| |
Computer Networks Measures | |
| |
| |
| |
Common Network Topologies and Their Resilience | |
| |
| |
| |
Multistage and Extra-Stage Networks | |
| |
| |
| |
Crossbar Networks | |
| |
| |
| |
Rectangular Mesh and Interstitial Mesh | |
| |
| |
| |
Hypercube Network | |
| |
| |
| |
Cube-Connected Cycles Networks | |
| |
| |
| |
Loop Networks | |
| |
| |
| |
Ad hoc Point-to-Point Networks | |
| |
| |
| |
Fault-Tolerant Routing | |
| |
| |
| |
Hypercube Fault-Tolerant Routing | |
| |
| |
| |
Origin-Based Routing in the Mesh | |
| |
| |
| |
Further Reading | |
| |
| |
| |
Exercises | |
| |
| |
References | |
| |
| |
| |
Software Fault Tolerance | |
| |
| |
| |
Acceptance Tests | |
| |
| |
| |
Single-Version Fault Tolerance | |
| |
| |
| |
Wrappers | |
| |
| |
| |
Software Rejuvenation | |
| |
| |
| |
Data Diversity | |
| |
| |
| |
Software Implemented Hardware Fault Tolerance (SIHFT) | |
| |
| |
| |
N-Version Programming | |
| |
| |
| |
Consistent Comparison Problem | |
| |
| |
| |
Version Independence | |
| |
| |
| |
Recovery Block Approach | |
| |
| |
| |
Basic Principles | |
| |
| |
| |
Success Probability Calculation | |
| |
| |
| |
Distributed Recovery Blocks | |
| |
| |
| |
Preconditions, Postconditions, and Assertions | |
| |
| |
| |
Exception-Handling | |
| |
| |
| |
Requirements from Exception-Handlers | |
| |
| |
| |
Basics of Exceptions and Exception-Handling | |
| |
| |
| |
Language Support | |
| |
| |
| |
Software Reliability Models | |
| |
| |
| |
Jelinski-Moranda Model | |
| |
| |
| |
Littlewood-Verrall Model | |
| |
| |
| |
Musa-Okumoto Model | |
| |
| |
| |
Model Selection and Parameter Estimation | |
| |
| |
| |
Fault-Tolerant Remote Procedure Calls | |
| |
| |
| |
Primary-Backup Approach | |
| |
| |
| |
The Circus Approach | |
| |
| |
| |
Further Reading | |
| |
| |
| |
Exercises | |
| |
| |
References | |
| |
| |
| |
Checkpointing | |
| |
| |
| |
What is Checkpointing? | |
| |
| |
| |
Why is Checkpointing Nontrivial? | |
| |
| |
| |
Checkpoint Level | |
| |
| |
| |
Optimal Checkpointing-An Analytical Model | |
| |
| |
| |
Time Between Checkpoints-A First-Order Approximation | |
| |
| |
| |
Optimal Checkpoint Placement | |
| |
| |
| |
Time Between Checkpoints-A More Accurate Model | |
| |
| |
| |
Reducing Overhead | |
| |
| |
| |
Reducing Latency | |
| |
| |
| |
Cache-Aided Rollback Error Recovery (CARER) | |
| |
| |
| |
Checkpointing in Distributed Systems | |
| |
| |
| |
The Domino Effect and Livelock | |
| |
| |
| |
A Coordinated Checkpointing Algorithm | |
| |
| |
| |
Time-Based Synchronization | |
| |
| |
| |
Diskless Checkpointing | |
| |
| |
| |
Message Logging | |
| |
| |
| |
Checkpointing in Shared-Memory Systems | |
| |
| |
| |
Bus-Based Coherence Protocol | |
| |
| |
| |
Directory-Based Protocol | |
| |
| |
| |
Checkpointing in Real-Time Systems | |
| |
| |
| |
Other Uses of Checkpointing | |
| |
| |
| |
Further Reading | |
| |
| |
| |
Exercises | |
| |
| |
References | |
| |
| |
| |
Case Studies | |
| |
| |
| |
NonStop Systems | |
| |
| |
| |
Architecture | |
| |
| |
| |
Maintenance and Repair Aids | |
| |
| |
| |
Software | |
| |
| |
| |
Modifications to the NonStop Architecture | |
| |
| |
| |
Stratus Systems | |
| |
| |
| |
Cassini Command and Data Subsystem | |
| |
| |
| |
IBM G5 | |
| |
| |
| |
IBM Sysplex | |
| |
| |
| |
Itanium | |
| |
| |
| |
Further Reading | |
| |
| |
References | |
| |
| |
| |
Defect Tolerance in VLSI Circuits | |
| |
| |
| |
Manufacturing Defects and Circuit Faults | |
| |
| |
| |
Probability of Failure and Critical Area | |
| |
| |
| |
Basic Yield Models | |
| |
| |
| |
The Poisson and Compound Poisson Yield Models | |
| |
| |
| |
Variations on the Simple Yield Models | |
| |
| |
| |
Yield Enhancement Through Redundancy | |
| |
| |
| |
Yield Projection for Chips with Redundancy | |
| |
| |
| |
Memory Arrays with Redundancy | |
| |
| |
| |
Logic Integrated Circuits with Redundancy | |
| |
| |
| |
Modifying the Floorplan | |
| |
| |
| |
Further Reading | |
| |
| |
| |
Exercises | |
| |
| |
References | |
| |
| |
| |
Fault Detection in Cryptographic Systems | |
| |
| |
| |
Overview of Ciphers | |
| |
| |
| |
Symmetric Key Ciphers | |
| |
| |
| |
Public Key Ciphers | |
| |
| |
| |
Security Attacks Through Fault Injection | |
| |
| |
| |
Fault Attacks on Symmetric Key Ciphers | |
| |
| |
| |
Fault Attacks on Public (Asymmetric) Key Ciphers | |
| |
| |
| |
Countermeasures | |
| |
| |
| |
Spatial and Temporal Duplication | |
| |
| |
| |
Error-Detecting Codes | |
| |
| |
| |
Are These Countermeasures Sufficient? | |
| |
| |
| |
Final Comment | |
| |
| |
| |
Further Reading | |
| |
| |
| |
Exercises | |
| |
| |
References | |
| |
| |
| |
Simulation Techniques | |
| |
| |
| |
Writing a Simulation Program | |
| |
| |
| |
Parameter Estimation | |
| |
| |
| |
Point Versus Interval Estimation | |
| |
| |
| |
Method of Moments | |
| |
| |
| |
Method of Maximum Likelihood | |
| |
| |
| |
The Bayesian Approach to Parameter Estimation | |
| |
| |
| |
Confidence Intervals | |
| |
| |
| |
Variance Reduction Methods | |
| |
| |
| |
Antithetic Variables | |
| |
| |
| |
Using Control Variables | |
| |
| |
| |
Stratified Sampling | |
| |
| |
| |
Importance Sampling | |
| |
| |
| |
Random Number Generation | |
| |
| |
| |
Uniformly Distributed Random Number Generators | |
| |
| |
| |
Testing Uniform Random Number Generators | |
| |
| |
| |
Generating Other Distributions | |
| |
| |
| |
Fault Injection | |
| |
| |
| |
Types of Fault Injection Techniques | |
| |
| |
| |
Fault Injection Application and Tools | |
| |
| |
| |
Further Reading | |
| |
| |
| |
Exercises | |
| |
| |
References | |
| |
| |
Subject Index | |