Table of Contents
Additional Resources
Preface
Processor Design
The Evolution of Microprocessors
Instruction Set Processor Design
Digital Systems Design
Architecture, Implementation, and Realization
Instruction Set Architecture
Dynamic-Static Interface
Principles of Processor Performance
Processor Performance Equation
Processor Performance Optimizations
Performance Evaluation Method
Instruction-Level Parallel Processing
From Scalar to Superscalar
Limits of Instruction-Level Parallelism
Machines for Instruction-Level Parallelism
Summary
Pipelined Processors
Pipelining Fundamentals
Pipelined Design
Arithmetic Pipeline Example
Pipelining Idealism
Instruction Pipelining
Pipelined Processor Design
Balancing Pipeline Stages
Unifying Instruction Types
Minimizing Pipeline Stalls
Commercial Pipelined Processors
Deeply Pipelined Processors
Summary
Memory and I/O Systems
Introduction
Computer System Overview
Key Concepts: Latency and Bandwidth
Memory Hierarchy
Components of a Modern Memory Hierarchy
Temporal and Spatial Locality
Caching and Cache Memories
Main Memory
Virtual Memory Systems
Demand Paging
Memory Protection
Page Table Architectures
Memory Hierarchy Implementation
Input/Output Systems
Types of I/O Devices
Computer System Busses
Communication with I/O Devices
Interaction of I/O Devices and Memory Hierarchy
Summary
Superscalar Organization
Limitations of Scalar Pipelines
Upper Bound on Scalar Pipeline Throughput
Inefficient Unification into a Single Pipeline
Performance Lost Due to a Rigid Pipeline
From Scalar to Superscalar Pipelines
Parallel Pipelines
Diversified Pipelines
Dynamic Pipelines
Superscalar Pipeline Overview
Instruction Fetching
Instruction Decoding
Instruction Dispatching
Instruction Execution
Instruction Completion and Retiring
Summary
Superscalar Techniques
Instruction Flow Techniques
Program Control Flow and Control Dependences
Performance Degradation Due to Branches
Branch Prediction Techniques
Branch Misprediction Recovery
Advanced Branch Prediction Techniques
Other Instruction Flow Techniques
Register Data Flow Techniques
Register Reuse and False Data Dependences
Register Renaming Techniques
True Data Dependences and the Data Flow Limit
The Classic Tomasulo Algorithm
Dynamic Execution Core
Reservation Stations and Reorder Buffer
Dynamic Instruction Scheduler
Other Register Data Flow Techniques
Memory Data Flow Techniques
Memory Accessing Instructions
Ordering of Memory Accesses
Load Bypassing and Load Forwarding
Other Memory Data Flow Techniques
Summary
The PowerPC 620
Introduction
Experimental Framework
Instruction Fetching
Branch Prediction
Fetching and Speculation
Instruction Dispatching
Instruction Buffer
Dispatch Stalls
Dispatch Effectiveness
Instruction Execution
Issue Stalls
Execution Parallelism
Execution Latency
Instruction Completion
Completion Parallelism
Cache Effects
Conclusions and Observations
Bridging to the IBM POWER3 and POWER4
Summary
Intel's P6 Microarchitecture
Introduction
Basics of the P6 Microarchitecture
Pipelining
In-Order Front-End Pipeline
Out-of-Order Core Pipeline
Retirement Pipeline
The In-Order Front End
Instruction Cache and ITLB
Branch Prediction
Instruction Decoder
Register Alias Table
Allocator
The Out-of-Order Core
Reservation Station
Retirement
The Reorder Buffer
Memory Subsystem
Memory Access Ordering
Load Memory Operations
Basic Store Memory Operations
Deferring Memory Operations
Page Faults
Summary
Acknowledgments
Survey of Superscalar Processors
Development of Superscalar Processors
Early Advances in Uniprocessor Parallelism: The IBM Stretch
First Superscalar Design: The IBM Advanced Computer System
Instruction-Level Parallelism Studies
By-Products of DAE: The First Multiple-Decoding Implementations
IBM Cheetah, Panther, and America
Decoupled Microarchitectures
Other Efforts in the 1980s
Wide Acceptance of Superscalar
A Classification of Recent Designs
RISC and CISC Retrofits
Speed Demons: Emphasis on Clock Cycle Time
Brainiacs: Emphasis on IPC
Processor Descriptions
Compaq / DEC Alpha
Hewlett-Packard PA-RISC Version 1.0
Hewlett-Packard PA-RISC Version 2.0
IBM POWER
Intel i960
Intel IA32--Native Approaches
Intel IA32--Decoupled Approaches
x86-64
MIPS
Motorola
PowerPC--32-bit Architecture
PowerPC--64-bit Architecture
PowerPC-AS
SPARC Version 8
SPARC Version 9
Verification of Superscalar Processors
Acknowledgments
Advanced Instruction Flow Techniques
Introduction
Static Branch Prediction Techniques
Single-Direction Prediction
Backwards Taken/Forwards Not-Taken
Ball/Larus Heuristics
Profiling
Dynamic Branch Prediction Techniques
Basic Algorithms
Interference-Reducing Predictors
Predicting with Alternative Contexts
Hybrid Branch Predictors
The Tournament Predictor
Static Predictor Selection
Branch Classification
The Multihybrid Predictor
Prediction Fusion
Other Instruction Flow Issues and Techniques
Target Prediction
Branch Confidence Prediction
High-Bandwidth Fetch Mechanisms
High-Frequency Fetch Mechanisms
Summary
Advanced Register Data Flow Techniques
Introduction
Value Locality and Redundant Execution
Causes of Value Locality
Quantifying Value Locality
Exploiting Value Locality without Speculation
Memoization
Instruction Reuse
Basic Block and Trace Reuse
Data Flow Region Reuse
Concluding Remarks
Exploiting Value Locality with Speculation
The Weak Dependence Model
Value Prediction
The Value Prediction Unit
Speculative Execution Using Predicted Values
Performance of Value Prediction
Concluding Remarks
Summary
Executing Multiple Threads
Introduction
Synchronizing Shared-Memory Threads
Introduction to Multiprocessor Systems
Fully Shared Memory, Unit Latency, and Lack of Contention
Instantaneous Propagation of Writes
Coherent Shared Memory
Implementing Cache Coherence
Multilevel Caches, Inclusion, and Virtual Memory
Memory Consistency
The Coherent Memory Interface
Concluding Remarks
Explicitly Multithreaded Processors
Chip Multiprocessors
Fine-Grained Multithreading
Coarse-Grained Multithreading
Simultaneous Multithreading
Implicitly Multithreaded Processors
Resolving Control Dependences
Resolving Register Data Dependences
Resolving Memory Data Dependences
Concluding Remarks
Executing the Same Thread
Fault Detection
Prefetching
Branch Resolution
Concluding Remarks
Summary
Index