Preface

Acknowledgments

Dedication

Introduction
    GPUs as Parallel Computers
    Architecture of a Modern GPU
    Why More Speed or Parallelism?
    Parallel Programming Languages and Models
    Overarching Goals
    Organization of the Book

History Of GPU Computing
    Evolution of Graphics Pipelines
        The Era of Fixed-Function Graphics Pipelines
        Evolution of Programmable Real-Time Graphics
        Unified Graphics and Computing Processors
        GPGPU: An Intermediate Step
    GPU Computing
        Scalable GPUs
        Recent Developments
    Future Trends

Introduction To Cuda
    Data Parallelism
    Cuda Program Structure
    A Matrix-Matrix Multiplication Example
    Device Memories and Data Transfer
    Kernel Functions and Threading
    Summary
        Function declarations
        Kernel launch
        Predefined variables
        Runtime API

Cuda Threads
    Cuda Thread Organization
    Using blockIdx and threadIdx
    Synchronization and Transparent Scalability
    Thread Assignment
    Thread Scheduling and Latency Tolerance
    Summary
    Exercises

Cuda Memories
    Importance of Memory Access Efficiency
    CUDA Device Memory Types
    A Strategy for Reducing Global Memory Traffic
    Memory as a Limiting Factor to Parallelism
    Summary
    Exercises

Performance Considerations
    More on Thread Execution
    Global Memory Bandwidth
    Dynamic Partitioning of SM Resources
    Data Prefetching
    Instruction Mix
    Thread Granularity
    Measured Performance and Summary
    Exercises

Floating Point Considerations
    Floating-Point Format
        Normalized Representation of M
        Excess Encoding of E
    Representable Numbers
    Special Bit Patterns and Precision
    Arithmetic Accuracy and Rounding
    Algorithm Considerations
    Summary
    Exercises

Application Case Study: Advanced MRI Reconstruction
    Application Background
    Iterative Reconstruction
    Computing F^Hd
        Determine the Kernel Parallelism Structure
        Getting Around the Memory Bandwidth Limitation
        Using Hardware Trigonometry Functions
        Experimental Performance Tuning
    Final Evaluation
    Exercises

Application Case Study: Molecular Visualization and Analysis
    Application Background
    A Simple Kernel Implementation
    Instruction Execution Efficiency
    Memory Coalescing
    Additional Performance Comparisons
    Using Multiple GPUs
    Exercises

Parallel Programming and Computational Thinking
    Goals of Parallel Programming
    Problem Decomposition
    Algorithm Selection
    Computational Thinking
    Exercises

A Brief Introduction To OpenCL
    Background
    Data Parallelism Model
    Device Architecture
    Kernel Functions
    Device Management and Kernel Launch
    Electrostatic Potential Map in OpenCL
    Summary
    Exercises

Conclusion And Future Outlook
    Goals Revisited
    Memory Architecture Evolution
        Large Virtual and Physical Address Spaces
        Unified Device Memory Space
        Configurable Caching and Scratch Pad
        Enhanced Atomic Operations
        Enhanced Global Memory Access
    Kernel Execution Control Evolution
        Function Calls within Kernel Functions
        Exception Handling in Kernel Functions
        Simultaneous Execution of Multiple Kernels
        Interruptible Kernels
    Core Performance
        Double-Precision Speed
        Better Control Flow Efficiency
    Programming Environment
    A Bright Outlook

Matrix Multiplication Host-Only Version Source Code
    matrixmul.cu
    matrixmul_gold.cpp
    matrixmul.h
    assist.h
    Expected Output

GPU Compute Capabilities
    GPU Compute Capability Tables
    Memory Coalescing Variations

Index