| |
| |
Acknowledgments | |
| |
| |
Preface | |
| |
| |
| |
CUDA Fortran Programming | |
| |
| |
| |
Introduction | |
| |
| |
| |
A Brief History of GPU Computing | |
| |
| |
| |
Parallel Computation | |
| |
| |
| |
Basic Concepts | |
| |
| |
| |
A First CUDA Fortran Program | |
| |
| |
| |
Extending to Larger Arrays | |
| |
| |
| |
Multidimensional Arrays | |
| |
| |
| |
Determining CUDA Hardware Features and Limits | |
| |
| |
| |
Single and Double Precision | |
| |
| |
| |
Error Handling | |
| |
| |
| |
Compiling CUDA Fortran Code | |
| |
| |
| |
Separate Compilation | |
| |
| |
| |
Performance Measurement and Metrics | |
| |
| |
| |
Measuring Kernel Execution Time | |
| |
| |
| |
Host-Device Synchronization and CPU Timers | |
| |
| |
| |
Timing via CUDA Events | |
| |
| |
| |
Command Line Profiler | |
| |
| |
| |
The nvprof Profiling Tool | |
| |
| |
| |
Instruction, Bandwidth, and Latency Bound Kernels | |
| |
| |
| |
Memory Bandwidth | |
| |
| |
| |
Theoretical Peak Bandwidth | |
| |
| |
| |
Effective Bandwidth | |
| |
| |
| |
Actual Data Throughput vs. Effective Bandwidth | |
| |
| |
| |
Optimization | |
| |
| |
| |
Transfers between Host and Device | |
| |
| |
| |
Pinned Memory | |
| |
| |
| |
Batching Small Data Transfers | |
| |
| |
| |
Asynchronous Data Transfers (Advanced Topic) | |
| |
| |
| |
Device Memory | |
| |
| |
| |
Declaring Data in Device Code | |
| |
| |
| |
Coalesced Access to Global Memory | |
| |
| |
| |
Texture Memory | |
| |
| |
| |
Local Memory | |
| |
| |
| |
Constant Memory | |
| |
| |
| |
On-Chip Memory | |
| |
| |
| |
L1 Cache | |
| |
| |
| |
Registers | |
| |
| |
| |
Shared Memory | |
| |
| |
| |
Memory Optimization Example: Matrix Transpose | |
| |
| |
| |
Partition Camping (Advanced Topic) | |
| |
| |
| |
Execution Configuration | |
| |
| |
| |
Thread-Level Parallelism | |
| |
| |
| |
Instruction-Level Parallelism | |
| |
| |
| |
Instruction Optimization | |
| |
| |
| |
Device Intrinsics | |
| |
| |
| |
Compiler Options | |
| |
| |
| |
Divergent Warps | |
| |
| |
| |
Kernel Loop Directives | |
| |
| |
| |
Reductions in CUF Kernels | |
| |
| |
| |
Streams in CUF Kernels | |
| |
| |
| |
Instruction-Level Parallelism in CUF Kernels | |
| |
| |
| |
Multi-GPU Programming | |
| |
| |
| |
CUDA Multi-GPU Features | |
| |
| |
| |
Peer-to-Peer Communication | |
| |
| |
| |
Peer-to-Peer Direct Transfers | |
| |
| |
| |
Peer-to-Peer Transpose | |
| |
| |
| |
Multi-GPU Programming with MPI | |
| |
| |
| |
Assigning Devices to MPI Ranks | |
| |
| |
| |
MPI Transpose | |
| |
| |
| |
GPU-Aware MPI Transpose | |
| |
| |
| |
Case Studies | |
| |
| |
| |
Monte Carlo Method | |
| |
| |
| |
CURAND | |
| |
| |
| |
Computing π with CUF Kernels | |
| |
| |
| |
IEEE-754 Precision (Advanced Topic) | |
| |
| |
| |
Computing π with Reduction Kernels | |
| |
| |
| |
Reductions with Atomic Locks (Advanced Topic) | |
| |
| |
| |
Accuracy of Summation | |
| |
| |
| |
Option Pricing | |
| |
| |
| |
Finite Difference Method | |
| |
| |
| |
Nine-Point 1D Finite Difference Stencil | |
| |
| |
| |
Data Reuse and Shared Memory | |
| |
| |
| |
The x-Derivative Kernel | |
| |
| |
| |
Derivatives in y and z | |
| |
| |
| |
Nonuniform Grids | |
| |
| |
| |
2D Laplace Equation | |
| |
| |
| |
Applications of Fast Fourier Transform | |
| |
| |
| |
CUFFT | |
| |
| |
| |
Spectral Derivatives | |
| |
| |
| |
Convolution | |
| |
| |
| |
Poisson Solver | |
| |
| |
| |
Appendices | |
| |
| |
| |
Tesla Specifications | |
| |
| |
| |
System and Environment Management | |
| |
| |
| |
Environment Variables | |
| |
| |
| |
General | |
| |
| |
| |
Command Line Profiler | |
| |
| |
| |
Just-in-Time Compilation | |
| |
| |
| |
nvidia-smi System Management Interface | |
| |
| |
| |
Enabling and Disabling ECC | |
| |
| |
| |
Compute Mode | |
| |
| |
| |
Persistence Mode | |
| |
| |
| |
Calling CUDA C from CUDA Fortran | |
| |
| |
| |
Calling CUDA C Libraries | |
| |
| |
| |
Calling User-Written CUDA C Code | |
| |
| |
| |
Source Code | |
| |
| |
| |
Texture Memory | |
| |
| |
| |
Matrix Transpose | |
| |
| |
| |
Thread- and Instruction-Level Parallelism | |
| |
| |
| |
Multi-GPU Programming | |
| |
| |
| |
Peer-to-Peer Transpose | |
| |
| |
| |
MPI Transpose with Host MPI Transfers | |
| |
| |
| |
MPI Transpose with Device MPI Transfers | |
| |
| |
| |
Finite Difference Code | |
| |
| |
| |
Spectral Poisson Solver | |
| |
| |
References | |
| |
| |
Index | |