
CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming


ISBN-10: 0124169708

ISBN-13: 9780124169708

Edition: 2014

Authors: Gregory Ruetsch, Massimiliano Fatica


Description:

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presume no prior parallel computing experience, and cover the basics along with best practices for efficient GPU computing using CUDA Fortran. To add CUDA Fortran to existing Fortran codes, they explain how to understand the target GPU architecture, identify computationally intensive parts of the code, and modify the code to manage data and parallelism and optimize performance - all in Fortran, without having to rewrite in another language.…
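The opening chapters listed in the contents below build this workflow up from minimal examples. As a rough illustration of what such a port looks like in practice, here is a short CUDA Fortran sketch (written for this listing, not excerpted from the book; the module, kernel, and variable names are illustrative): a loop over array elements becomes an attributes(global) kernel, data is staged in device arrays, and the kernel is launched with an execution configuration of blocks and threads.

    module simple_kernels
    contains
      attributes(global) subroutine increment(a, b)
        implicit none
        integer, intent(inout) :: a(:)
        integer, value :: b
        integer :: i
        ! global thread index: each thread updates one array element
        i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
        if (i <= size(a)) a(i) = a(i) + b
      end subroutine increment
    end module simple_kernels

    program increment_test
      use cudafor
      use simple_kernels
      implicit none
      integer, parameter :: n = 1024, tpb = 256   ! tpb = threads per block
      integer :: a(n), b
      integer, device :: a_d(n)                   ! device-resident copy of a

      a = 1
      b = 3
      a_d = a                                     ! host-to-device copy via assignment
      call increment<<<(n + tpb - 1)/tpb, tpb>>>(a_d, b)
      a = a_d                                     ! device-to-host copy via assignment
      if (all(a == b + 1)) then
         print *, 'Test passed'
      else
         print *, 'Test failed'
      end if
    end program increment_test

With the PGI/NVIDIA Fortran compilers the book targets, a source file with the .cuf suffix is treated as CUDA Fortran, so a sketch like this compiles without extra flags; details are covered in the "Compiling CUDA Fortran Code" section of the contents.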

Book details

Copyright year: 2014
Publisher: Elsevier Science & Technology
Publication date: 10/24/2013
Binding: Paperback
Pages: 338
Size: 7.50" wide x 9.21" long x 0.75" tall
Weight: 1.540 lbs
Language: English

Greg Ruetsch is a Senior Applied Engineer at NVIDIA, where he works on CUDA Fortran and performance optimization of HPC codes. He holds a Bachelor's degree in mechanical and aerospace engineering from Rutgers University and a Ph.D. in applied mathematics from Brown University. Prior to joining NVIDIA, he held research positions at Stanford University's Center for Turbulence Research and at Sun Microsystems Laboratories.

Massimiliano Fatica is the manager of the Tesla HPC Group at NVIDIA, where he works in the area of GPU computing (high-performance computing and clusters). He holds a Laurea in Aeronautical Engineering and a Ph.D. in Theoretical and Applied Mechanics from the University of Rome "La Sapienza". Prior to joining NVIDIA, he was a research staff member at Stanford University, where he worked at the Center for Turbulence Research and the Center for Integrated Turbulent Simulations on applications for the Stanford Streaming Supercomputer.

Acknowledgments
Preface
CUDA Fortran Programming
Introduction
A Brief History of GPU Computing
Parallel Computation
Basic Concepts
A First CUDA Fortran Program
Extending to Larger Arrays
Multidimensional Arrays
Determining CUDA Hardware Features and Limits
Single and Double Precision
Error Handling
Compiling CUDA Fortran Code
Separate Compilation
Performance Measurement and Metrics
Measuring Kernel Execution Time
Host-Device Synchronization and CPU Timers
Timing via CUDA Events
Command Line Profiler
The nvprof Profiling Tool
Instruction, Bandwidth, and Latency Bound Kernels
Memory Bandwidth
Theoretical Peak Bandwidth
Effective Bandwidth
Actual Data Throughput vs. Effective Bandwidth
Optimization
Transfers between Host and Device
Pinned Memory
Batching Small Data Transfers
Asynchronous Data Transfers (Advanced Topic)
Device Memory
Declaring Data in Device Code
Coalesced Access to Global Memory
Texture Memory
Local Memory
Constant Memory
On-Chip Memory
L1 Cache
Registers
Shared Memory
Memory Optimization Example: Matrix Transpose
Partition Camping (Advanced Topic)
Execution Configuration
Thread-Level Parallelism
Instruction-Level Parallelism
Instruction Optimization
Device Intrinsics
Compiler Options
Divergent Warps
Kernel Loop Directives
Reductions in CUF Kernels
Streams in CUF Kernels
Instruction-Level Parallelism in CUF Kernels
Multi-GPU Programming
CUDA Multi-GPU Features
Peer-to-Peer Communication
Peer-to-Peer Direct Transfers
Peer-to-Peer Transpose
Multi-GPU Programming with MPI
Assigning Devices to MPI Ranks
MPI Transpose
GPU-Aware MPI Transpose
Case Studies
Monte Carlo Method
CURAND
Computing π with CUF Kernels
IEEE-754 Precision (Advanced Topic)
Computing π with Reduction Kernels
Reductions with Atomic Locks (Advanced Topic)
Accuracy of Summation
Option Pricing
Finite Difference Method
Nine-Point 1D Finite Difference Stencil
Data Reuse and Shared Memory
The x-Derivative Kernel
Derivatives in y and z
Nonuniform Grids
2D Laplace Equation
Applications of Fast Fourier Transform
CUFFT
Spectral Derivatives
Convolution
Poisson Solver
Appendices
Tesla Specifications
System and Environment Management
Environment Variables
General
Command Line Profiler
Just-in-Time Compilation
nvidia-smi System Management Interface
Enabling and Disabling ECC
Compute Mode
Persistence Mode
Calling CUDA C from CUDA Fortran
Calling CUDA C Libraries
Calling User-Written CUDA C Code
Source Code
Texture Memory
Matrix Transpose
Thread- and Instruction-Level Parallelism
Multi-GPU Programming
Peer-to-Peer Transpose
MPI Transpose with Host MPI Transfers
MPI Transpose with Device MPI Transfers
Finite Difference Code
Spectral Poisson Solver
References
Index