| |
| |
Foreword | |
| |
| |
Preface | |
| |
| |
About the authors | |
| |
| |
List of acronyms and abbreviations | |
| |
| |
| |
Modern processors | |
| |
| |
| |
Stored-program computer architecture | |
| |
| |
| |
General-purpose cache-based microprocessor architecture | |
| |
| |
| |
Performance metrics and benchmarks | |
| |
| |
| |
Transistors galore: Moore's Law | |
| |
| |
| |
Pipelining | |
| |
| |
| |
Superscalarity | |
| |
| |
| |
SIMD | |
| |
| |
| |
Memory hierarchies | |
| |
| |
| |
Cache | |
| |
| |
| |
Cache mapping | |
| |
| |
| |
Prefetch | |
| |
| |
| |
Multicore processors | |
| |
| |
| |
Multithreaded processors | |
| |
| |
| |
Vector processors | |
| |
| |
| |
Design principles | |
| |
| |
| |
Maximum performance estimates | |
| |
| |
| |
Programming for vector architectures | |
| |
| |
| |
Basic optimization techniques for serial code | |
| |
| |
| |
Scalar profiling | |
| |
| |
| |
Function- and line-based runtime profiling | |
| |
| |
| |
Hardware performance counters | |
| |
| |
| |
Manual instrumentation | |
| |
| |
| |
Common sense optimizations | |
| |
| |
| |
Do less work! | |
| |
| |
| |
Avoid expensive operations! | |
| |
| |
| |
Shrink the working set! | |
| |
| |
| |
Simple measures, large impact | |
| |
| |
| |
Elimination of common subexpressions | |
| |
| |
| |
Avoiding branches | |
| |
| |
| |
Using SIMD instruction sets | |
| |
| |
| |
The role of compilers | |
| |
| |
| |
General optimization options | |
| |
| |
| |
Inlining | |
| |
| |
| |
Aliasing | |
| |
| |
| |
Computational accuracy | |
| |
| |
| |
Register optimizations | |
| |
| |
| |
Using compiler logs | |
| |
| |
| |
C++ optimizations | |
| |
| |
| |
Temporaries | |
| |
| |
| |
Dynamic memory management | |
| |
| |
| |
Loop kernels and iterators | |
| |
| |
| |
Data access optimization | |
| |
| |
| |
Balance analysis and lightspeed estimates | |
| |
| |
| |
Bandwidth-based performance modeling | |
| |
| |
| |
The STREAM benchmarks | |
| |
| |
| |
Storage order | |
| |
| |
| |
Case study: The Jacobi algorithm | |
| |
| |
| |
Case study: Dense matrix transpose | |
| |
| |
| |
Algorithm classification and access optimizations | |
| |
| |
| |
O(N)/O(N) | |
| |
| |
| |
O(N<sup>2</sup>)/O(N<sup>2</sup>) | |
| |
| |
| |
O(N<sup>3</sup>)/O(N<sup>2</sup>) | |
| |
| |
| |
Case study: Sparse matrix-vector multiply | |
| |
| |
| |
Sparse matrix storage schemes | |
| |
| |
| |
Optimizing JDS sparse MVM | |
| |
| |
| |
Parallel computers | |
| |
| |
| |
Taxonomy of parallel computing paradigms | |
| |
| |
| |
Shared-memory computers | |
| |
| |
| |
Cache coherence | |
| |
| |
| |
UMA | |
| |
| |
| |
ccNUMA | |
| |
| |
| |
Distributed-memory computers | |
| |
| |
| |
Hierarchical (hybrid) systems | |
| |
| |
| |
Networks | |
| |
| |
| |
Basic performance characteristics of networks | |
| |
| |
| |
Buses | |
| |
| |
| |
Switched and fat-tree networks | |
| |
| |
| |
Mesh networks | |
| |
| |
| |
Hybrids | |
| |
| |
| |
Basics of parallelization | |
| |
| |
| |
Why parallelize? | |
| |
| |
| |
Parallelism | |
| |
| |
| |
Data parallelism | |
| |
| |
| |
Functional parallelism | |
| |
| |
| |
Parallel scalability | |
| |
| |
| |
Factors that limit parallel execution | |
| |
| |
| |
Scalability metrics | |
| |
| |
| |
Simple scalability laws | |
| |
| |
| |
Parallel efficiency | |
| |
| |
| |
Serial performance versus strong scalability | |
| |
| |
| |
Refined performance models | |
| |
| |
| |
Choosing the right scaling baseline | |
| |
| |
| |
Case study: Can slower processors compute faster? | |
| |
| |
| |
Load imbalance | |
| |
| |
| |
Shared-memory parallel programming with OpenMP | |
| |
| |
| |
Short introduction to OpenMP | |
| |
| |
| |
Parallel execution | |
| |
| |
| |
Data scoping | |
| |
| |
| |
OpenMP worksharing for loops | |
| |
| |
| |
Synchronization | |
| |
| |
| |
Reductions | |
| |
| |
| |
Loop scheduling | |
| |
| |
| |
Tasking | |
| |
| |
| |
Miscellaneous | |
| |
| |
| |
Case study: OpenMP-parallel Jacobi algorithm | |
| |
| |
| |
Advanced OpenMP: Wavefront parallelization | |
| |
| |
| |
Efficient OpenMP programming | |
| |
| |
| |
Profiling OpenMP programs | |
| |
| |
| |
Performance pitfalls | |
| |
| |
| |
Ameliorating the impact of OpenMP worksharing constructs | |
| |
| |
| |
Determining OpenMP overhead for short loops | |
| |
| |
| |
Serialization | |
| |
| |
| |
False sharing | |
| |
| |
| |
Case study: Parallel sparse matrix-vector multiply | |
| |
| |
| |
Locality optimizations on ccNUMA architectures | |
| |
| |
| |
Locality of access on ccNUMA | |
| |
| |
| |
Page placement by first touch | |
| |
| |
| |
Access locality by other means | |
| |
| |
| |
Case study: ccNUMA optimization of sparse MVM | |
| |
| |
| |
Placement pitfalls | |
| |
| |
| |
NUMA-unfriendly OpenMP scheduling | |
| |
| |
| |
File system cache | |
| |
| |
| |
ccNUMA issues with C++ | |
| |
| |
| |
Arrays of objects | |
| |
| |
| |
Standard Template Library | |
| |
| |
| |
Distributed-memory parallel programming with MPI | |
| |
| |
| |
Message passing | |
| |
| |
| |
A short introduction to MPI | |
| |
| |
| |
A simple example | |
| |
| |
| |
Messages and point-to-point communication | |
| |
| |
| |
Collective communication | |
| |
| |
| |
Nonblocking point-to-point communication | |
| |
| |
| |
Virtual topologies | |
| |
| |
| |
Example: MPI parallelization of a Jacobi solver | |
| |
| |
| |
MPI implementation | |
| |
| |
| |
Performance properties | |
| |
| |
| |
Efficient MPI programming | |
| |
| |
| |
MPI performance tools | |
| |
| |
| |
Communication parameters | |
| |
| |
| |
Synchronization, serialization, contention | |
| |
| |
| |
Implicit serialization and synchronization | |
| |
| |
| |
Contention | |
| |
| |
| |
Reducing communication overhead | |
| |
| |
| |
Optimal domain decomposition | |
| |
| |
| |
Aggregating messages | |
| |
| |
| |
Nonblocking vs. asynchronous communication | |
| |
| |
| |
Collective communication | |
| |
| |
| |
Understanding intranode point-to-point communication | |
| |
| |
| |
Hybrid parallelization with MPI and OpenMP | |
| |
| |
| |
Basic MPI/OpenMP programming models | |
| |
| |
| |
Vector mode implementation | |
| |
| |
| |
Task mode implementation | |
| |
| |
| |
Case study: Hybrid Jacobi solver | |
| |
| |
| |
MPI taxonomy of thread interoperability | |
| |
| |
| |
Hybrid decomposition and mapping | |
| |
| |
| |
Potential benefits and drawbacks of hybrid programming | |
| |
| |
| |
Topology and affinity in multicore environments | |
| |
| |
| |
Topology | |
| |
| |
| |
Thread and process placement | |
| |
| |
| |
External affinity control | |
| |
| |
| |
Affinity under program control | |
| |
| |
| |
Page placement beyond first touch | |
| |
| |
| |
Solutions to the problems | |
| |
| |
Bibliography | |
| |
| |
Index | |