Introduction to High Performance Computing for Scientists and Eng

Name: Introduction to High Performance Computing for Scientists and Eng
Price: 35.22 USD
Availability: InStock
ISBN: 9781439811924

ISBN-10: 143981192X

ISBN-13: 9781439811924

Edition: 2010

Authors: Georg Hager, Gerhard Wellein

List price: $46.99

30 day, 100% satisfaction guarantee!

Sell

Get cash fast!

Marketplace

2 new & used from $35.22

what's this?

Rush Rewards U
Members Receive:

You have reached 400 XP and carrot coins. That is the daily max!

Book details

List price: $46.99
Copyright year: 2010
Publisher: CRC Press
Publication date: 7/7/2010
Binding: Paperback
Pages: 356
Size: 5.75" wide x 9.00" long x 0.75" tall
Weight: 1.342
Language: English

University of Erlangen Nuremberg, Germany



Foreword


Preface


About the authors


List of acronyms and abbreviations



Modern processors



Stored-program computer architecture



General-purpose cache-based microprocessor architecture



Performance metrics and benchmarks



Transistors galore: Moore's Law



Pipelining



Superscalarity



SIMD



Memory hierarchies



Cache



Cache mapping



Prefetch



Multicore processors



Multithreaded processors



Vector processors



Design principles



Maximum performance estimates



Programming for vector architectures



Basic optimization techniques for serial code



Scalar profiling



Function- and line-based runtime profiling



Hardware performance counters



Manual instrumentation



Common sense optimizations



Do less work!



Avoid expensive operations!



Shrink the working set!



Simple measures, large impact



Elimination of common subexpressions



Avoiding branches



Using SIMD instruction sets



The role of compilers



General optimization options



Inlining



Aliasing



Computational accuracy



Register optimizations



Using compiler logs



C++ optimizations



Temporaries



Dynamic memory management



Loop kernels and iterators



Data access optimization



Balance analysis and lightspeed estimates



Bandwidth-based performance modeling



The STREAM benchmarks



Storage order



Case study: The Jacobi algorithm



Case study: Dense matrix transpose



Algorithm classification and access optimizations



O(N)/O(N)



O(N<sup>2</sup>)/O(N<sup>2</sup>)



O(N<sup>3</sup>)/O(N<sup>2</sup>)



Case study: Sparse matrix-vector multiply



Sparse matrix storage schemes



Optimizing JDS sparse MVM



Parallel computers



Taxonomy of parallel computing paradigms



Shared-memory computers



Cache coherence



UMA



ccNUMA



Distributed-memory computers



Hierarchical (hybrid) systems



Networks



Basic performance characteristics of networks



Buses



Switched and fat-tree networks



Mesh networks



Hybrids



Basics of parallelization



Why parallelize?



Parallelism



Data parallelism



Functional parallelism



Parallel scalability



Factors that limit parallel execution



Scalability metrics



Simple scalability laws



Parallel efficiency



Serial performance versus strong scalability



Refined performance models



Choosing the right scaling baseline



Case study: Can slower processors compute faster?



Load imbalance



Shared-memory parallel programming with OpenMP



Short introduction to OpenMP



Parallel execution



Data scoping



OpenMP worksharing for loops



Synchronization



Reductions



Loop scheduling



Tasking



Miscellaneous



Case study: OpenMP-parallel Jacobi algorithm



Advanced OpenMP: Wavefront parallelization



Efficient OpenMP programming



Profiling OpenMP programs



Performance pitfalls



Ameliorating the impact of OpenMP worksharing constructs



Determining OpenMP overhead for short loops



Serialization



False sharing



Case study: Parallel sparse matrix-vector multiply



Locality optimizations on ccNUMA architectures



Locality of access on ccNUMA



Page placement by first touch



Access locality by other means



Case study: ccNUMA optimization of sparse MVM



Placement pitfalls



NUMA-unfriendly OpenMP scheduling



File system cache



ccNUMA issues with C++



Arrays of objects



Standard Template Library



Distributed-memory parallel programming with MPI



Message passing



A short introduction to MPI



A simple example



Messages and point-to-point communication



Collective communication



Nonblocking point-to-point communication



Virtual topologies



Example: MPI parallelization of a Jacobi solver



MPI implementation



Performance properties



Efficient MPI programming



MPI performance tools



Communication parameters



Synchronization, serialization, contention



Implicit serialization and synchronization



Contention



Reducing communication overhead



Optimal domain decomposition



Aggregating messages



Nonblocking vs. asynchronous communication



Collective communication



Understanding intranode point-to-point communication



Hybrid parallelization with MPI and OpenMP



Basic MPI/OpenMP programming models



Vector mode implementation



Task mode implementation



Case study: Hybrid Jacobi solver



MPI taxonomy of thread interoperability



Hybrid decomposition and mapping



Potential benefits and drawbacks of hybrid programming



Topology and affinity in multicore environments



Topology



Thread and process placement



External affinity control



Affinity under program control



Page placement beyond first touch



Solutions to the problems


Bibliography


Index