PhD, MTech, MS(R), 4th-year UG
To obtain good performance, one needs to write correct and scalable parallel programs using programming language abstractions like threads. In addition, the developer needs to be aware of and utilize architecture-specific features like vectorization to extract the full performance potential. In this course, we will combine programming language abstractions with architecture-aware development to learn to write scalable parallel programs.
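For instance, here is a minimal sketch of the thread abstraction using the Pthreads API; the thread count, array size, and helper names are illustrative choices, not part of the course material:

    #include <pthread.h>
    #include <cstdio>

    constexpr int NUM_THREADS = 4;      // illustrative thread count
    constexpr int N = 1 << 20;          // illustrative array size

    static double a[N];
    static double partial[NUM_THREADS]; // one slot per thread avoids data races

    struct Arg { int tid; };

    // Each thread sums a contiguous chunk of the array.
    void *sum_chunk(void *p) {
        int tid = static_cast<Arg *>(p)->tid;
        int chunk = N / NUM_THREADS;    // assumes N divides evenly
        double s = 0.0;
        for (int i = tid * chunk; i < (tid + 1) * chunk; ++i)
            s += a[i];
        partial[tid] = s;
        return nullptr;
    }

    int main() {
        for (int i = 0; i < N; ++i) a[i] = 1.0;
        pthread_t t[NUM_THREADS];
        Arg args[NUM_THREADS];
        for (int i = 0; i < NUM_THREADS; ++i) {
            args[i].tid = i;
            pthread_create(&t[i], nullptr, sum_chunk, &args[i]);
        }
        double total = 0.0;
        for (int i = 0; i < NUM_THREADS; ++i) {
            pthread_join(t[i], nullptr);  // join, then combine partial sums
            total += partial[i];
        }
        printf("sum = %f\n", total);      // expect N * 1.0
        return 0;
    }

Even this small example raises the correctness and performance questions the course examines: how work is partitioned, how threads avoid data races, and how results are combined.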
The course will involve programming assignments to apply the concepts learnt in class and to appreciate the challenges in extracting performance.
The course will primarily focus on the following topics:
Introduction: challenges in parallel programming, correctness and performance errors, understanding performance, performance models
Exploiting spatial and temporal locality with caches, analytical cache miss analysis
Compiler transformations: dependence analysis, loop transformations
Shared-memory programming and Pthreads
Compiler vectorization: vector ISA, auto-vectorizing compiler, vector intrinsics, assembly
OpenMP: core OpenMP, advanced OpenMP, heterogeneous programming with OpenMP (a short usage sketch appears after this list)
Parallel programming models and patterns
Intel Threading Building Blocks
GPGPU programming: GPU architecture and CUDA programming
Performance bottleneck analysis: PAPI counters, using performance analysis tools
Fork-join parallelism
Concurrent data structures
Shared-memory synchronization
Memory consistency models
Transactional memory
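As a taste of the OpenMP material above, here is a hedged sketch of the same parallel array sum written with OpenMP; the parallel for simd directive combines fork-join parallelism with a compiler vectorization hint, the array size is again arbitrary, and the code assumes an OpenMP-capable compiler (e.g., compile with -fopenmp):

    #include <cstdio>

    int main() {
        constexpr int N = 1 << 20;   // illustrative size
        static double a[N];
        for (int i = 0; i < N; ++i) a[i] = 1.0;

        double total = 0.0;
        // Fork-join: a team of threads splits the loop iterations.
        // The reduction clause gives each thread a private copy of
        // total and combines the copies at the join. The simd clause
        // asks the compiler to also vectorize each thread's chunk.
        #pragma omp parallel for simd reduction(+ : total)
        for (int i = 0; i < N; ++i)
            total += a[i];

        printf("sum = %f\n", total); // expect N * 1.0
        return 0;
    }

Compared with the Pthreads sketch earlier, the directive-based style delegates work partitioning, race avoidance, and result combination to the runtime and compiler, which is a recurring trade-off discussed in the course.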