SuperComputing 2009 tutorials

Tutorial Day 1 - Application Supercomputing and the Many-Core Paradigm Shift

 

Parallel Computing Architecture and Resources

 

Alice Koniges, Berkeley Lab/NERSC

o   The distinction between what is a "Massively parallel processor" (MPP) and a "cluster" is diminishing.

o   Parallel Vector Processors (PVPs) are no longer separate entities; vector/SIMD instruction sets (SSE etc.) are now often integrated into COTS processors

o   Shared memory multiprocessors (SMPs) not so popular anymore

o   ccNUMA (cache-coherent Non-Uniform Memory Access) became one way to scale up SMPs; still widely supported within servers, but not common between servers

o   Clusters – Beowulfs now most common architecture in Top500. Constructed from commodity processors and interconnects.

o   Most parallel systems built from smaller SMP building blocks (e.g. multiple cores on a single chip and multiple CPUs inside one server are often connected as SMPs)

o   Different programming models have been tried, but now "MPI everywhere" reigns, with OpenMP within nodes

o   What's wrong with "MPI everywhere"?

o   Can run one MPI process per core

o   But wastes intra-chip latency and bandwidth benefits

¤  100X lower latency and 100X higher bandwidth on-chip than off-chip

o   May only scale to 4-8 cores anyway, and will stop scaling at some point (soon?)

¤  Latency – data copying for MPI

¤  Etc.

o   Number of cores per chip now doubling every 18 months (really? I thought it was lower than this, though still fast)

o   Computing performance now limited by power dissipation

o   Major trend towards more, simpler, lower power cores to maximise power-limited performance

o   Multi-core CPUs now tending to share a large, on-chip L3 cache with DRAM sitting beyond (and accessed via) the L3 cache

o   Important point – OpenMP doesn't have a concept of non-uniform memory (so even a multi-CPU SMP can cause it problems, where DRAM sitting off other processors is non-uniform w.r.t. performance)

o   Sequoia will have ~1.5 million cores and even more threads with Speculative Execution (SE) and Transactional Memory (TM)

o   OpenMP/MPI combo wasn't possible with BlueGene/L (lightweight kernel) but is now possible with BlueGene/P

o   Challenging to partition hardware and software threads optimally

o   E.g. balance between MPI and OpenMP threads

o   Not currently clear how GPUs are going to be integrated into systems:

o   Will they be the machine or integrated as accelerators to the machine?

o   Common benchmarks:

o   HPC Challenge

o   LINPACK (Top500)

o   STREAM

o   NAS parallel benchmarks (simple kernels)

o   SPEC

o   Good idea: try plotting Amdahl's Law for various parallel fractions and numbers of processors (Speedup = 1/((1-f)+(f/p))) – try it with a large number of processors (see the sketch below)
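
A quick sketch of my own (not from the tutorial) of what Amdahl's Law gives for a few illustrative parallel fractions and processor counts:

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((1 - f) + f/p), where f is the parallel
       fraction and p is the number of processors. */
    int main(void)
    {
        const double fractions[] = { 0.5, 0.9, 0.99, 0.999 };
        const int    procs[]     = { 16, 256, 4096, 1000000 };

        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++) {
                double f = fractions[i];
                double p = (double)procs[j];
                printf("f=%.3f  p=%8d  speedup=%10.1f\n",
                       fractions[i], procs[j], 1.0 / ((1.0 - f) + f / p));
            }
        }
        return 0;
    }

Even with f = 0.999, a million processors only buys a speedup of roughly 1000 – the serial fraction dominates.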

 

Parallel Programming Models

 

Rusty Lusk, Mathematics and Computer Science Division, Argonne National Laboratory

o   One way of looking at Parallel programming models:

o   Shared memory

o   Distributed memory

o   Some of each

o   Less explicit

o   Express parallel programming models:

o   In libraries

o   In languages

o   In structured comments (pragmas, e.g. OpenMP, HPF) which act as hints to compilers

o   HPF gives the user some control over data placement, but OpenMP does not (an important distinction)
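
As a footnote to the pragma point above, a minimal sketch of my own (not from the talk): the pragma is only a hint, and the loop is still valid serial C if the compiler ignores it.

    #include <stdio.h>

    int main(void)
    {
        double a[1000], b[1000];
        for (int i = 0; i < 1000; i++) b[i] = (double)i;

        /* structured comment acting as a compiler hint: parallelise this loop */
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            a[i] = 2.0 * b[i];

        printf("a[999] = %f\n", a[999]);
        return 0;
    }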

o   PGAS (Partitioned Global Address Space) languages

o   Global address space across whole machine

o   Supports ways to specify data layout

o   Co-array Fortran

¤  SPMD (single programme multiple data)

¤  Implicitly asynchronous

o   UPC is an extension of C (not C++)

¤  Adds notion of shared variables and UPC threads

¤  Also adds control expressions such as upc_forall

¤  Gives some mechanisms for exploiting fine-grained parallelism

o   Titanium is a Java-based PGAS language

¤  Compiled, not interpreted

o   HPCS languages

¤  DARPA funded

¤  Has produced Fortress (Sun), X10 (IBM) and Chapel (Cray)

¤  All are "global view" languages with notions of locality

¤  More abstract than UPC and Co-array Fortran because you don't have to specify a fixed number of processes

¤  Fortress funding has been stopped but it's been open sourced (http://fortressproject.sun.com)

o   Hybrid programming used to mean several things. Here used as a mix of OpenMP and MPI

o   Can this be portable?

o   Large systems are almost always hierarchical in structure these days

¤  MPI everywhere method (e.g. MPI process per core/thread)

¤  Fully hybrid (e.g. one MPI process per server, one OpenMP thread per core/thread)

¤  Might want something in between for best performance

o   OpenMP 3.0 adds task parallelism to the existing data parallelism constructs
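
A minimal sketch of my own of the new task construct (the classic linked-list pattern, not taken from the slides): one thread walks the list and spawns a task per node, and the whole team executes them.

    #include <stdio.h>

    typedef struct node { int value; struct node *next; } node_t;

    static void process(node_t *n) { printf("%d\n", n->value); }

    static void walk(node_t *head)
    {
        #pragma omp parallel
        {
            #pragma omp single
            {
                for (node_t *n = head; n != NULL; n = n->next) {
                    #pragma omp task firstprivate(n)   /* one task per node */
                    process(n);
                }
            }
        }   /* implicit barrier: all tasks have finished here */
    }

    int main(void)
    {
        node_t c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
        walk(&a);
        return 0;
    }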

o   Useful tools for looking at hybrid programmes:

¤  Jumpshot

¤  Kojak

o   Tried a bunch of OpenMP/MPI hybrid programmes across three very different machines (regular cluster, BG/L, SiCortex)

o   Good OpenMP book by Barbara Chapman et al. on "Using OpenMP: portable shared memory parallel programming", MIT Press, 2008

o   Still very challenging to balance programmer productivity with software performance – generally simpler code runs more slowly. Same goes for portable code – it often sacrifices performance on a particular platform to gain portability

 

Optimization, Current Issues, Hands-on Download (20 min Alice)

 

o   Can download some example codes from NERSC

o   http://www.nersc.gov/~akoniges/CODES/

o   E.g. parallel_jacobi.tar

o   It's common to suffer from poor performance because of incorrect initialisation of OpenMP or MPI (flags etc)

o   The "seven dwarfs" from the "View from Berkeley" report have been renamed "parallel motifs"

o   Parallelmotifs.org

o   A place to exchange parallel programmes (useful!)

 

Topics from MPI

 

William Gropp, University of Illinois at Urbana-Champaign

o   MPI2 added parallel I/O, one-sided communication and dynamic process management

o   Current version is MPI2.2 (Sep 2009) http://www.mpi-forum.org/docs

o   MPI 3.0 is under development

o   Undiscovered (standard) MPI:

o   mpiexec (as opposed to mpirun) is a standard part of all MPI implementations and provides a standard way of starting up any MPI programme (should work within a batch/queuing system too)

o   MPI_Comm_split allows you to split processes into new communicators, i.e. new groupings and orderings of the existing processes (sketch below)
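
A minimal sketch of my own – the grouping of 4 ranks per "node" is purely illustrative:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int world_rank, world_size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* Split ranks into groups of 4, e.g. one sub-communicator per node */
        int color = world_rank / 4;   /* ranks with the same color share a communicator */
        int key   = world_rank % 4;   /* ordering of ranks inside the new communicator  */
        MPI_Comm node_comm;
        MPI_Comm_split(MPI_COMM_WORLD, color, key, &node_comm);

        int node_rank, node_size;
        MPI_Comm_rank(node_comm, &node_rank);
        MPI_Comm_size(node_comm, &node_size);
        printf("world %d/%d -> group %d, rank %d/%d\n",
               world_rank, world_size, color, node_rank, node_size);

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }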

o   MPI enables various profiling tools

¤  E.g. for MPICH2:

á      SLOG/Jumpshot – timeline visualisation

á      FPMPI – summary statistics

á      Collcheck – runtime checking of consistency in use of, for example, collective calls

o   Useful MPI naming conventions:

¤  Every MPI function has a "P" version (e.g. PMPI_Send) which can be used to build profiling wrappers (sketch below), and

¤  The send calls also have an "S" (synchronous) version (e.g. MPI_Ssend) which helps narrow down run-time bugs by forcing each send to complete only once its matching receive has been posted
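
The profiling-interface sketch referred to above (my own, and it assumes the const-qualified prototypes of newer MPI versions): intercept the application's MPI_Send calls, count them, and forward to the real implementation via PMPI_Send.

    #include <mpi.h>

    static long send_calls = 0;   /* simple per-process counter */

    /* This definition shadows the library's MPI_Send; the real routine
       is still reachable as PMPI_Send. */
    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        send_calls++;
        return PMPI_Send(buf, count, type, dest, tag, comm);
    }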

o   Try to defer synchronisation

¤  I.e. transfer data then synchronise afterwards

o   Most MPI applications use more synchronisation than they need to

o   Use nonblocking operations (MPI_Irecv, MPI_Isend etc) and MPI_Waitall to defer synchronisation
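
A sketch of my own of the deferred-synchronisation pattern for a 1-D halo exchange (a fragment: the buffers and neighbour ranks are assumed to come from the caller):

    #include <mpi.h>

    /* Post all receives and sends up front, do independent work, and
       synchronise once at the end with MPI_Waitall. */
    void halo_exchange(double *send_left, double *send_right,
                       double *recv_left, double *recv_right,
                       int n, int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[4];

        MPI_Irecv(recv_left,  n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Irecv(recv_right, n, MPI_DOUBLE, right, 1, comm, &reqs[1]);
        MPI_Isend(send_left,  n, MPI_DOUBLE, left,  1, comm, &reqs[2]);
        MPI_Isend(send_right, n, MPI_DOUBLE, right, 0, comm, &reqs[3]);

        /* ... computation that does not need the halo data goes here ... */

        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    }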

o   Try not to enforce an execution order if not strictly necessary

o   Make sure you check MPI_Wtick (the timer resolution) if you are timing your MPI code with MPI_Wtime
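
E.g. (my own fragment):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        double t0 = MPI_Wtime();
        /* ... region being timed ... */
        double t1 = MPI_Wtime();

        /* MPI_Wtick() is the timer resolution: differences smaller than
           this are not meaningful. */
        printf("elapsed = %g s (resolution %g s)\n", t1 - t0, MPI_Wtick());

        MPI_Finalize();
        return 0;
    }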

o   Use MPI_Send_init et al. (persistent requests) – often faster as they pre-initialise parts of the communication
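
A sketch of my own of the persistent-request pattern (fragment; the buffers, peer rank and iteration count come from the caller):

    #include <mpi.h>

    /* Set the send/recv up once, then just start and wait on them in
       every iteration of the main loop. */
    void exchange_loop(double *sendbuf, double *recvbuf, int n,
                       int peer, int iterations, MPI_Comm comm)
    {
        MPI_Request reqs[2];
        MPI_Send_init(sendbuf, n, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
        MPI_Recv_init(recvbuf, n, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

        for (int it = 0; it < iterations; it++) {
            MPI_Startall(2, reqs);
            /* ... computation that does not touch the buffers ... */
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        }

        MPI_Request_free(&reqs[0]);
        MPI_Request_free(&reqs[1]);
    }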

o   Using RMA (one-sided operations) can also gain a lot of performance on hardware that supports remote memory access (but not all MPI implementations do this well, e.g. IBM)
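
A minimal sketch of my own of fence-synchronised one-sided communication: each rank puts its rank number into its right-hand neighbour's window.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local = -1.0, value = (double)rank;
        MPI_Win win;
        MPI_Win_create(&local, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        MPI_Put(&value, 1, MPI_DOUBLE, (rank + 1) % size, 0, 1, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);   /* 'local' now holds the left neighbour's rank */

        printf("rank %d received %g\n", rank, local);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }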

o   PETSc is a good example of an application-level library that uses MPI to hide most of the horrible details of the parallelism from the end user

 

How to Design Real Applications (30 min David)

 

o   Unstructured meshes require connectivity lists and larger numbers of ghost zones (halos)

o   Keep communication local – global comms kills scalability

o   Turn some comms into compute as we can scale compute better than communications

o   To scale well (slide 266):

o   Never access the same file from all processors

o   Never search the file system

o   Remove any code that depends on the number of partitions

o   OpenMP has dynamic memory management but MPI currently does not

o   Important for Adaptive Mesh Refinement (AMR) techniques

o   Might cause you to use a hybrid OpenMP/MPI approach

o   Single process/node optimisation is still important

Hybrid Programming (1 hr Rolf)

 

o   MPI + OpenMP on SMP nodes

o   Some (but not all) of the Multi-Zone NAS parallel benchmarks can be implemented efficiently in a hybrid manner (p314)

o   Switching on OpenMP in the compiler may switch off other compiler optimisations

o   When using a hybrid OpenMP/MPI model, reserving n of m threads for MPI communication (to overlap with OpenMP computation) causes problems because OpenMP work-sharing expects to have all m threads for compute, not just the remaining m-n

o   Intel has a distributed memory OpenMP called Intel Cluster OpenMP

o   Only works well for smallish systems with very lightweight communications requirements (i.e. not in many cases)

 

Tutorial Day 2 - Hybrid MPI and OpenMP Parallel Programming

 

Rolf Rabenseifner, Georg Hager, Gabriele Jost

o   Major programming models:

o   Pure MPI (one MPI process on each core)

o   Hybrid programming (MPI+OpenMP)

¤  Shared memory within an SMP node, MPI between SMP nodes

o   Within hybrid programming, two approaches:

o   No overlap of comms & compute ("masteronly": only the master thread calls MPI, outside parallel regions), or

o   Overlap comms & compute

o   With the pure MPI model:

o   Need to make sure your MPI is installed correctly so that it takes advantage of shared memory for communication between cores within an SMP node

o   Does the application topology fit on your hardware topology?

o   Hybrid (master only) model:

o   No message passing within SMP nodes

o   No topology problem

o   But all other threads are idle while the master thread communicates

o   Would this maximise inter-node bandwidth?
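
A skeleton of my own of the masteronly pattern (the halo_exchange() stub and the array are just placeholders):

    #include <mpi.h>

    #define LOCAL_N 100000
    static double u[LOCAL_N];

    static void halo_exchange(void)
    {
        /* MPI calls to the neighbouring nodes would go here;
           executed by the master thread only */
    }

    int main(int argc, char **argv)
    {
        int provided;
        /* FUNNELED: only the thread that called MPI_Init_thread makes MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        if (provided < MPI_THREAD_FUNNELED)
            MPI_Abort(MPI_COMM_WORLD, 1);

        for (int step = 0; step < 100; step++) {
            halo_exchange();                 /* comms: the other threads are idle */

            #pragma omp parallel for         /* compute: all cores of the node */
            for (int i = 0; i < LOCAL_N; i++)
                u[i] += 1.0;
        }

        MPI_Finalize();
        return 0;
    }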

o   Case studies: pure MPI vs. Hybrid MPI+OpenMP

o   The multi-zone NAS parallel benchmarks (for fluid dynamics)

o   Can be freely downloaded from: http://www.nas.nasa.gov/Resources/Software/software.html

o   Different sizes of grid points per zone, from 304x208x17 to 4224x3456x92

o   Examples of block tridiagonal, LU decomposition and scalar pentadiagonal simulated CFD applications

o   Block tridiag and LU are good candidates for MPI+OpenMP

o   Pure MPI better for scalar pentadiagonal

o   OpenMP and MPI currently both have the restriction that they don't allow the programmer to control the mapping of threads to cores

o   May be able to use numactl to control the binding of processes/threads to cores, sockets and memory

o   "Golden Rule" of ccNUMA:

¤  A memory page gets mapped into the local memory of the processor that first touches it!

¤  Except if there is not enough local memory available

¤  Caveat: "touch" means "write", not "allocate"

¤  It is sufficient to touch a single item to map the entire page

¤  May want to use memalign() rather than malloc() to align buffers to pages
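
A sketch of my own of first-touch initialisation: write the array with the same (static) loop schedule that the compute loops will later use, so each page ends up in the local memory of the thread that will work on it.

    #include <stdlib.h>

    double *alloc_and_touch(size_t n)
    {
        double *a = malloc(n * sizeof(double));  /* allocation alone maps no pages */

        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            a[i] = 0.0;                          /* the write is the "touch" */

        return a;
    }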

o   The OS uses part of main memory for the disk buffer (filesystem) cache, which fills up local memory and can result in ccNUMA pages being allocated remotely when the user expects them to be local. Need to clean up memory periodically to avoid gradual performance degradation caused by this issue

o   Avoid barriers whenever possible

o   Using OpenMP may turn off other compiler optimisations which loses a lot of performance (check your compiler details)

o   The Intel Thread Checker is one of the most useful tools for debugging thread correctness

 

OpenCL tutorial

 

Tim Mattson  (Intel Corporation), Ian Buck  (NVIDIA), Mike Houston  (AMD), Ben Gaster  (AMD)

 

o   OpenCL is all about heterogeneous computing

o   Multiple CPUs and GPUs within a single platform

o   Want to be able to use all these available resources within a single programming model

o   Microprocessors are gaining many more cores and becoming increasingly heterogeneous

o   OpenCL has many interesting companies involved

o   GPU companies

o   Cell phone companies

o   Embedded processor companies

o   OpenCL is useful from cell phones to supercomputers

o   "Embedded profile" integral to the standard

o   Opened up to the Khronos standards group in June 2008

o   Released in December 2008

o   Conformance tests released in May 2009

o   OpenCL is already built in to Apple's Snow Leopard (OS X 10.6)

o   IBM has OpenCL for their POWER architecture

o   Nvidia and AMD have OpenCL already

o   Intel will have it within 12 months (according to their speaker!)

o   OpenCL 1.1 to be released 1H2010

o   OpenCL 2.0 in definition phase now, to be released in 2012

o   OpenCL programming model:

o   Assumes one host and one or more compute devices

o   Each compute device is composed of one or more compute units

o   Compute units are further divided into multiple processing elements

o   The execution model defines a problem domain and executes a kernel invocation for each point in this domain

¤  E.g. if processing pixels, have one kernel invocation per pixel

¤  This is predominantly data parallel

o   Each kernel invocation is called a "work item" (avoided use of threads, tasks etc.)

o   Work items are grouped into "work groups" and synchronisation is allowed between work items within work groups. Synching not allowed between work items in different work groups

o   Data parallel functions are executed for each work item
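
A minimal example of my own in OpenCL C – one work item per element of the problem domain:

    __kernel void square(__global const float *in,
                         __global float *out,
                         const unsigned int n)
    {
        size_t i = get_global_id(0);   /* this work item's point in the domain */
        if (i < n)                     /* guard, since the global size may be rounded up */
            out[i] = in[i] * in[i];
    }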

o   Four different levels in the memory hierarchy:

¤  Private memory – local to one work item

¤  Local memory – shared within a work group

¤  Global/Constant memory – visible to all workgroups

¤  Host memory – on the CPU

o   Memory management is explicit for transferring between these levels

o   Memory consistency is "relaxed"

¤  Consistent within a work item

¤  Consistent at a barrier within a workgroup

¤  Not consistent between workgroups
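
A sketch of my own showing the middle levels of that hierarchy: each work group sums its chunk in __local scratch memory, with barriers giving consistency inside the group (assumes the work-group size is a power of two).

    __kernel void partial_sum(__global const float *in,
                              __global float *group_sums,
                              __local  float *scratch)
    {
        size_t lid   = get_local_id(0);
        size_t lsize = get_local_size(0);

        scratch[lid] = in[get_global_id(0)];     /* private -> local */
        barrier(CLK_LOCAL_MEM_FENCE);            /* consistent within the work group */

        for (size_t offset = lsize / 2; offset > 0; offset /= 2) {
            if (lid < offset)
                scratch[lid] += scratch[lid + offset];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0)
            group_sums[get_group_id(0)] = scratch[0];   /* local -> global */
    }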

o   OpenCL is derived from ISO C99

¤  But doesn't support function pointers, recursion, variable length arrays or bit fields

¤  Adds parallelism, vector types and synchronisation

¤  Adds address space qualifiers (for memory hierarchy)

¤  Adds optimised image format access (for interworking with OpenGL)

¤  Lots of additional built-in functions

o   Double precision is an optional part of the standard

o   Lots of vector data types and operations that are portable

¤  Vector lengths of 2, 4, 8 and 16

¤  Most types supported

¤  Endian safe

¤  Aligned at vector length

¤  Vector operations and built-in functions

¤  E.g. int4 vi0 = (int4) 19; // a vector of 4 ints all initialised to 19

¤  Powerful and expressive vector permutation expressions

¤  E.g. int8 v8 = (int8) (vi0, vi1.s01, vi1.odd);

¤  This feature alone makes OpenCL worthwhile!

o   Contexts and queues:

¤  Used to manage the state of the world for the OpenCL programme

¤  Include devices, kernels, program and memory objects

¤  Similar to MPI, OpenGL etc.

¤  Command queues coordinate execution of kernels

¤  Queued in order, potentially executed out of order

¤  Events are used to synchronise execution instances

o   All OpenCL programmes have two parts: kernel code and host code

o   Run-time compilation of kernels (when the target is known)

o   Supports both task and data parallelism

o   Also possible to run native C/C++ functions not compiled using the OpenCL compiler (called "native kernels")
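
Pulling the host-side pieces together (context, queue, run-time compilation, buffers, launch) – a sketch of my own with error checking and cleanup omitted; the header is <CL/cl.h> (<OpenCL/opencl.h> on OS X):

    #include <stdio.h>
    #include <CL/cl.h>

    static const char *src =
        "__kernel void square(__global const float *in, __global float *out) {\n"
        "    size_t i = get_global_id(0);\n"
        "    out[i] = in[i] * in[i];\n"
        "}\n";

    int main(void)
    {
        enum { N = 1024 };
        float in[N], out[N];
        for (int i = 0; i < N; i++) in[i] = (float)i;

        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);   /* run-time compilation */
        cl_kernel kernel = clCreateKernel(prog, "square", NULL);

        cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                      sizeof(in), in, NULL);
        cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(out), NULL, NULL);

        clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);

        size_t global = N;   /* one work item per element */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, sizeof(out), out, 0, NULL, NULL);

        printf("out[10] = %f\n", out[10]);
        return 0;
    }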

 

o   In summary: OpenCL looks to be a lot further along in its development than I'd realised, and seems to be coming along well. Real codes are being ported now, and while it tends to be more verbose than Cuda, it's cross-platform, and does come with the option for C++ bindings that hide more of the detail and bring OpenCL closer to Cuda's ease of use.