SIAM PP10 Conference, Seattle, February 24-26 2010

 

http://www.siam.org/meetings/pp10/

 

Brad Chamberlain, Cray, Inc., "HPC Programming Models: Current Practice, Emerging Promise"

 

Taxonomy of programming languages 2010:

o   Comms libraries (MPI)

o   Shared memory (OpenMP)

o   GPU programming models (Cuda, OpenCL)

o   Hybrid models (OpenMP+MPI)

o   Traditional PGAS (UPC, Co-Array Fortran)

o   HPCS (Chapel, X10, Fortress)

o   Others

What is MPI really good for? Not ideal at anything really. Not likely to last in the long term.

MPI 3.0 underway right now: improved resilience, hybrid computing, asynchronous collectives, improved scalability.

Multiresolution languages: some are quite low-level and let you get really close to the machine (e.g. MPI, pthreads). Others are higher-level – HPF, ZPL – but these don't let you bypass the high-level abstractions to get close to the machine when necessary.

Chapel is designed to have a spectrum of features from the low level to the high level.

PGAS isn't shared memory, it's shared namespaces – an important distinction! Has a strong sense of ownership of variables – local vs. remote. E.g. UPC, Co-Array Fortran, Titanium, several more. Traditional PGAS is almost entirely SPMD today (Single Program, Multiple Data).

Chapel is starting to target GPUs now – just have to express parallelism and locality, the compiler/run-time takes care of everything else. Goes beyond SPMD parallelism, an advantage over traditional PGAS languages.

Two main thrusts today: evolutionary based on MPI 3.0 and OpenMP 4.0, and revolutionary, based on Chapel/X10 etc.

 

Jonathan Berry, Sandia National Laboratories, "Graph Analysis with High-Performance Computing"

 

Graphs are sets of vertices and edges. Edges are "meta data". Applications in homeland security, biology etc.

Crunching graphs doesn't work well on traditional commodity architectures because the workload has little spatial or temporal data locality – it's better suited to massively multi-threaded architectures with almost no memory hierarchy (close to randomly accessing memory).

Cray's "Threadstorm" XMT design is more like this – no processor caches, instead use chip real estate for thread contexts. Becomes massively threaded and latency tolerant, suited for graph algorithms.

 

Leonid Oliker et al, LBNL, "Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation"

 

This is a NERSC research project. They have a 4-way Opteron-based cluster.

Want to be able to optimise for multiple different architectures – labour intensive, so need this to be automatic.

This is a "seven dwarfs" approach, optimising multiple kernels across an entire family of architectures.

Being driven by LBMHD – Lattice Boltzmann Magnetohydrodynamics (CFD + Maxwell's equations).

Essentially this is a stencil operation. Used for plasma turbulence simulation.

One thing autotuning can do is maximise cache efficiency by automatically padding as appropriate.
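
A minimal sketch of the kind of padding an auto-tuner might apply (illustrative only, not the NERSC auto-tuner; the names and the pad parameter are made up):

// Hypothetical illustration: pad the leading dimension of a grid so that
// successive rows do not map to the same cache sets / TLB pages. An
// auto-tuner would sweep 'pad' (e.g. 0, 8, 16, ...) and keep the fastest.
#include <vector>
#include <cstddef>

struct PaddedGrid {
    std::size_t nx, ny, stride;          // stride = nx + pad
    std::vector<double> data;
    PaddedGrid(std::size_t nx_, std::size_t ny_, std::size_t pad)
        : nx(nx_), ny(ny_), stride(nx_ + pad), data(stride * ny_) {}
    double&       at(std::size_t i, std::size_t j)       { return data[j * stride + i]; }
    const double& at(std::size_t i, std::size_t j) const { return data[j * stride + i]; }
};

// A 5-point stencil sweep; with nx a power of two and pad = 0, the row-to-row
// accesses can thrash a set-associative cache, which the padding avoids.
void stencil_sweep(const PaddedGrid& in, PaddedGrid& out) {
    for (std::size_t j = 1; j + 1 < in.ny; ++j)
        for (std::size_t i = 1; i + 1 < in.nx; ++i)
            out.at(i, j) = 0.25 * (in.at(i - 1, j) + in.at(i + 1, j) +
                                   in.at(i, j - 1) + in.at(i, j + 1));
}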

Can also autovectorise to reuse data in the stencil operations, reduce TLB misses etc.

Have extended their auto-tuner to now target OpenMP+MPI – distributed memory for the first time.

Search space for best configuration of optimisations is too large to search exhaustively. They use a 3 stage greedy algorithm instead.

Cuboid data partitioning isn't always best: while it minimises the surface-to-volume ratio (and hence communication), it may not maximise cache performance.

Takes 10-12 hours to do the tuning, the calculation then runs for days, so the trade-off is worth it. Not necessarily always the case though.

Getting close to maximum theoretical performance per node, have still to look at this in the distributed memory case though.

 

Aparna Chandramowlishwaran, Georgia Institute of Technology, "Optimizing and Tuning the Fast Multipole Method for Multicore and Accelerator Systems"

 

This work focuses on single node tuning.

Fast multipole is O(n) on n particles, compared to O(n^2) for direct methods and O(n log n) for Barnes-Hut.

Two steps: build tree then evaluate potential on n targets.

Recursively subdivide the domain to build the tree, until each box contains no more than q points.
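
A rough sketch of this tree-build step (an illustrative 2-D quadtree version, not the speaker's code; Point, Node and q are made-up names):

// Recursively subdivide a 2-D box until each leaf holds at most q points.
// No guard against > q coincident points – fine for a sketch.
#include <vector>
#include <cstddef>

struct Point { double x, y; };

struct Node {
    double cx, cy, half;                 // box centre and half-width
    std::vector<Point> pts;              // points, stored only at leaves here
    std::vector<Node>  kids;             // empty for a leaf
};

void build(Node& n, std::size_t q) {
    if (n.pts.size() <= q) return;       // few enough points: this is a leaf
    n.kids.resize(4);
    double h = n.half * 0.5;
    for (int k = 0; k < 4; ++k) {
        n.kids[k].cx   = n.cx + ((k & 1) ? h : -h);
        n.kids[k].cy   = n.cy + ((k & 2) ? h : -h);
        n.kids[k].half = h;
    }
    for (const Point& p : n.pts) {       // push each point down to its child box
        int k = (p.x > n.cx ? 1 : 0) | (p.y > n.cy ? 2 : 0);
        n.kids[k].pts.push_back(p);
    }
    n.pts.clear();
    for (Node& c : n.kids) build(c, q);  // recurse until every leaf has <= q points
}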

Tuning included optimising FFTW plans, SIMDization, a Newton-Raphson approximation, structure-of-arrays layout and matrix-free computation (what's this?)

Had an interesting result that an Intel Nehalem was as fast as a GPU, but they didn't really understand why this was the case.

 

Haoqiang Jin, NASA, "High Performance Computing Using MPI and OpenMP in Multi-core and Many-core Environments"

 

90% of codes at NASA still written in MPI. Used for performance and portability, but far from ideal.

Cart3D uses MPI and OpenMP (either/or, not hybrid). Achieved close to ideal scaling up to 512 cores.

Often have problems with machine stability on large jobs bringing MPI down.

OpenMP 3.0 being extended to address locality (it currently assumes a flat memory space).

 

Darren Kerbyson, LANL, "Experiences from the Roadrunner Petascale Hybrid System"

 

Darren was part of Stephen Jarvis's performance modelling team at Warwick.

Did a lot of the initial performance modelling for Roadrunner (Cell-based) before it was procured.

Also modelling Blue Waters ahead of its installation at UIUC.

 

Jack Dongarra talk

 

Exascale programme began three years ago in the US. DOE funded. Sequence of workshops looking at building a science case for EXAFLOPS.

The current #1 system, Jaguar at ORNL, is 2 PFLOPS peak and draws 6 MW just for the machine, not including cooling! 225,000 cores, MTBF of a few days.

Want an EXAFLOP machine capped at 20MW, $200M. System concurrency is 10,000 times more than the largest systems today!

The memory capacity and memory bandwidth bottlenecks will tighten by an order of magnitude.

The average core count in the top 20 systems of the Top 500 is 90,000 already.

Need to break away from the traditional bulk-synchronous processing paradigm.

 

Jack's next talk, "Communication avoiding algorithms in PLASMA and MAGMA"

 

Don't forget that bandwidth is improving much more slowly than compute.

Benefits of these algorithms show up for tall, skinny matrices in things like QR. Squarer matrices negate a lot of the benefits.

Have done some interesting work proving lower bounds for FLOPS, bandwidth and messages.

 

Mike Giles, Oxford, "Computational finance on GPUs"

 

Computational finance accounts for 10% of the systems in the Top500 (!) These are primarily used for throughput computing – vast numbers of small, independent tasks.

Financial institutions want to increase their simulations by an order of magnitude over the next year to do better risk management.

Computational finance covers:

o   Options pricing – investment banks – Monte Carlo (60%), PDE (30%), other semi analytic (10%)

o   High-frequency algorithmic trading (not computationally intensive)

o   Actuarial science – pension funds, insurance

A 'call option' gives you the right (not an obligation) to buy something at a fixed price.

Stochastic differential equations (SDEs) used to model the behaviour of stock prices etc.

Geometric Brownian motion (Black-Scholes) used to model stock prices.

Might do 100,000 path simulations to value a single option.

Monte Carlo is trivially parallel – each path needs an independent set of random numbers. This is expensive because banks have to analyse their position overnight, every night. They also have to constantly assess the sensitivity of an option's value to changes in various parameters, such as interest rates. The aim of the banks is to hold offsetting risk so they don't gamble.
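
A minimal sketch of this kind of Monte Carlo valuation for a European call under geometric Brownian motion (illustrative parameters and code, nothing like a production pricing system):

// Monte Carlo pricing of a European call under Black-Scholes (GBM) dynamics.
// Each path is independent, which is what makes this trivially parallel.
#include <cmath>
#include <random>
#include <cstdio>

int main() {
    const double S0 = 100.0, K = 100.0;   // spot and strike (made up)
    const double r = 0.05, sigma = 0.2;   // risk-free rate and volatility
    const double T = 1.0;                 // maturity in years
    const int    N = 100000;              // number of paths

    std::mt19937_64 rng(12345);
    std::normal_distribution<double> normal(0.0, 1.0);

    double sum = 0.0;
    for (int i = 0; i < N; ++i) {
        // Exact one-step solution of the GBM SDE: dS = r S dt + sigma S dW
        double Z  = normal(rng);
        double ST = S0 * std::exp((r - 0.5 * sigma * sigma) * T
                                  + sigma * std::sqrt(T) * Z);
        sum += std::fmax(ST - K, 0.0);    // call payoff at maturity
    }
    double price = std::exp(-r * T) * (sum / N);   // discount the mean payoff
    std::printf("estimated call price: %f\n", price);
    return 0;
}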

PDEs are an alternative approach. Solve a Black-Scholes PDE (geometric Brownian Motion) backwards in time from the payoff value. These are simple, scalar PDEs.

PDEs suffer the "curse of dimensionality": the cost of Monte Carlo grows linearly with dimension, but the cost of a PDE solve grows exponentially. Don't see many PDE solves in more than 3 dimensions because of this.
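
A rough sketch of the PDE alternative – an explicit finite-difference solve of the 1-D Black-Scholes PDE, marching backwards in time from the payoff (illustrative only; real codes would use implicit/Crank-Nicolson schemes and far more careful boundary handling):

// Explicit finite-difference solve of V_t + 0.5 sigma^2 S^2 V_SS + r S V_S - r V = 0,
// stepping backwards from the terminal (payoff) condition for a European call.
#include <cmath>
#include <vector>
#include <cstdio>

int main() {
    const double K = 100.0, r = 0.05, sigma = 0.2, T = 1.0;   // made-up parameters
    const int    NS = 200, NT = 20000;                        // grid points in S and t
    const double Smax = 4.0 * K, dS = Smax / NS, dt = T / NT; // small dt for stability

    std::vector<double> V(NS + 1), Vnew(NS + 1);
    for (int i = 0; i <= NS; ++i)               // terminal condition: call payoff
        V[i] = std::fmax(i * dS - K, 0.0);

    for (int n = 0; n < NT; ++n) {              // march backwards from t = T to 0
        for (int i = 1; i < NS; ++i) {
            double S   = i * dS;
            double dV  = (V[i + 1] - V[i - 1]) / (2.0 * dS);
            double d2V = (V[i + 1] - 2.0 * V[i] + V[i - 1]) / (dS * dS);
            Vnew[i] = V[i] + dt * (0.5 * sigma * sigma * S * S * d2V
                                   + r * S * dV - r * V[i]);
        }
        Vnew[0]  = 0.0;                                     // S = 0 boundary
        Vnew[NS] = Smax - K * std::exp(-r * (n + 1) * dt);  // far-field boundary
        V.swap(Vnew);
    }
    std::printf("value at S = K: %f\n", V[NS / 4]);         // Smax = 4K, so i = NS/4 is S = K
    return 0;
}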

Bloomberg has a large GPU cluster – 48 Teslas each with 4 GPUs. For them this was instead of buying 2000 CPUs. BNP Paribas has a small cluster. Many more doing proof of concept trials.

If you have a background in MPI+OpenMP, learning Cuda isn't too difficult.

A key issue in parallel RNGs is having a skip-ahead capability. Can jump N places in O(log N) operations.
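
A sketch of how O(log N) skip-ahead can work for a simple LCG, by repeated squaring of the per-step affine map (the constants are illustrative, not any particular library's parallel generator):

// x_{n+1} = A*x_n + C (mod 2^64), so the modular arithmetic is just unsigned wrap-around.
#include <cstdint>
#include <cstdio>

constexpr uint64_t A = 6364136223846793005ULL;   // illustrative LCG constants
constexpr uint64_t C = 1442695040888963407ULL;

// Jump the state forward by N steps in O(log N): keep the map for 2^k steps
// as a pair (ap, cp) meaning x -> ap*x + cp, and compose it into the result
// whenever the corresponding bit of N is set.
uint64_t skip_ahead(uint64_t x, uint64_t N) {
    uint64_t ap = A, cp = C;   // map for 2^k steps (k = current bit)
    uint64_t ar = 1, cr = 0;   // accumulated map (starts as the identity)
    while (N > 0) {
        if (N & 1) { ar = ap * ar; cr = ap * cr + cp; }
        cp = ap * cp + cp;     // square the step map: 2^k steps -> 2^(k+1) steps
        ap = ap * ap;
        N >>= 1;
    }
    return ar * x + cr;        // apply the combined map once
}

int main() {
    uint64_t x = 42;
    for (int i = 0; i < 1000; ++i) x = A * x + C;   // 1000 sequential steps
    std::printf("sequential=%llu  skipped=%llu\n",
                (unsigned long long)x,
                (unsigned long long)skip_ahead(42, 1000));   // should match
    return 0;
}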

On-chip caches in future GPUs will make PDEs much easier – automates the data reuse of halo cells etc.

Looking forward to greater C++ support in next-gen GPUs since most finance codes are written in this language.

 

Keren Bergman, Columbia U, "Photonic on-chip networks for performance-energy optimized computing systems"

 

Lightwave research lab, Columbia U.

On-chip networks can use 30-50% of chip power budget (not sure I believe this, it was a tiny fraction on ClearSpeed's processors).

Off-chip comms is certainly a big power hog.

It's typical for a processor to have 10X more intra-chip bandwidth than inter-chip.

Photonics may change the rules for bandwidth per watt. Once bits have been modulated they can be transferred as far as you want, either on-chip or off. Off-chip BW can be the same as on-chip BW (not sure I believe this either!).

Systems with 3D memory layers may also benefit from having layers of photonic network on chip.

Their group is looking at BW for exascale systems, designing a chip with 10 TBytes/s on-chip and off-chip BW.

Photonics can be integrated in CMOS processes, including things like optical waveguides (nano-wires).

First complete optical link in Aug 2009 (with collaborators at Cornell).

Can easily build things like a torus on-chip.

Have a simulator called Phoenix-sim about to be publicly released.

They can do 38 x 10 Gbps links at 3.2 W total (very low power!) – in simulation.

 

Michael Heroux, Sandia, "Challenges and progress in effective use of scalable multi-core and many-core systems"

 

All current exascale system designs are exascale in FLOPS only, leaving memory size and bandwidth behind.

 

Hidetoshi Ando, Yamanashi University, "Performance evaluation of preconditioned Krylov subspace methods on GPU"

 

First Conjugate Gradient (CG) solutions on GPUs appeared in 2003 as simple shaders – took heroic efforts!

Have 10X speedup on a GPU vs optimised code on 4 core host (ported ICCG).

 

Alan Williams, Sandia, "Multicore and Manycore performance studies using a finite element mini-application"

 

MiniFE assembles a sparse linear system and solves it using a conjugate gradient (CG) algorithm.
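
A minimal sketch of the kind of CG kernel this exercises – an unpreconditioned CG on a CSR matrix (illustrative code, not MiniFE's actual implementation):

// Conjugate gradient solve of A x = b for a symmetric positive-definite CSR matrix.
#include <vector>
#include <cmath>
#include <cstddef>

struct Csr {                       // compressed sparse row matrix
    std::vector<std::size_t> rowptr, col;
    std::vector<double> val;
    std::size_t n;
};

static void spmv(const Csr& A, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < A.n; ++i) {
        double s = 0.0;
        for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            s += A.val[k] * x[A.col[k]];
        y[i] = s;
    }
}

static double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

void cg(const Csr& A, const std::vector<double>& b, std::vector<double>& x,
        int max_iter = 1000, double tol = 1e-8) {
    x.assign(A.n, 0.0);                          // start from zero, so r = b
    std::vector<double> r = b, p = b, Ap(A.n);
    double rr = dot(r, r);
    for (int it = 0; it < max_iter && std::sqrt(rr) > tol; ++it) {
        spmv(A, p, Ap);                          // dominant cost: sparse matrix-vector product
        double alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < A.n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (std::size_t i = 0; i < A.n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
}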

Sandia has a huge investment in MPI-parallel, unstructured-mesh, finite-element applications.

MiniFE is available as open source, download from the Mantevo website – http://software.sandia.gov/mantevo

 

Rick Molloy, Microsoft, "A closer look at the parallel pattern library, asynchronous agents library and concurrency runtime in Visual Studio 2010"

 

The parallel pattern library is written in C++ for data and task parallelism (async).

Have partnered with Intel to use their Threaded Building Blocks (TBB).

Looks quite interesting, but of course it's OS-specific!

 

Anthony Lee, Oxford, "Massively parallel population-based Monte Carlo methods with many-core processors"

 

This is a class of MC methods particularly well suited to parallelisation on shared memory systems.

These include SMC (sequential Monte Carlo) samplers, particle filters and population-based Markov Chain Monte Carlo (MCMC).

http://www.oxford-man.ox.ac.uk/gpuss

 

Jonathan Cohen, Nvidia, "OpenCurrent: solving PDEs on structured grids with Cuda"

 

Open source – code.google.com/p/opencurrent (Apache 2.0 license)

A general library for solving PDEs over structured grids. Initially targeting fluid simulations but it's quite general.

Has been balancing ease-of-programming vs. performance.

Want to be able to work from a high-level algorithmic description.

Worked hard to avoid all serial bottlenecks.

Have implemented a synchronous co-array model to support multiple GPUs.

Now has a full incompressible Navier-Stokes solver working on two GPUs. Should scale at about 90% efficiency across the two GPUs.

Going to add kernel generation via C++ template meta-programming. Can make programming much easier and doesn't have to cost much performance.

 

Mike Giles, Oxford, "A framework for parallel unstructured grid applications on GPUs"

 

PDEs are important in a whole variety of applications.

Want a suitable level of abstraction to separate the user's specification of the app from the details of the parallel implementation. Aim to achieve code longevity and near-optimal performance through re-targeting the back-end to different hardware.

This work is in collaboration with Paul Kelly at Imperial.

Mike produced the OPlus unstructured-grid solver framework for Rolls-Royce 10 years ago. MPI-based, for the HYDRA CFD codes on clusters of up to 200 nodes.

OP2 is an open source project. It's an "active library" using code transformation to generate Cuda, OpenCL or OpenMP/AVX code for GPUs and CPUs.

Has abstractions for sets (nodes, edges, faces), datasets (flow variables), pointers (e.g. from edges to nodes), and parallel loops. These operate over all members of one set, with at most one level of indirection. The user specifies how data is used (read-only, write-only or increment).

No dynamic grid adaptation.

Potential data hazard when incrementing indirectly referenced data on unstructured grids. Can solve this by giving edges owners, and edges with two owners get incremented twice. Alternatively, can colour edges so that no two edges of the same colour update the same node; this avoids data conflicts by construction. A third option is to use atomic updates, though current hardware support for these is slow.
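
A sketch of the colouring idea (illustrative, not OP2's implementation): colour edges greedily on the host so that no two edges sharing a node get the same colour, then launch one kernel per colour so the increments need no atomics.

// Greedy edge colouring: an edge takes the smallest colour not yet used by
// any edge sharing either of its two nodes.
#include <vector>
#include <cstddef>
#include <cuda_runtime.h>

std::vector<int> colour_edges(const std::vector<int>& n0,
                              const std::vector<int>& n1, int num_nodes) {
    std::vector<std::vector<int>> used(num_nodes);   // colours already seen at each node
    std::vector<int> colour(n0.size());
    for (std::size_t e = 0; e < n0.size(); ++e) {
        int c = 0;
        auto clash = [&](int node) {
            for (int uc : used[node]) if (uc == c) return true;
            return false;
        };
        while (clash(n0[e]) || clash(n1[e])) ++c;
        colour[e] = c;
        used[n0[e]].push_back(c);
        used[n1[e]].push_back(c);
    }
    return colour;
}

// One edge per thread; the arrays passed here hold only the edges of a single
// colour, so no two threads in this launch write to the same node.
__global__ void increment_nodes(const int* e_n0, const int* e_n1,
                                const double* edge_flux, double* node_val,
                                int num_edges) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e < num_edges) {
        node_val[e_n0[e]] += edge_flux[e];   // safe: no conflicts within a colour
        node_val[e_n1[e]] -= edge_flux[e];
    }
}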

Has a prototype running today.

 

Horst Simon, LBNL, "Energy efficient computing – from bits to buildings"

 

Have an "energy efficient computing" webpage on the LBNL website – worth looking up!

Check out the energy "spaghetti" chart. 55% of all energy is just lost, half of this lost in electricity transmission.

Two interrelated issues: "bits" and "buildings".

http://www.energy.ca.gov/comission/comissioners/rosenfeld_docs/index.html

Khazzoom-Brookes postulate – "work will fill the available time". I.e. saving energy at the micro level leads to higher energy consumption at the macro level.

Bits: IT is responsible for $16B/year in energy costs, 200 TWh/year. Paper from Jonathan Koomey of Stanford in 2006 estimating total power consumption by servers in the US and the world.

"SMART 2020: enabling the low carbon economy in the information age" – The Climate Group. Global ICT carbon footprint by subsector, geography etc.

It can cost $5-10M to go from a 2 to 3 MW power supply for a supercomputer.

Microsoft has a million server datacentre in Chicago, 400 containers, 300MW power supply!

Datacentre Infrastructure Efficiency (DCiE) – related to PUE (its reciprocal). Average PUE is about 1.9, state of the art is about 1.2. But don't forget this is just a ratio, not an absolute number!

LBNL is aiming for a PUE of 1.05 for their new data centre.

Dale Sartor is pushing data centre energy efficiency from UC Berkeley.

One of the biggest energy issues in the electronics is in off-chip energy movement – bandwidth costs power.

Mentioned Green Flash as an example of an energy-efficient, application-specific supercomputer for climate modelling.

Green Flash is aiming for 100x power efficiency over mainstream HPC approaches.

Power consumption is going to change the economics of computing – might drive more people to use clouds.

 

John Shalf, LBNL, "The architectural requirements driving language evolution"

 

John is one of the main people behind the "Green Flash" climate modelling supercomputer design at LBNL.

Concerned with the possibility of 3 disruptive transitions over the next 10 years – would rather just have one!

Sustained to peak flop ratio is the wrong metric if flops are cheap.

The programming language should reflect the new reality of the actual costs of the underlying system.

Locality of data is really going to matter (and matter more and more in the future). The energy cost of accessing off-chip memory is 100X higher than on-chip. Caches get in the way of this because they hide the true cost of accessing that data.

Must have data-centric parallel expressions.

Need hierarchical data layout statements. (And must be multi-dimensional).

Fine-grained power management makes homogeneous systems look heterogeneous. Fault resilience can do the same.

Really need to go to asynchronous programming models to scale in the future.

 

Fred Wubs, Groningen, "A solver for bifurcation analysis of ocean-climate models"

 

Used Trilinos C++ libraries to parallelise their codes. Also GMRESR.

 

Eric Strohmaier, LBNL, "Parallel motifs as paradigms for comparing language performance"

 

"Motif" is the new name for "Dwarf"!

Working on "a pattern language for parallel motifs" at the Berkeley ParLab.

Collecting at least one classical example of each dwarf. Defined in the abstract, but comes with input test data and verification of the expected results so you can port a dwarf and test the implementation.

 

Mike Giles, "An introduction to GPU computing"

 

Minimum vector length of 32 per SM (8 cores per SM, each of which needs a minimum of 4 threads). Nvidia calls these 32 threads a "warp". Warps can be active or inactive (waiting for data).

Mike's rule of thumb is that you want at least 10,000 threads to get anywhere near peak performance on the GPUs. That's about 40 threads per core.

Memory access from device memory has a delay of 400-600 cycles; with 40 threads this is equivalent to 10-15 operations and can be managed by the compiler.

Code is written from the point of view of an individual thread.

Warp divergence is important to avoid if at all possible – threads have to execute all branches of a conditional statement. There's a special case called 'warp voting', fairly heavyweight, which can coalesce branches at run-time – the compiler inserts additional code to check for this where appropriate.

For codes with boundary conditions, treat them differently depending on the cost of computing those conditions. If they're cheap, just branch as necessary (small overhead). If expensive, have two kernels: one for interior points and a second for boundary points.
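
A sketch of the two-kernel approach for a simple 2-D stencil (illustrative CUDA, not code from the talk):

// Interior kernel: no boundary logic inside these warps, so no divergence.
#include <cuda_runtime.h>

__global__ void update_interior(const float* in, float* out, int nx, int ny) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1)      // interior points only
        out[j * nx + i] = 0.25f * (in[j * nx + i - 1] + in[j * nx + i + 1] +
                                   in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
}

// Boundary kernel: one thread per boundary point, so however expensive the
// boundary condition is, it never causes divergence in the interior warps.
__global__ void update_boundary(float* out, int nx, int ny) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int perimeter = 2 * (nx + ny) - 4;
    if (t < perimeter) {
        int i, j;
        if (t < nx)                   { i = t;       j = 0;                         }  // bottom row
        else if (t < 2 * nx)          { i = t - nx;  j = ny - 1;                    }  // top row
        else if (t < 2 * nx + ny - 2) { i = 0;       j = t - 2 * nx + 1;            }  // left column
        else                          { i = nx - 1;  j = t - (2 * nx + ny - 2) + 1; }  // right column
        out[j * nx + i] = 0.0f;       // e.g. a simple Dirichlet condition
    }
}

void step(const float* d_in, float* d_out, int nx, int ny) {
    dim3 block(16, 16);
    dim3 grid((nx + 15) / 16, (ny + 15) / 16);
    update_interior<<<grid, block>>>(d_in, d_out, nx, ny);
    int perimeter = 2 * (nx + ny) - 4;
    update_boundary<<<(perimeter + 255) / 256, 256>>>(d_out, nx, ny);
}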