ISC'09 day 3

International Supercomputing Conference 2009 (ISC'09)

ISC'09 is the second largest supercomputing conference on the calendar, and the largest outside the US.

The following notes are pretty much my live transcription as the conference unfolds, so please forgive any typos, unexplained acronyms etc. I hope you find them useful and/or interesting - please don't hesitate to get in touch if you have any questions!

Conference website.

See also day 1 and day 2 of the conference.


Last day of the conference. Apologies that these write-ups have backed up, but I wanted time to re-read them before sending!

First session: More Moore or Multi Trouble?

Multicore/Manycore: what should we demand from the hardware? (slides)

Yale Patt, University of Texas (an excellent speaker and widely regarded as a chip architecture guru)

  • Moore's Law in the future will mean doubling the number of cores on the chip? Maybe not
  • Processors will be heterogeneous
    • PentiumX/NiagaraY - i.e. a few sophisticated serial cores and lots of lightweight simple cores
  • Instruction Set Architectures (ISAs) aren't important anymore
  • 10nm is around 100 Angstroms so not many atoms in future transistors!
  • 3D (stacked) chips are giving the effect of larger (2D) chips
  • Processors started out as mostly logic for the core, but are now mostly cache
  • Interesting observation about hardware naturally being parallel - transistors all active concurrently etc
  • Still lots of improvement to come from branch predictors and cache architecture
    • Instruction Level Parallelism (ILP) is not dead or "done"
  • 50 billion transistors on chips soon (1-2B today on largest x86)
  • Transformation problem - from problem expressed in natural language to electrons moving on a chip
  • Power and bandwidth are blocking issues
  • With billions of transistors you can afford to have specialist cores lying around not being used very often but making a big difference when they're needed
    • Important they don't use much power when not in use though
  • People think sequentially (but parallelism doesn't need to be hard)
  • It's not OK to only understand one layer of the software stack (controversial?)
  • Need to tackle soft errors and security too


Multicore/Manycore: what can we expect from the software (slides)

Kathy Yelick, Director of NERSC (LBNL) & UC Berkeley (another really good speaker)

  • DRAM density growing more slowly than processors (doubles every 3 years rather than 2)
  • MPI working well for now but won't scale long term, certainly not for on-chip manycore
  • Strong vs. weak scaling (we've been relying on weak scaling but it's drying up)
  • Heterogeneity (in processors) will definitely happen
  • OpenMP and MPI combined can be hard, and can lead to the Amdahl's Law "trap" (see the first sketch after this list)
    • Not easy to express memory hierarchy in OpenMP
  • PGAS / DMA languages & autotuning for many & multicore
    • http://en.wikipedia.org/wiki/PGAS
    • Expresses memory hierarchy
    • E.g. UPC, X10, Titanium, Fortress, Chapel, Co-array Fortran
  • Things software should do:
    • Avoid unnecessary bandwidth use
    • Need to address "Little's Law" - have to have a lot of concurrency in flight to mask the latency to data (see the second sketch after this list)
      • Evidence that many apps aren't as BW limited as thought (use caches well etc)
    • Use novel hardware features through code generators / autotuning
    • Avoid unnecessary global synchronisation
      • E.g. PLASMA on shared memory, UPC on partitioned memory
    • Avoid unnecessary point to point communication
    • Software does have to deal with faults (unfortunately)
      • Introduces inhomogeneity in execution rates, error correction not instantaneous
    • Use good algorithms!
    • Avoid communication (expensive), not FLOPS (which are cheap)
      • Memory ops more expensive than FLOPS
      • Have to be careful about changes in the numerics though
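
As a first sketch, here is roughly what the hybrid MPI + OpenMP structure mentioned above looks like, and where the Amdahl's Law "trap" bites: only the loop inside the OpenMP region uses the extra cores, so everything outside it runs serially per process. This is my own minimal illustration, not code from the talk.

    /* Minimal hybrid MPI + OpenMP sketch (illustrative only, not from the talk).
       The parallel loop uses the cores within a node; the MPI_Allreduce and any
       other code outside the OpenMP region runs serially per process - if that
       part is significant, Amdahl's Law limits the speed-up. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local = 0.0, global = 0.0;

        /* Threaded part: this rank's slice of the work, spread over its cores. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; ++i)
            local += 1.0 / (1.0 + i + rank);

        /* Serial-per-rank part: the OpenMP threads sit idle here. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("%d ranks x %d threads, sum = %f\n",
                   nranks, omp_get_max_threads(), global);
        MPI_Finalize();
        return 0;
    }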

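And a second, tiny sketch of the Little's Law point: the concurrency (data in flight) you need is bandwidth times latency. The numbers below are my own illustrative picks, not figures from the talk.

    /* Little's Law: outstanding data = bandwidth x latency.
       Illustrative numbers only. */
    #include <stdio.h>

    int main(void)
    {
        double bandwidth = 100e9;    /* bytes/s we want to sustain from memory */
        double latency   = 100e-9;   /* seconds to service one access          */
        double in_flight = bandwidth * latency;            /* bytes in flight  */
        printf("Need ~%.0f bytes (~%.0f cache lines) outstanding\n",
               in_flight, in_flight / 64.0);
        return 0;
    }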

GPU session

Throughput computing: hardware basics (slides)

Justin Hensley, AMD

  • Need lots of compute just for current graphics in games
  • Graphics is a very data parallel problem
  • Latest AMD GPUs are 800-way parallel!
  • 1.2 TFLOPS single precision, 240 GFLOPS double precision, 115 GBytes/s memory bandwidth
    • More than the best Nvidia platforms, but AMD is currently lacking a good programming model
    • Should solve this when they make OpenCL available later this year
  • Said there will be an OpenCL compatible combined CPU+GPU this summer


HPC acceleration: challenge & opportunity in the multicore era

Nash Palaniswarmy, Throughput Computing, Intel

  • Says Intel is headed towards heterogeneous many core solutions
  • Intel has a language called Ct for data parallel computing (but it's proprietary)
  • OpenMP and Intel Thread Building Blocks (TBB) for task-level parallelism
  • Didn't really say anything new or exciting
  • As an aside, I get a consistent message out of Intel that their upcoming graphics architecture, Larrabee, will be focused only on graphics, even though one of its main benefits is that it's general-purpose many-core. Frustrating that they won't support users who want to try it for GPGPU-like compute, as Nvidia and AMD are doing


Parallel computing with CUDA (slides)

Massimiliano Fatica, Nvidia (a former colleague of mine from ClearSpeed)

  • Performance doubling roughly every 18 months
  • 240 cores in latest GPU, ~1 TFLOP single precision (currently only ~90 GFLOPS for double precision)
  • Supporting C for CUDA, OpenCL and Fortran via PGI (the latter being particularly interesting)
    • Works across Windows, Mac OS X and Linux
    • This is the widest language support of any current GPU by quite a long way
  • Over 60,000 CUDA SDK downloads so far
    • One of them being me
  • Up to 5.5 GBytes/s BW between CPU and GPU
  • Now supports asynchronous data transfer, so it can be overlapped with compute (see the sketch after this list)
  • Up to 200 GFLOPS on single precision FFTs now (CUFFT 2.3), up to 70 GFLOPS DGEMM
  • Have great results speeding up Molecular Dynamics codes such as Folding@Home
    • But only after the code had been completely re-written to take advantage of this kind of architecture, and Molecular Dynamics is a naturally good fit!
  • Some good results on Lattice Boltzmann
  • Hit 1.2 TFLOP LINPACK in 8 nodes (each with 2 GPUs)
    • This should improve by much more than 2X next year
  • Heterogeneous programming model is important to consider
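
As a rough sketch of the asynchronous transfer point above: with pinned host memory and CUDA streams, copies queued in one stream can overlap with kernels running in other streams. This is my own minimal illustration (the kernel, sizes and chunking are made up), not code from the talk.

    /* Overlapping host<->device copies with compute using CUDA streams.
       Illustrative sketch only. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale(float *x, int n, float a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main(void)
    {
        const int n = 1 << 20, nstreams = 4, chunk = n / nstreams;
        float *h, *d;
        cudaStream_t stream[4];

        cudaMallocHost((void **)&h, n * sizeof(float));  /* pinned, so copies can be async */
        cudaMalloc((void **)&d, n * sizeof(float));
        for (int i = 0; i < n; ++i) h[i] = 1.0f;
        for (int i = 0; i < nstreams; ++i) cudaStreamCreate(&stream[i]);

        for (int i = 0; i < nstreams; ++i) {
            size_t off = (size_t)i * chunk;
            cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, stream[i]);
            scale<<<(chunk + 255) / 256, 256, 0, stream[i]>>>(d + off, chunk, 2.0f);
            cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[i]);
        }
        for (int i = 0; i < nstreams; ++i) cudaStreamSynchronize(stream[i]);

        printf("h[0] = %f\n", h[0]);   /* expect 2.0 */

        for (int i = 0; i < nstreams; ++i) cudaStreamDestroy(stream[i]);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }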


Semantics consistent parallelism (slides)

Li-Yi Wei, Microsoft

  • Want to be able to harness multi-core and GPUs
  • Sequential consistency is about parallel algorithms giving the same results as a sequential algorithm
  • Not all algorithms need this though - e.g. lots of graphics problems, random number generation etc
  • "Similar" might be good enough and might give good benefits (by exploiting parallelism)
  • Pseudo-random number generators are naively sequential but can be parallelised (see the sketch after this list)
    • E.g. random hash (cryptographic?) across a linear sequence
  • Poisson disk sampling is an interesting application in this area
    • Samples computed in parallel but need to be uniformly distributed
    • Just discretise to a grid and compute cells in parallel (doesn't sound that clever to me?)
    • Can also use for image editing/morphing [Perez et al SIGGRAPH 2003]
    • Also Farbman et al SIGGRAPH 2009
    • I highly recommend checking this out - showed some good demos (movies available at the URL above)
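
A quick sketch of the "random hash across a linear sequence" idea from above: instead of stepping a sequential generator, each index is hashed independently, so every sample can be computed in parallel. The particular hash function below is my own illustrative pick, not the one from the talk.

    /* Counter-based random numbers: hash the index, no sequential state.
       Hash constants are an arbitrary 32-bit integer mixer (illustrative). */
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t hash32(uint32_t x)
    {
        x ^= x >> 16;  x *= 0x7feb352dU;
        x ^= x >> 15;  x *= 0x846ca68bU;
        x ^= x >> 16;
        return x;
    }

    int main(void)
    {
        /* On a GPU this loop would simply become one thread per index i. */
        for (uint32_t i = 0; i < 8; ++i) {
            double u = hash32(i) / 4294967296.0;   /* uniform in [0, 1) */
            printf("sample %u = %f\n", i, u);
        }
        return 0;
    }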


Hot Seat Session 2

HPC for everyone

Robert Murphy, open storage, Sun

  • In 2005 a rack had 84 cores and hit 500 GFLOPS
  • Today 9 TFLOPS per rack (18X)
  • 2011 172 TFLOPS per rack? (seems a bit high to me!)
  • CERN will be generating 27 TBytes/day from LHC when running live
  • Genome sequencers and microscopes generating 1 TByte per day per investigator
    • Potentially hundreds of thousands of these around the world
  • Data access time can dominate HPC workflow
  • Not that interesting a talk
  • Wouldn't answer questions about the future of Sun, their hardware etc


Energy-efficient computing: challenges and solutions

Raphael Wong, Supermicro

  • Have shipped more than 15,000 1U 4-way servers to US national labs
  • Shipped more than 500,000 servers last year
  • 50% of high-end medical systems in the world based on Supermicro
  • Can now put two twin socket servers in a 1U enclosure
  • First to provide twin GPU in 1U servers (I like the look of these a lot)
  • 93% efficient power supplies
    • 90% efficient at light load (25%)
    • This is quite unusual today and impressive
  • Can get around 12 TFLOPS per rack of blades today, doubling shortly
  • Looking at water cooling as an option to increase density further
    • When Supermicro do something it's generally a sign that it's ready for the mainstream, so this is an interesting development


New developments & trends in scale-out computing

Frank Baetke, HP Europe

  • HP systems account for over 42% of systems in the Top500 (207 are HP blade-based systems)
  • SE2210 2U servers will take several GPUs
    • 84 GPUs per 47(?)U rack
  • Optimising servers to improve airflow and reduce component counts
    • E.g. SL2x170z G6 newly announced
  • HP ProLiant z6000 chassis also looks very interesting
    • Does away with some of the fans, perforated metal skins to reduce weight and cost etc
    • Common power supplies and larger, more efficient fans for cooling
  • Expect cloud-like approaches to be adopted to provide HPC
    • I.e. put the datacentres in locations with cheap power, cooling, real estate etc.


Scalable architecture for the many-core era

Eng Lim Goh, SVP & CTO, sgi (one of the more interesting talks in this vendor section)

  • MPI collectives limit Fluent (CFD) scalability at 1024 cores
    • MPI_Allreduce (most costly), then MPI_Waitall, MPI_Reduce, MPI_Barrier, MPI_Recv
    • Spending more time communicating than computing at these scales (see the sketch after this list)
  • So looking at accelerating the comms
    • "Ultraviolet" project
  • Supporting global addressable memory up to 8 PetaBytes (!)
    • Based on Intel quad core (Nehalem)
  • MPI offload engine in their chipset hardware
    • Barrier, reduce, DMA gather/scatter in hardware for accelerating MPI
    • Get one per two x86 sockets
    • 53-bit addressing to support global addressable memory
  • Recommend a 2D torus to keep inter-node BW up
  • Looks really interesting, though proprietary (so probably expensive)
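
For reference, the pattern being described is roughly the one below - a local compute step followed by a global reduction every iteration - and timing the two separately is how you see the collective starting to dominate as the core count grows. This is my own generic illustration, not Fluent code.

    /* Compute step + global MPI_Allreduce per iteration, timed separately.
       The work loop and sizes are illustrative only. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t_comp = 0.0, t_comm = 0.0;
        for (int it = 0; it < 100; ++it) {
            double local = 0.0, global = 0.0;
            double t0 = MPI_Wtime();
            for (int i = 0; i < 1000000; ++i)            /* stand-in for the CFD work */
                local += 1e-9 * (double)(i % 7);
            double t1 = MPI_Wtime();
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            double t2 = MPI_Wtime();
            t_comp += t1 - t0;
            t_comm += t2 - t1;
        }
        if (rank == 0)
            printf("compute %.3f s, allreduce %.3f s\n", t_comp, t_comm);
        MPI_Finalize();
        return 0;
    }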


From PetaScale to Clouds - providing the ultimate networking solution for high performance

Michael Kagan, CTO, Mellanox

  • The new 1 PFLOP system at Juelich uses the latest 40 Gbps InfiniBand
    • Achieved 91.6% efficiency on LINPACK, one of the highest seen so far
  • Not very exciting talk
  • Mentioned http://www.hpcadvisorycouncil.com/


Myri-10G: 10-Gigabit Ethernet with a supercomputing heritage

Markus Fischer, Myricom

  • Basically they've plugged 10 GigE Ethernet PHYs (layer 1) into the Myrinet network protocols (layer 2)
  • Includes features for communication offload and kernel bypass operations
  • Uses the Lanai Z8ES 10GE NIC (PCI Express x8 with dual 10 GE ports)
    • Claim this is currently the fastest, lowest power, cheapest on market today
  • Have products compatible with IBM's blades
  • Can achieve 18.9 Gbits/s throughput with a 1,500 byte message size (MTU) - against a 20 Gbps peak, so a good result
    • Low power consumption: 3.3W for 10GE port
  • Have sophisticated, stateless firmware for comms offload, assisting the host CPUs
  • Believe 10GE will become much more popular in HPC


Enhanced scalability with QLogic InfiniBand

Philip Murphy, VP Engineering, QLogic

  • I have to confess I glazed over during this talk, not terribly exciting


High Performance, non-blocking reliable switching for HPC

Frank Laforsch, Force10

  • Their stuff is used in many of the fastest systems in the Top500, including the #1 system, Roadrunner
  • Designed for high reliability
  • Their products use 36W per 10GE port
  • Also not an exciting talk


NEC's future directions

Rudolf Fischer, CTO, NEC Germany

  • Said almost nothing about future directions
  • Believes vector is now ubiquitous (i.e. SSE in x86 CPUs) and that the classic style will also come back
    • I don't agree with vector coming back in the Cray/NEC classic style myself, don't think the economics make sense
    • But SIMD/SSE small vectors in x86 will keep growing, and GPUs are also vector like
  • Latest NEC vector machine is the SX-9
    • Has a 512-bit wide floating point register file
    • Also has 256 GBytes/s memory bandwidth
      • Even higher than a GPU, very impressive
    • Supports gather/scatter in hardware
  • Said he doesn't like GPUs because they are so loosely coupled to the host CPU
    • But integrating them should solve that I believe
  • Part of the PRACE Europe-wide supercomputer investigation with HLRS in Germany


That's all from the three days of ISC'09 - I hope these write-ups were interesting and useful. Do let me know if you liked them, or if you have any suggestions for future conference reports like this. Thanks for the feedback so far!


