ISC'09 day 3
International Supercomputing Conference 2009 (ISC'09)
- ISC'09 is the second largest supercomputing conference on the calendar, and the largest outside the US.
The following notes are pretty much my live transcription as the conference unfolded, so please forgive any typos, unexplained acronyms and so on. I hope you find them useful and/or interesting - please don't hesitate to get in touch if you have any questions!
Conference website.
See also day 1 and day 2 of the conference.
Last day of the conference. Apologies that these write-ups have backed up but I wanted time to re-read them before sending!
First session: More Moore or Multi Trouble?
Multicore/Manycore: what should we demand from the hardware? (slides)
Yale Patt, University of Texas (an excellent speaker and widely regarded as a chip architecture guru)
- Will Moore's Law in the future just mean doubling the number of cores on a chip? Maybe not
- Processors will be heterogeneous
- PentiumX/NiagaraY - i.e. a few sophisticated serial cores plus lots of lightweight simple cores
- Instruction Set Architectures (ISAs) aren't important anymore
- 10nm is around 100 Angstroms so not many atoms in future transistors!
- 3D (stacked) chips are giving the effect of larger (2D) chips
- Processors started out as mostly logic for the core, but now mostly cache
- Interesting observation about hardware naturally being parallel - transistors all active concurrently etc
- Still lots of improvement to come from branch predictors and cache architecture
- Instruction Level Parallelism (ILP) is not dead or "done"
- 50 billion transistors on chips soon (1-2B today on largest x86)
- Transformation problem - from problem expressed in natural language to electrons moving on a chip
- Power and bandwidth are blocking issues
- With billions of transistors you can afford to have specialist cores lying around not being used very often but making a big difference when they're needed
- Important they don't use much power when not in use though
- People think sequentially (but parallelism doesn't need to be hard)
- It's not OK to only understand one layer of the software stack (controversial?)
- Need to tackle soft errors and security too
Multicore/Manycore: what can we expect from the software? (slides)
Kathy Yelick, Director of NERSC, LBNL & UC Berkeley (another really good speaker)
- DRAM density growing more slowly than processors (doubles every 3 years rather than 2)
- MPI is working well for now but won't scale long term, certainly not to on-chip manycore
- Strong vs. weak scaling (we've been relying on weak scaling but it's drying up)
- Heterogeneity (in processors) will definitely happen
- Combining OpenMP and MPI can be hard, and can lead to the Amdahl's Law "trap" (see the hybrid sketch below)
- Not easy to express memory hierarchy in OpenMP
- PGAS / DMA languages & autotuning for many & multicore
- http://en.wikipedia.org/wiki/PGAS
- Expresses memory hierarchy
- E.g. UPC, X10, Titanium, Fortress, Chapel, Co-array Fortran
- Things software should do:
- Avoid unnecessary bandwidth use
- Need to address "Little's Law" - you have to have a lot of concurrency in flight to mask the latency to data (worked example below)
- Evidence that many apps aren't as BW limited as thought (use caches well etc)
- Use novel hardware features through code generators / autotuning
- Avoid unnecessary global synchronisation
- E.g. PLASMA on shared memory, UPC on partitioned memory
- Avoid unnecessary point to point communication
- Software does have to deal with faults (unfortunately)
- Introduces inhomogeneity in execution rates, error correction not instantaneous
- Use good algorithms!
- Avoid communication (expensive), not FLOPS (which are cheap)
- Memory ops more expensive than FLOPS
- Have to be careful about changes in the numerics though
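To make the hybrid MPI+OpenMP point a bit more concrete, here's a minimal sketch of my own (not from Kathy's slides): the loop speeds up with the number of OpenMP threads, but the MPI reduction runs on a single thread per process, and it's that serial fraction that the Amdahl's Law "trap" refers to as thread counts grow.

```
/* Minimal hybrid MPI+OpenMP sketch (my own illustration, not from the talk).
 * The OpenMP loop scales with threads; the MPI_Reduce below runs on one
 * thread per process, so its share of the runtime does not shrink -
 * that's the Amdahl's Law "trap" for hybrid codes. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0, global = 0.0;

    /* parallel part: scales with the number of OpenMP threads */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; ++i) {
        double x = (double)i / N;
        local += x * x;
    }

    /* serial-per-process part: only one thread communicates */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```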
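Here's also a quick back-of-the-envelope version of the Little's Law point above - the latency and bandwidth numbers are illustrative assumptions of mine, not figures from the talk: the concurrency (bytes in flight) needed to keep the memory system busy is simply bandwidth times latency.

```
/* Little's Law sketch: concurrency needed = bandwidth x latency.
 * The latency and bandwidth figures below are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double latency_s  = 100e-9;  /* assume ~100 ns memory latency        */
    double bandwidth  = 20e9;    /* assume ~20 GB/s of memory bandwidth  */
    double line_bytes = 64.0;    /* typical cache-line size              */

    double bytes_in_flight = bandwidth * latency_s;
    double lines_in_flight = bytes_in_flight / line_bytes;

    printf("~%.0f bytes (~%.0f cache lines) must be in flight to hide latency\n",
           bytes_in_flight, lines_in_flight);
    return 0;
}
```

With those assumed numbers that's roughly 2 KB, or about 31 cache lines, outstanding at any one time - which is why caches, prefetching and lots of concurrent threads matter so much.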
GPU session
Throughput computing: hardware basics (slides)
Justin Hensley, AMD
- Need lots of compute just for current graphics in games
- Graphics is a very data parallel problem
- Latest AMD GPUs are 800-way parallel!
- 1.2 TFLOPS single precision, 240 GFLOPS double precision, 115 GBytes/s memory bandwidth
- More than the best Nvidia platforms, but AMD is currently lacking a good programming model
- Should solve this when they make OpenCL available later this year
- Said there will be an OpenCL compatible combined CPU+GPU this summer
HPC acceleration: challenge & opportunity in the multicore era
Nash Palaniswamy, Throughput Computing, Intel
- Says Intel is headed towards heterogeneous many core solutions
- Intel has a language called Ct for data parallel computing (but it's proprietary)
- OpenMP and Intel Thread Building Blocks (TBB) for task-level parallelism
- Didn't really say anything new or exciting
- As an aside, I get a consistent message from Intel that their upcoming graphics architecture, Larrabee, will be focused only on graphics, even though one of its main benefits is that it's a general-purpose many-core design. It's frustrating that they won't support users who want to try it for GPGPU-style compute, as Nvidia and AMD are doing
Parallel computing with CUDA (slides)
Massimiliano Fatica, Nvidia (a former colleague of mine from ClearSpeed)
- Performance doubling roughly every 18 months
- 240 cores in latest GPU, ~1 TFLOP single precision (currently only ~90 GFLOPS for double precision)
- Supporting C for CUDA, OpenCL and Fortran via PGI (the latter being particularly interesting)
- Works across Windows, Mac OS X and Linux
- This is the widest language support of any current GPU by quite a long way
- Over 60,000 CUDA SDK downloads so far
- Up to 5.5 GBytes/s BW between CPU and GPU
- Now supports asynchronous data transfers, so they can be overlapped with compute (see the sketch below)
- Up to 200 GFLOPS on single precision FFTs now (CUFFT 2.3), up to 70 GFLOPS DGEMM
- Have great results speeding up Molecular Dynamics codes such as Folding@Home
- But only after the code had been completely re-written to take advantage of this kind of architecture, and Molecular Dynamics is a naturally good fit!
- Some good results on Lattice Boltzmann
- Hit 1.2 TFLOPS on LINPACK across 8 nodes (each with 2 GPUs)
- This should improve by much more than 2X next year
- Heterogeneous programming model is important to consider
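Since the asynchronous transfer support is one of the more practically useful bits, here's a minimal CUDA sketch of the pattern - my own toy code with an illustrative kernel, not Nvidia's sample: the data is split into chunks, and each chunk's copies and kernel are issued into their own stream, so the copy for one chunk can overlap with the compute for another. Pinned host memory (from cudaMallocHost) is needed for the copies to be genuinely asynchronous.

```
/* Minimal sketch of overlapping transfers and compute with CUDA streams.
 * Toy example of my own; error checking omitted for brevity. */
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                      /* trivial stand-in for real compute */
}

int main(void)
{
    const int n = 1 << 20, chunks = 4, chunk = n / chunks;
    float *h, *d;
    cudaStream_t s[chunks];

    cudaMallocHost((void **)&h, n * sizeof(float));  /* pinned host memory,      */
    cudaMalloc((void **)&d, n * sizeof(float));      /* needed for async copies  */
    for (int i = 0; i < chunks; ++i) cudaStreamCreate(&s[i]);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    for (int i = 0; i < chunks; ++i) {
        size_t off = (size_t)i * chunk;
        /* copy chunk i while earlier chunks are still being processed */
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk, 2.0f);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);               /* expect 2.0 */

    for (int i = 0; i < chunks; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```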
Semantics consistent parallelism (slides)
Li-Yi Wei, Microsoft
- Want to be able to harness multi-core and GPUs
- Sequential consistency is about parallel algorithms giving the same results as a sequential algorithm
- Not all algorithms need this though - e.g. lots of graphics problems, random number generation etc
- "Similar" might be good enough and might give good benefits (by exploiting parallelism)
- Pseudo-random number generators are naturally sequential but can be made parallel
- E.g. apply a random hash (cryptographic?) across a linear sequence of indices (see the sketch below)
- Poisson disk sampling is an interesting application in this area
- Samples computed in parallel but need to be uniformly distributed
- Just discretise to a grid and compute cells in parallel (doesn't sound that clever to me?)
- Can also use for image editing/morphing [Perez et al SIGGRAPH 2003]
- Also Farbman et al SIGGRAPH 2009
- I highly recommend checking this out:
- Showed some good demos! (Movies available at the URL above)
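The "random hash across a linear sequence" idea is easy to sketch in C - the mixing function below is a generic integer hash (Wang-style) I've picked for illustration, not necessarily the one Li-Yi uses: because sample i depends only on the index i, every sample can be computed independently, so the loop is trivially parallel on multicore or a GPU.

```
/* Hash-based "random" sequence: element i is a function of i alone,
 * so elements can be generated in any order, in parallel. */
#include <stdint.h>
#include <stdio.h>

static uint32_t hash_u32(uint32_t x)   /* Wang-style integer mixing function */
{
    x = (x ^ 61u) ^ (x >> 16);
    x *= 9u;
    x ^= x >> 4;
    x *= 0x27d4eb2du;
    x ^= x >> 15;
    return x;
}

int main(void)
{
    /* each iteration depends only on i - no sequential generator state */
    for (uint32_t i = 0; i < 8; ++i) {
        double r = hash_u32(i) / 4294967296.0;   /* map to [0,1) */
        printf("sample %u = %f\n", i, r);
    }
    return 0;
}
```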
Hot Seat Session 2
HPC for everyone
Robert Murphy, Open Storage, Sun
- In 2005 a rack had 84 cores and hit 500 GFLOPS
- Today 9 TFLOPS per rack (18X)
- 2011: 172 TFLOPS per rack? (seems a bit high to me!)
- CERN will be generating 27 TBytes/day from LHC when running live
- Genome sequencers and microscopes generating 1 TByte per day per investigator
- Potentially hundreds of thousands of these around the world
- Data access time can dominate HPC workflow
- Not that interesting a talk
- Wouldn't answer questions about the future of Sun, their hardware etc
Energy-efficient computing: challenges and solutions
Raphael Wong, Supermicro
- Have shipped more than 15,000 1U 4-way servers to US national labs
- Shipped more than 500,000 servers last year
- 50% of high-end medical systems in the world based on Supermicro
- Can now put two twin socket servers in a 1U enclosure
- First to provide twin GPU in 1U servers (I like the look of these a lot)
- 93% efficient power supplies
- 90% efficient even at light load (25% of rated load)
- This is quite unusual today and impressive
- Can get around 12 TFLOPS per rack of blades today, doubling shortly
- Looking at water cooling as an option to increase density further
- When Supermicro do something it's generally a sign that it's ready for the mainstream, so this is an interesting development
New developments & trends in scale-out computing
Frank Baetke, HP Europe
- HP systems account for over 42% of systems in the Top500 (207 are HP blade-based systems)
- SE2210 2U servers will take several GPUs
- Optimising servers to improve airflow and reduce component counts
- E.g. SL2x170z G6 newly announced
- HP ProLiant z6000 chassis also looks very interesting
- Does away with some of the fans, perforated metal skins to reduce weight and cost etc
- Common power supplies and larger, more efficient fans for cooling
- Expect cloud-like approaches to be adopted to provide HPC
- I.e. put the datacentres in locations with cheap power, cooling, real estate etc.
Scalable architecture for the many-core era
Eng Lim Goh, SVP & CTO, sgi (one of the more interesting talks in this vendor section)
- MPI collectives limit Fluent (CFD) scalability at 1024 cores (see the sketch below)
- MPI_Allreduce (the most costly), then Waitall, Reduce, Barrier and Recv
- Spending more time communicating than computing at these scales
- So looking at accelerating the comms
- Supporting global addressable memory up to 8 PetaBytes (!)
- Based on Intel quad core (Nehalem)
- MPI offload engine in their chipset hardware
- Barrier, reduce, DMA gather/scatter in hardware for accelerating MPI
- Get one per two x86 sockets
- 53-bit addressing to support global addressable memory
- Recommend a 2D torus to keep inter-node BW up
- Looks really interesting, though proprietary (so probably expensive)
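To make the MPI_Allreduce point concrete, here's a minimal sketch (mine, not sgi's) of the kind of call their offload engine is targeting: a solver-style loop that ends every iteration with a blocking all-reduce of a residual, whose cost grows with the number of ranks.

```
/* Minimal sketch (not sgi's code) of the collective their offload engine
 * targets: a solver loop ending each iteration in a blocking MPI_Allreduce. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double residual = 1.0;
    for (int iter = 0; iter < 100 && residual > 1e-6; ++iter) {
        double local = residual / (rank + 2.0);   /* stand-in for local work */

        /* every rank blocks here; at ~1024 ranks this latency can dominate,
         * which is what hardware offload of the collective addresses */
        MPI_Allreduce(&local, &residual, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        residual /= size;                          /* keep the toy loop converging */
    }

    if (rank == 0)
        printf("final residual = %g\n", residual);

    MPI_Finalize();
    return 0;
}
```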
From PetaScale to Clouds - providing the ultimate networking solution for high performance
Michael Kagan, CTO, Mellanox
- The new 1 PFLOP system at Juelich uses the latest 40 Gbps InfiniBand
- Achieved 91.6% efficiency on LINPACK, one of the highest efficiencies achieved so far
- Not very exciting talk
- Mentioned http://www.hpcadvisorycouncil.com/
Myri-10G: 10-Gigabit Ethernet with a supercomputing heritage
Markus Fischer, Myricom
- Basically they've plugged 10-Gigabit Ethernet PHYs (layer 1) onto the Myrinet network protocols (layer 2)
- Includes features for communication offload and kernel bypass operations
- Uses the Lanai Z8ES 10GE NIC (PCI Express x8 with dual 10 GE ports)
- Claim this is currently the fastest, lowest-power, cheapest option on the market today
- Have products compatible with IBM's blades
- Can achieve 18.9 Gbits/s throughput with a 1,500 byte message size (MTU) - 20Gbps peak so a good result
- Low power consumption: 3.3W for 10GE port
- Have sophisticated, stateless firmware for comms offload, assisting the host CPUs
- Believe 10GE will become much more popular in HPC
Enhanced scalability with QLogic InfiniBand
Philip Murphy, VP Engineering, QLogic
- I have to confess I glazed over during this talk, not terribly exciting
High Performance, non-blocking reliable switching for HPC
Frank Laforsch, Force10
- Their stuff is used in many of the fastest systems in the Top500, including the #1 system, Roadrunner
- Designed for high reliability
- Their products use 36W per 10GE port
- Also not an exciting talk
NEC's future directions
Rudolf Fischer, CTO, NEC Germany
- Said almost nothing about future directions
- Believes vector is now ubiquitous (i.e. SSE in x86 CPUs) and classic style will also come back
- I don't agree with vector coming back in the Cray/NEC classic style myself, don't think the economics make sense
- But SIMD/SSE small vectors in x86 will keep growing, and GPUs are also vector like
- Latest NEC vector machine is the SX-9
- Has a 512-bit wide floating point register file
- Also has 256 GBytes/s memory bandwidth
- Even higher than a GPU, very impressive
- Supports gather/scatter in hardware
- Said he doesn't like GPUs because they are so loosely coupled to the host CPU
- But integrating them should solve that I believe
- Part of the PRACE Europe-wide supercomputer investigation with HLRS in Germany
That's all from the three days of ISC'09 - I hope these write-ups were interesting and useful. Do let me know if you liked them or if you have any suggestions for future conference reports. Thanks for the feedback so far!