ISC'09 day 3

International Supercomputing Conference 2009 (ISC'09)

ISC'09 is the second largest supercomputing conference on the calendar, and the largest outside the US.

The following notes are pretty much my live transcription as the conference unfolds, so please forgive any typos, unexplained acronyms etc. I hope you find them useful and/or interesting - please don't hesitate to get in touch if you have any questions!

Conference website.

See also day 1 and day 2 of the conference.


Last day of the conference. Apologies that these write-ups have backed up, but I wanted time to re-read them before sending!

First session: More Moore or Multi Trouble?

Multicore/Manycore: what should we demand from the hardware? (slides)

Yale Patt, University of Texas (an excellent speaker and widely regarded as a chip architecture guru)

  • Moore's Law in the future will mean doubling the number of cores on the chip? Maybe not
  • Processors will be heterogeneous
    • PentiumX/NiagaraY - i.e. a few sophisticated serial cores and lots of lightweight simple cores
  • Instruction Set Architectures (ISAs) aren't important anymore
  • 10nm is around 100 Angstroms so not many atoms in future transistors!
  • 3D (stacked) chips are giving the effect of larger (2D) chips
  • Processors started out as mostly logic for the core, but are now mostly cache
  • Interesting observation about hardware naturally being parallel - transistors all active concurrently etc
  • Still lots of improvement to come from branch predictors and cache architecture
    • Instruction Level Parallelism (ILP) is not dead or "done"
  • 50 billion transistors on chips soon (1-2B today on largest x86)
  • Transformation problem - from problem expressed in natural language to electrons moving on a chip
  • Power and bandwidth are blocking issues
  • With billions of transistors you can afford to have specialist cores lying around not being used very often but making a big difference when they're needed
    • Important they don't use much power when not in use though
  • People think sequentially (but parallelism doesn't need to be hard)
  • It's not OK to only understand one layer of the software stack (controversial?)
  • Need to tackle soft errors and security too


Multicore/Manycore: what can we expect from the software (slides)

Kathy Yelick, Director of NERSC (LBNL) & UC Berkeley (another really good speaker)

  • DRAM density growing more slowly than processors (doubles every 3 years rather than 2)
  • MPI working well for now but won't scale long term, certainly not for on-chip manycore
  • Strong vs. weak scaling (we've been relying on weak scaling but it's drying up)
  • Heterogeneity (in processors) will definitely happen
  • OpenMP and MPI combined can be hard, and can lead to the Amdahl's Law "trap" (see the first sketch after this list)
    • Not easy to express memory hierarchy in OpenMP
  • PGAS / DMA languages & autotuning for many & multicore
    • http://en.wikipedia.org/wiki/PGAS
    • Expresses memory hierarchy
    • E.g. UPC, X10, Titanium, Fortress, Chapel, Co-array Fortran
  • Things software should do:
    • Avoid unnecessary bandwidth use
    • Need to address "Little's Law" - have to have a lot of concurrency in flight to mask the latency to data (see the second sketch after this list)
      • Evidence that many apps aren't as BW limited as thought (use caches well etc)
    • Use novel hardware features through code generators / autotuning
    • Avoid unnecessary global synchronisation
      • E.g. PLASMA on shared memory, UPC on partitioned memory
    • Avoid unnecessary point to point communication
    • Software does have to deal with faults (unfortunately)
      • Introduces inhomogeneity in execution rates, error correction not instantaneous
    • Use good algorithms!
    • Avoid communication (expensive), not FLOPS (which are cheap)
      • Memory ops more expensive than FLOPS
      • Have to be careful about changes in the numerics though
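
As a first sketch, here is roughly what the hybrid MPI + OpenMP structure mentioned above looks like, and where the Amdahl's Law "trap" bites: only the loop inside the OpenMP region uses the extra cores, so everything outside it runs serially per process. This is my own minimal illustration, not code from the talk.

    /* Minimal hybrid MPI + OpenMP sketch (illustrative only, not from the talk).
       The parallel loop uses the cores within a node; the MPI_Allreduce and any
       other code outside the OpenMP region runs serially per process - if that
       part is significant, Amdahl's Law limits the speed-up. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local = 0.0, global = 0.0;

        /* Threaded part: this rank's slice of the work, spread over its cores. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; ++i)
            local += 1.0 / (1.0 + i + rank);

        /* Serial-per-rank part: the OpenMP threads sit idle here. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("%d ranks x %d threads, sum = %f\n",
                   nranks, omp_get_max_threads(), global);
        MPI_Finalize();
        return 0;
    }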

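And a second, tiny sketch of the Little's Law point: the concurrency (data in flight) you need is bandwidth times latency. The numbers below are my own illustrative picks, not figures from the talk.

    /* Little's Law: outstanding data = bandwidth x latency.
       Illustrative numbers only. */
    #include <stdio.h>

    int main(void)
    {
        double bandwidth = 100e9;    /* bytes/s we want to sustain from memory */
        double latency   = 100e-9;   /* seconds to service one access          */
        double in_flight = bandwidth * latency;            /* bytes in flight  */
        printf("Need ~%.0f bytes (~%.0f cache lines) outstanding\n",
               in_flight, in_flight / 64.0);
        return 0;
    }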

GPU session

Throughput computing: hardware basics (slides)

Justin Hensley, AMD

  • Need lots of compute just for current graphics in games
  • Graphics is a very data parallel problem
  • Latest AMD GPUs are 800-way parallel!
  • 1.2 TFLOPS single precision, 240 GFLOPS double precision, 115 GBytes/s memory bandwidth
    • More than the best Nvidia platforms, but AMD is currently lacking a good programming model
    • Should solve this when they make OpenCL available later this year
  • Said there will be an OpenCL compatible combined CPU+GPU this summer


HPC acceleration: challenge & opportunity in the multicore era

Nash Palaniswarmy, Throughput Computing, Intel

  • Says Intel is headed towards heterogeneous many core solutions
  • Intel has a language called Ct for data parallel computing (but it's proprietary)
  • OpenMP and Intel Thread Building Blocks (TBB) for task-level parallelism
  • Didn't really say anything new or exciting
  • As an aside, I get a consistent message out of Intel that their upcoming graphics architecture, Larrabee, will be focused only on graphics, even though one of its main benefits is that it's general-purpose many-core. Frustrating that they won't support users who want to try it for GPGPU-like compute, as Nvidia and AMD are doing


Parallel computing with CUDA (slides)

Massimiliano Fatica, Nvidia (a former colleague of mine from ClearSpeed)

  • Performance doubling roughly every 18 months
  • 240 cores in latest GPU, ~1 TFLOP single precision (currently only ~90 GFLOPS for double precision)
  • Supporting C for CUDA, OpenCL and Fortran via PGI (the latter being particularly interesting)
    • Works across Windows, Mac OS X and Linux
    • This is the widest language support of any current GPU by quite a long way
  • Over 60,000 CUDA SDK downloads so far
    • One of them being me
  • Up to 5.5 GBytes/s BW between CPU and GPU
  • Now supports asynchronous data transfer, so it can be overlapped with compute (see the sketch after this list)
  • Up to 200 GFLOPS on single precision FFTs now (CUFFT 2.3), up to 70 GFLOPS DGEMM
  • Have great results speeding up Molecular Dynamics codes such as Folding@Home
    • But only after the code had been completely re-written to take advantage of this kind of architecture, and Molecular Dynamics is a naturally good fit!
  • Some good results on Lattice Boltzmann
  • Hit 1.2 TFLOP LINPACK in 8 nodes (each with 2 GPUs)
    • This should improve by much more than 2X next year
  • Heterogeneous programming model is important to consider
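
As a rough sketch of the asynchronous transfer point above: with pinned host memory and CUDA streams, copies queued in one stream can overlap with kernels running in other streams. This is my own minimal illustration (the kernel, sizes and chunking are made up), not code from the talk.

    /* Overlapping host<->device copies with compute using CUDA streams.
       Illustrative sketch only. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale(float *x, int n, float a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main(void)
    {
        const int n = 1 << 20, nstreams = 4, chunk = n / nstreams;
        float *h, *d;
        cudaStream_t stream[4];

        cudaMallocHost((void **)&h, n * sizeof(float));  /* pinned, so copies can be async */
        cudaMalloc((void **)&d, n * sizeof(float));
        for (int i = 0; i < n; ++i) h[i] = 1.0f;
        for (int i = 0; i < nstreams; ++i) cudaStreamCreate(&stream[i]);

        for (int i = 0; i < nstreams; ++i) {
            size_t off = (size_t)i * chunk;
            cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, stream[i]);
            scale<<<(chunk + 255) / 256, 256, 0, stream[i]>>>(d + off, chunk, 2.0f);
            cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[i]);
        }
        for (int i = 0; i < nstreams; ++i) cudaStreamSynchronize(stream[i]);

        printf("h[0] = %f\n", h[0]);   /* expect 2.0 */

        for (int i = 0; i < nstreams; ++i) cudaStreamDestroy(stream[i]);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }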


Semantics consistent parallelism (slides)

Li-Yi Wei, Microsoft

  • Want to be able to harness multi-core and GPUs
  • Sequential consistency is about parallel algorithms giving the same results as a sequential algorithm
  • Not all algorithms need this though - e.g. lots of graphics problems, random number generation etc
  • "Similar" might be good enough and might give good benefits (by exploiting parallelism)
  • Pseudo-random number generators are naively sequential but can be parallelised (see the sketch after this list)
    • E.g. random hash (cryptographic?) across a linear sequence
  • Poisson disk sampling is an interesting application in this area
    • Samples computed in parallel but need to be uniformly distributed
    • Just discretise to a grid and compute cells in parallel (doesn't sound that clever to me?)
    • Can also use for image editing/morphing [Perez et al SIGGRAPH 2003]
    • Also Farbman et al SIGGRAPH 2009
    • I highly recommend checking this out - showed some good demos (movies available at the URL above)
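
A quick sketch of the "random hash across a linear sequence" idea from above: instead of stepping a sequential generator, each index is hashed independently, so every sample can be computed in parallel. The particular hash function below is my own illustrative pick, not the one from the talk.

    /* Counter-based random numbers: hash the index, no sequential state.
       Hash constants are an arbitrary 32-bit integer mixer (illustrative). */
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t hash32(uint32_t x)
    {
        x ^= x >> 16;  x *= 0x7feb352dU;
        x ^= x >> 15;  x *= 0x846ca68bU;
        x ^= x >> 16;
        return x;
    }

    int main(void)
    {
        /* On a GPU this loop would simply become one thread per index i. */
        for (uint32_t i = 0; i < 8; ++i) {
            double u = hash32(i) / 4294967296.0;   /* uniform in [0, 1) */
            printf("sample %u = %f\n", i, u);
        }
        return 0;
    }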


Hot Seat Session 2

HPC for everyone

Robert Murphy, open storage, Sun

  • In 2005 a rack had 84 cores and hit 500 GFLOPS
  • Today 9 TFLOPS per rack (18X)
  • 2011 172 TFLOPS per rack? (seems a bit high to me!)
  • CERN will be generating 27 TBytes/day from LHC when running live
  • Genome sequencers and microscopes generating 1 TByte per day per investigator
    • Potentially hundreds of thousands of these around the world
  • Data access time can dominate HPC workflow
  • Not that interesting a talk
  • Wouldn't answer questions about the future of Sun, their hardware etc


Energy-efficient computing: challenges and solutions

Raphael Wong, Supermicro

  • Have shipped more than 15,000 1U 4-way servers to US national labs
  • Shipped more than 500,000 servers last year
  • 50% of high-end medical systems in the world based on Supermicro
  • Can now put two twin socket servers in a 1U enclosure
  • First to provide twin GPU in 1U servers (I like the look of these a lot)
  • 93% efficient power supplies
    • 90% efficient at light load (25%)
    • This is quite unusual today and impressive
  • Can get around 12 TFLOPS per rack of blades today, doubling shortly
  • Looking at water cooling as an option to increase density further
    • When Supermicro do something it's generally a sign that it's ready for the mainstream, so this is an interesting development


New developments & trends in scale-out computing

Frank Baetke, HP Europe

  • HP systems account for over 42% of systems in the Top500 (207 are HP blade-based systems)
  • SE2210 2U servers will take several GPUs
    • 84 GPUs per 47(?)U rack
  • Optimising servers to improve airflow and reduce component counts
    • E.g. SL2x170z G6 newly announced
  • HP ProLiant z6000 chassis also looks very interesting
    • Does away with some of the fans, perforated metal skins to reduce weight and cost etc
    • Common power supplies and larger, more efficient fans for cooling
  • Expect cloud-like approaches to be adopted to provide HPC
    • I.e. put the datacentres in locations with cheap power, cooling, real estate etc.


Scalable architecture for the many-core era

Eng Lim Goh, SVP & CTO, sgi (one of the more interesting talks in this vendor section)

  • MPI collectives limit Fluent (CFD) scalability at 1024 cores
    • MPI_Allreduce (most costly), then MPI_Waitall, MPI_Reduce, MPI_Barrier, MPI_Recv
    • Spending more time communicating than computing at these scales (see the sketch after this list)
  • So looking at accelerating the comms
    • "Ultraviolet" project
  • Supporting global addressable memory up to 8 PetaBytes (!)
    • Based on Intel quad core (Nehalem)
  • MPI offload engine in their chipset hardware
    • Barrier, reduce, DMA gather/scatter in hardware for accelerating MPI
    • Get one per two x86 sockets
    • 53-bit addressing to support global addressable memory
  • Recommend a 2D torus to keep inter-node BW up
  • Looks really interesting, though proprietary (so probably expensive)
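
For reference, the pattern being described is roughly the one below - a local compute step followed by a global reduction every iteration - and timing the two separately is how you see the collective starting to dominate as the core count grows. This is my own generic illustration, not Fluent code.

    /* Compute step + global MPI_Allreduce per iteration, timed separately.
       The work loop and sizes are illustrative only. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t_comp = 0.0, t_comm = 0.0;
        for (int it = 0; it < 100; ++it) {
            double local = 0.0, global = 0.0;
            double t0 = MPI_Wtime();
            for (int i = 0; i < 1000000; ++i)            /* stand-in for the CFD work */
                local += 1e-9 * (double)(i % 7);
            double t1 = MPI_Wtime();
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            double t2 = MPI_Wtime();
            t_comp += t1 - t0;
            t_comm += t2 - t1;
        }
        if (rank == 0)
            printf("compute %.3f s, allreduce %.3f s\n", t_comp, t_comm);
        MPI_Finalize();
        return 0;
    }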


From PetaScale to Clouds - providing the ultimate networking solution for high performance

Michael Kagan, CTO, Mellanox

  • The new 1 PFLOP system at Juelich uses the latest 40 Gbps InfiniBand
    • Achieved 91.6% efficiency on LINPACK, one of the highest seen so far
  • Not very exciting talk
  • Mentioned http://www.hpcadvisorycouncil.com/


Myri-10G: 10-Gigabit Ethernet with a supercomputing heritage

Markus Fischer, Myricom

  • Basically they've plugged 10 GigE Ethernet PHYs (layer 1) into the Myrinet network protocols (layer 2)
  • Includes features for communication offload and kernel bypass operations
  • Uses the Lanai Z8ES 10GE NIC (PCI Express x8 with dual 10 GE ports)
    • Claim this is currently the fastest, lowest power, cheapest on market today
  • Have products compatible with IBM's blades
  • Can achieve 18.9 Gbits/s throughput with a 1,500 byte message size (MTU) - against a 20 Gbps peak, so a good result
    • Low power consumption: 3.3W for 10GE port
  • Have sophisticated, stateless firmware for comms offload, assisting the host CPUs
  • Believe 10GE will become much more popular in HPC


Enhanced scalability with QLogic InfiniBand

Philip Murphy, VP Engineering, QLogic

  • I have to confess I glazed over during this talk, not terribly exciting


High Performance, non-blocking reliable switching for HPC

Frank Laforsch, Force10

  • Their stuff is used in many of the fastest systems in the Top500, including the #1 system, Roadrunner
  • Designed for high reliability
  • Their products use 36W per 10GE port
  • Also not an exciting talk


NEC's future directions

Rudolf Fischer, CTO, NEC Germany

  • Said almost nothing about future directions
  • Believes vector is now ubiquitous (i.e. SSE in x86 CPUs) and that the classic style will also come back
    • I don't agree with vector coming back in the Cray/NEC classic style myself, don't think the economics make sense
    • But SIMD/SSE small vectors in x86 will keep growing, and GPUs are also vector like
  • Latest NEC vector machine is the SX-9
    • Has a 512-bit wide floating point register file
    • Also has 256 GBytes/s memory bandwidth
      • Even higher than a GPU, very impressive
    • Supports gather/scatter in hardware
  • Said he doesn't like GPUs because they are so loosely coupled to the host CPU
    • But integrating them should solve that I believe
  • Part of the PRACE Europe-wide supercomputer investigation with HLRS in Germany


That's all from the three days of ISC'09 - I hope these write-ups were interesting and useful. Do let me know if you liked them, or if you have any suggestions for future conference reports like this. Thanks for the feedback so far!


