ISC'09 day 1

International SuperComputing 2009 (ISC '09)

ISC'09 is the second largest supercomputing conference on the calendar, and the largest outside the US.

The following notes are pretty much my live transcription as the conference unfolds, so please forgive any typos, unexplained acronyms and so on. I hope you find this useful and/or interesting, and please don't hesitate to get in touch if you have any questions!

Conference website.

See also day 2 and day 3 of the conference.

ISC, now in its 24th year, is a conference that's been growing quite fast. Attendance is up from 647 in 2005 in Heidelberg to 1375 last year in Dresden to 1527 this year in Hamburg. The conference also had 120 exhibitors, both vendors and High Performance Computing (HPC) users, so it's now quite a large conference.

Speakers ranged from Andy Bechtolsheim, co-founder of Sun, who gave the keynote, to Thomas Sterling, one of HPC's stalwarts and an industry commentator.


Highlights from the latest Top500 list of the fastest supercomputers (slides)

Dr Eric Strohmaier (LBNL)

  • The top two systems stayed the same; both are in the US and are the only systems capable of over 1 PetaFLOP (one thousand trillion floating point operations per second!)
    • These systems have over 100,000 cores each
    • The fastest system is RoadRunner, a heterogeneous many-core system using IBM's Cell processor at Los Alamos (LANL)
  • Europe now has the third fastest system, 825 TFLOPS at Juelich in Germany with a system of nearly 300,000 cores (the most cores on the list)
  • Eight of the top ten systems are based in the US and are from IBM, Cray and Sun
  • The average power consumption of a top 10 system is 2.45 MegaWatts (!)
  • The number 10 system is 275 TFLOPS, again at Juelich in Germany (first time Europe has two in the top 10)
  • Saudi Arabia now has a system at #14, the highest ever showing from a Middle Eastern system
  • The fastest system in Asia is now in China, with a 180 TFLOP system in Shanghai
  • Only one vector processor-based system remains in the Top500 these days (the new NEC-based upgrade of the EarthSimulator)
  • Performance trends are remaining steady and on Moore's Law, even amid the current economic downturn
  • IBM and HP share most of the market between them
  • Industry's share of the Top500 keeps growing, though few of these systems are large (most are outside the top 50)
  • The UK has been growing particularly strongly in terms of the number of systems in the Top500
  • For the first time China has more systems in the top 500 than Japan
  • Almost all systems in the top 500 are clusters or Massively Parallel Processing (MPP) systems such as BlueGene
  • 53% of all 500 systems use Intel Xeon 54xx Harpertown quad core processors
  • Over 75% are various Intel quad core processors, nearly 50% if measured by performance
  • 77% of systems already have four cores per socket, 21% are still dual core
  • Eight core processors from Intel expected to show up in the next Top500 list
  • Interconnects are mostly Gigabit Ethernet (GigE) (~280 systems) and InfiniBand (IB) (~140 systems)
    • Although there is very little GigE in the top 50, where it's mostly IB
  • LINPACK efficiency ranges from 93% down to below 30% (the special purpose GRAPE system); a short worked example of these metrics follows this list
    • Lots of systems around 54% using GigE
    • IB systems tend to achieve around 80% efficiency
  • Power consumption:
    • Jaguar uses 7MW!
    • Only a few systems above 1MW
    • Most systems are still using several hundred KiloWatts though
    • Most power efficient systems are based on IBM's Cell (RoadRunner)
    • Special purpose systems are next most power efficient (GRAPE-DR)
    • PowerPC 450's are next most power efficient
    • Quad core systems then come in at around a couple of hundred MFLOPS per Watt
  • Many of the most power efficient systems are using IB as an interconnect
  • The slowest systems in the Top500 are now over 17 TFLOPS!

  • For more trends see the press release and performance development graphs.
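
A quick illustration of how the two metrics above are computed (LINPACK efficiency and MFLOPS per Watt). The numbers below are made up for a hypothetical ~1 PetaFLOP system, purely to show the arithmetic; they are not taken from the list:

```cpp
// Illustrative only: how LINPACK efficiency and power efficiency are derived.
// The figures below are hypothetical, not from the Top500 list.
#include <cstdio>

int main() {
    const double rmax_tflops  = 1000.0;  // measured LINPACK performance (hypothetical)
    const double rpeak_tflops = 1350.0;  // theoretical peak performance (hypothetical)
    const double power_mw     = 2.5;     // power consumption in MegaWatts (hypothetical)

    const double efficiency = rmax_tflops / rpeak_tflops;                    // ~74%
    const double mflops_per_watt = (rmax_tflops * 1e6) / (power_mw * 1e6);   // ~400

    std::printf("LINPACK efficiency: %.1f%%\n", efficiency * 100.0);
    std::printf("Power efficiency:   %.0f MFLOPS/Watt\n", mflops_per_watt);
    return 0;
}
```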


Keynote speech, "The path from Petaflops to Exaflops" (slides)

Andy Bechtolsheim, co-founder of Sun

  • HPC accounts for 30% of all server sales today and this fraction is growing
    • Most of the rest goes into the web industry
    • Classic "enterprise" computing has been shrinking
  • HPC will be worth $20bn (compute, storage and services) by 2012
  • 10 GigE and IB now shipping in volume
  • 100 PetaFLOP system expected in 2016, 1 ExaFLOP by 2020 (one million trillion floating point ops/s)
  • Need performance to double every year to 2020 to hit 1 ExaFLOP
  • Moore's Law staying on track to deliver twice as many transistors per device every two years
  • An 8nm process is expected by 2020, giving a 10 TFLOP socket (processor) in that year
    • Confirmed by Intel in a later talk
  • Expect ~160 cores per CPU at ~4GHz and 16 FLOPS per cycle per core by 2020
    • Also 2 TeraBytes/s bandwidth per socket by 2020
    • 500W per socket? Hmm...
  • That's aiming at 20 GFLOPS per watt by 2020
  • Would still need 50MW for the ExaFLOP system (!!!) - see the quick sanity check after this list
  • Expecting to use multi-chip 3D packaging
    • Already being used in consumer electronics, for example cell phone chips
  • Could also integrate fabric I/O, i.e. integrate router with the CPU
  • Expecting a combination of mesh and tree interconnect topology
  • Expecting 50 Gbps per lane in 2016, so 100 Gbps by 2020?
  • Believes Multi-Chip Module (MCM) packaging is the single biggest saving for power use
    • Could save 50% compared to server processors today
  • Microchannel fluidic heat sinks may be required (water cooling right on the chip)
  • Predicts all HPC systems will use water cooling in the future (it's the most power efficient way of cooling)
  • Expect to need 100 TB/s storage BandWidth (BW) by 2020
    • Will need solid state disks (SSDs)
      • Much lower power
      • Better for random access
      • More reliable - no moving parts
  • Average Selling Prices (ASPs) of flash memory are falling by nearly 50% per GigaByte per year.
    • Expecting flash to replace disk drives
    • Performance improving rapidly
    • Flash will be fast enough to do random IO
      • I.e. it will just become part of the memory hierarchy, just a larger, slower RAM
  • Expecting to need 16 million cores in the ExaFLOP system
    • In 100,000 sockets
  • Predicts smallest machine in Top500 in 2020 will be 10 PetaFLOP!
  • Clock rates are likely to stay relatively low at around 4GHz
    • This helps in terms of interconnect latencies
  • Mentioned GPU processors as one of the ways forward - "The jury is out"
    • The economic advantage of a mainstream market is essential, hence x86 and GPU are two frontrunners
  • Said HPC is growing as a market so remains attractive to Sun; Exascale data is important to Oracle too
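
Here's my own back-of-the-envelope check that the 2020 figures quoted above hang together (160 cores per socket, 4GHz, 16 FLOPS per cycle, 100,000 sockets, 20 GFLOPS per Watt); this is just arithmetic from the talk's numbers, not from the slides:

```cpp
// Back-of-the-envelope check of the 2020 projections quoted above.
#include <cstdio>

int main() {
    const double cores_per_socket = 160;      // ~160 cores per CPU
    const double clock_hz         = 4e9;      // ~4GHz
    const double flops_per_cycle  = 16;       // per core

    const double socket_flops = cores_per_socket * clock_hz * flops_per_cycle;
    std::printf("Per socket: %.2f TFLOPS\n", socket_flops / 1e12);   // ~10.24 TFLOPS

    const double sockets      = 100000;       // 100,000 sockets, i.e. 16 million cores
    const double system_flops = socket_flops * sockets;
    std::printf("System:     %.2f ExaFLOPS\n", system_flops / 1e18); // ~1.02 ExaFLOPS

    const double gflops_per_watt = 20;        // the 20 GFLOPS per Watt target
    const double watts = system_flops / (gflops_per_watt * 1e9);
    std::printf("Power:      %.0f MW\n", watts / 1e6);               // ~51 MW
    return 0;
}
```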


Implementation of a lattice-Boltzmann method for numerical fluid mechanics using the Nvidia CUDA technology (slides)

E. Riegel, T. Indenger, N.A. Adams, TU München, Institute of Aerodynamics

  • Computational fluid mechanics for incompressible flows (slower than Mach 0.3 at sea level, for example)
    • Spatial discretization by partitioning space into cells
  • Propagation and collision steps
  • Lattice-Boltzmann Method (LBM) is a very parallel algorithm so a good target for GPUs
    • Each cell can be computed independently
      • Can also use multiple GPUs for even greater performance
  • SunlightLB - an open source LBM code (D3Q15) for traditional CPUs written in C
    • They tried just porting this first
    • http://sunlightlb.sourceforge.net/
    • Expected a speed-up of 15X (64x64x64 voxels)
    • Actually only got a speed-up of 1.5X, ten times lower than expected
      • CPU 9.0 MVPS (million voxels per second), GPU 13.4 MVPS
    • GPU memory access patterns were the problem (see the memory layout sketch after this list)
  • So wrote a new LBM code from scratch targeting GPUs
    • LBultra
      • Written in C++
      • Supports D3Q15 fixed-refinement kernels for both CUDA and multi-core CPUs
      • Optimised memory access patterns
      • Reduced data transfer to the GPU by fusing propagation and collision phases
      • Used GPU's "shared" memory for explicit data caching
      • Ported the Cuda code back to a multi-core host version - interesting!
      • Achieved 9.3X speed-up of 1 GPU vs. 1 CPU
        • CPU 8.4 MVPS, GPU 78 MVPS, using three GPUs achieved 191 MVPS
      • Still not quite optimal memory access patterns or distribution code
      • Validated the port by simulating a common test case, flow around a sphere
      • Has added ability to use an adaptive mesh, though this isn't quite finished yet (warning!!!):
        • Node resolution is location dependent
        • CPU ~10 MVPS (40 GFLOPS), GPU ~400 MVPS (>900 GFLOPS, near peak performance)
        • HASN'T used SSE (SIMD) optimisations on the host so the host could go up to 4 or 8X faster too (i.e. host could get 160-320 MVPS vs. 400 MVPS for the GPU)
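
The "memory access patterns" point deserves unpacking: on a GPU, adjacent threads should read adjacent memory locations (coalesced access), which favours a structure-of-arrays layout for the D3Q15 distribution functions rather than the array-of-structures layout a typical CPU code uses. The indexing sketch below is my own illustration of the difference, not code from SunlightLB or LBultra:

```cpp
// Why memory layout matters for an LBM code on a GPU (my own illustration).
// D3Q15 stores 15 distribution function values (q = 0..14) per lattice cell.
#include <cstddef>

constexpr int Q = 15;  // D3Q15: 15 discrete velocities per cell

// Array-of-structures: all 15 values of one cell are contiguous.
// Natural on a CPU, but adjacent GPU threads (handling adjacent cells)
// then read addresses 15 elements apart - uncoalesced, hence slow.
inline std::size_t aos_index(std::size_t cell, int q) {
    return cell * Q + static_cast<std::size_t>(q);
}

// Structure-of-arrays: value q of adjacent cells is contiguous.
// Adjacent GPU threads read adjacent addresses - coalesced, hence fast.
inline std::size_t soa_index(std::size_t cell, int q, std::size_t ncells) {
    return static_cast<std::size_t>(q) * ncells + cell;
}
```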


A novel multiple walk parallel algorithm for the Barnes-Hut treecode on GPUs - towards cost-effective, high-performance N-body simulation (slides)

T. Hamada (Nagasaki), K. Nitadori (RIKEN), Japan

  • A strong believer in using GPUs for HPC
  • Want to use N-body for general purpose computing, not just astrophysics
    • E.g. Fluid dynamics (Smooth Particle Hydrodynamics, vortex method etc.)
    • Acoustics, electromagnetics (Boundary Element Method)
  • Research was using Nvidia GT200-based GPUs
  • Want to simulate large-scale cosmological systems
    • Billions of particles
  • Have 256 GPUs connected by a cheap GigE network
  • Running since May 2008
  • 1.5 billion particles using Barnes-Hut on 256 GPUs computed in 17 seconds per time step
  • Of course N-body methods are classically considered to be one of the main classes of algorithms
    • Such as the "View from Berkeley" including it as one of its seven algorithmic exemplars or "dwarfs"
  • Achieving about 450 GFLOPS on the latest GPUs
  • Processed one particle per thread on the GPU (up to 2,048 threads and thus particles per GPU)
    • This was a fairly simple, naive approach though (a sketch of the direct approach follows this list)
  • Really needed to take advantage of "cut-off" distances to reduce the amount of computation required (down from O(n^2) to O(n))
    • Their new "Multiple Walks" approach is much better (but they didn't explain very well how it works)
  • In April this year achieved 50 TFLOPS (single precision) on a large system (500 million particles) using 256 GPUs
    • About 124 MFLOPS per $ including host computer - this is very good indeed!
    • 2 GPUs per host computer
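
For reference, the brute-force O(n^2) direct summation that the treecode improves on looks like the sketch below; in a GPU implementation each iteration of the outer loop becomes one thread ("one particle per thread"). This is a generic sketch of my own, not the authors' multiple-walk code:

```cpp
// Generic direct-summation N-body gravity kernel, O(n^2), for reference only.
// In a GPU version each iteration of the outer loop maps to one thread.
#include <cmath>
#include <vector>

struct Particle { double x, y, z, mass; };

// ax, ay, az must already be sized to p.size(); eps2 is the softening squared,
// which avoids the singularity when two particles are very close.
void compute_accelerations(const std::vector<Particle>& p,
                           std::vector<double>& ax,
                           std::vector<double>& ay,
                           std::vector<double>& az,
                           double eps2) {
    const std::size_t n = p.size();
    for (std::size_t i = 0; i < n; ++i) {      // one particle ("thread") per i
        double axi = 0.0, ayi = 0.0, azi = 0.0;
        for (std::size_t j = 0; j < n; ++j) {  // sum the pull from every particle
            const double dx = p[j].x - p[i].x;
            const double dy = p[j].y - p[i].y;
            const double dz = p[j].z - p[i].z;
            const double r2 = dx * dx + dy * dy + dz * dz + eps2;
            const double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
            axi += p[j].mass * dx * inv_r3;    // units chosen so that G = 1
            ayi += p[j].mass * dy * inv_r3;
            azi += p[j].mass * dz * inv_r3;
        }
        ax[i] = axi; ay[i] = ayi; az[i] = azi;
    }
}
```

The Barnes-Hut treecode replaces the inner loop with a tree walk that approximates sufficiently distant groups of particles by a single multipole, which is where the big reduction in work comes from.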


Faster FAST: multicore acceleration of streaming financial data (slides)

Davide Pasetto et al, IBM

  • Message processing rates have exploded since 2006 when stock exchanges became more automated (exponential growth)
  • Around 1 million messages per second in 2008 requiring analytics to be calculated
  • Need the answer in milli- or even microseconds (otherwise might miss the deal)
    • Worth millions of dollars
    • But it's an arms race!
  • Financial institutions already adopting HPC technologies:
    • InfiniBand and 10 GigE
    • Low-latency OS-bypass communication protocols
    • Hardware accelerators (GPUs, FPGAs etc.)
    • Latest generation of multi-core processors
  • The financial institutions are struggling to test and validate massively parallel software
  • To keep latency low the financial datacentres tend to live in Wall St or Canary Wharf, with corresponding space and power supply and cooling limitations
  • The "ticker" plant:
    • Incoming live market, exchange and consolidated feeds
    • Decode and normalize this data
    • Analytics and data caching
    • Then distribute the results of this pre-processing to their users - traders, customers etc.
    • Want the microsecond latency from data arrival to results getting back to the user
  • This paper focused on Options Price Reporting Authority (OPRA) feeds from the US options exchanges
  • Could they do what was required using just off the shelf CPUs?
  • Messages peaking over 1 million per second
    • Distributed in a compressed format
    • Fastest growing data feed, growing exponentially
    • Existing solutions used FPGAs or multi-cores
  • Uses multi-cast technology
  • Also uses bit level encoding - the Most Significant Bit denotes whether this byte is the last byte of a field (see the decoding sketch after this list)
  • Fields are either unsigned integers or character strings
  • There is a reference decoder for OPRA
  • Built their own implementation of OPRA bottom up, optimizing the most important kernels
  • Did use assembly-level optimisation including SSE, intrinsics etc.
  • Got 3-4X speed-up vs. the reference decoder (I'm surprised it's not more actually)
  • Intel quad core achieved the highest performance
  • This was actually better than using FPGAs (though didn't show FPGA performance as a comparison - naughty)
  • In answer to a question from me they said FPGAs decode around 2.5 million messages per second, while a single core of an Intel Nehalem CPU should reach 4 million messages per second
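
The stop-bit field encoding mentioned above is easy to illustrate: each byte carries 7 bits of payload, and a set most-significant bit marks the final byte of the field. Below is a minimal sketch of decoding an unsigned integer field this way; it's my own illustration of the idea, not IBM's optimised decoder:

```cpp
// Minimal sketch of stop-bit decoding of an unsigned integer field
// (my own illustration, not IBM's optimised OPRA decoder).
// Each byte carries 7 payload bits; a set most-significant bit ends the field.
#include <cstdint>
#include <cstdio>

// Decodes one unsigned integer field starting at *p and advances p past it.
uint64_t decode_uint(const uint8_t*& p) {
    uint64_t value = 0;
    while (true) {
        const uint8_t byte = *p++;
        value = (value << 7) | (byte & 0x7F);  // append the 7 payload bits
        if (byte & 0x80) break;                // MSB set: this was the last byte
    }
    return value;
}

int main() {
    // 300 split into 7-bit groups is 0000010 0101100, so the encoded field is
    // 0x02 followed by 0xAC (0x2C with the stop bit set).
    const uint8_t field[] = {0x02, 0xAC};
    const uint8_t* p = field;
    std::printf("%llu\n", static_cast<unsigned long long>(decode_uint(p)));  // prints 300
    return 0;
}
```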


High Performance Computing for the simulation of large scale aircraft structures

Martin Kussner, Abaqus/3DS MD in Germany

  • Multi-scale modelling is the big thing in aero today
  • Movement towards more non-linear Finite Element (FE) analysis, partly driven by composite materials
  • 1M Degrees Of Freedom (DOF) used to be a big problem but not any more - 50M is a big problem today
  • A PRACE paper reported later has achieved 500M DOF...
  • Driven by need for:
    • Shorter design cycles
    • Improved performance and economy
  • Said a large system today is:
    • 10-20M DOF
    • 3-7M elements
    • 5,000-10,000 discrete fasteners
    • 2,000 composite layers
  • Described an implicit, direct solver using distributed memory as the worst case for parallelisation
  • Aiming for clusters of >1000 cores for Abaqus software in the future
  • Most users have 64-256 cores per simulation today where the software currently performs quite well


Achievement and future needs in HPC

Detlef Mueller-Wiesner, COO of EADS France

  • PRACE is a European supercomputer initiative to understand large-scale PetaScale systems
  • Want to eliminate the need for physical testing and rely solely on computer simulation of new aircraft
  • I.e. sell the first new airplane built, rather than keeping it just for testing!
  • Also want to be able to predict flight performance prior to the first flight
  • Authorities already accept simulations for an electromagnetic (EM) test of a change to an existing aircraft system
  • EADS believes it will need to increase its HPC performance by 100% per year
  • User interface is critical for simulations - how do the users interact with, understand and interpret the results?
  • Have CFD simulated an A380 in landing and take-off configurations, including ground effect and landing gear, all within 1% of the measured results, an amazing result!
  • Also big users of Fast Multipole Methods (FMM) for large electromagnetic simulations
    • O(n log n) vs. O(n^3) for traditional, more LINPACK-like methods
  • FMM can be used for EM, acoustics, vibration analysis, heat transfer and elasticity
  • "Supercomputing is innovation in action!"


High scalability multipole methods: solving half a billion unknowns (slides)

J. Mourino (not Jose) et al, Supercomputing Centre of Galicia, Spain

  • Want to be able to use high frequencies in electromagnetic simulations of large objects
    • Real car at 79GHz -> 400 million unknowns
    • These are frequencies used by in-car collision avoidance systems
  • Traditional solution is Method of Moments (MoM) - LINPACK-like
  • A newer method, the Fast Multipole Method (FMM), scales as O(n^3/2)
  • Multilevel FMM scales as O(n log n) but has poor scalability across many processors
  • Their new method is FMM-FFT
  • Full domain is divided into groups in a 3D circular convolution style
  • Uses the FFT to speed up the translation stage (see the sketch after this list)
  • Modern supercomputers are getting very good at doing large FFTs and scaling well
  • With this method a single global communication step is required at the end of the Matrix Vector Product
  • But this method uses lots of memory - further refinements have addressed this
  • HEMCUVE is the name of their code, written in C++
  • Needs 6 GBytes per core
  • Scaled really well to 1024 processors
  • Have a 2,580-core, 20 TByte, 16 TFLOP system to use for this
    • Called FinisTerrae
    • Uses Intel Itanium 2 Montvale with 64 GB per node (8 GB per core) - HP system rx7640 nodes
    • Hit an MPI limit of 2 GBytes per message
    • The only talk I saw at the whole conference that mentioned Intel's Itanium architecture
  • Still takes 30 hours to run an entire problem
  • Have simulated a Citroen C3 car at 24 GHz (radar frequency) which needed 40M unknowns
  • Also done at the 79GHz anti-collision frequency, needing 400M unknowns (10X more)
  • They're working on solving 1B unknowns using 2000 cores
  • They haven't measured peak performance so can't say much about how well they're really doing
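
The trick being exploited here is the convolution theorem: a circular convolution of N samples costs O(N^2) done directly, but only O(N log N) if you FFT both operands, multiply them pointwise, and transform back. The sketch below shows that idea on a small 3D grid using FFTW; it's purely illustrative and is not the HEMCUVE code (compile with -lfftw3):

```cpp
// Circular convolution of two 3D complex grids via the FFT - the core trick
// behind using FFTs in the FMM translation stage. Illustrative only.
#include <fftw3.h>
#include <cstdio>

int main() {
    const int n = 8;              // tiny grid per dimension, just for the demo
    const int N = n * n * n;      // total number of points

    fftw_complex* a = fftw_alloc_complex(N);  // e.g. source amplitudes per group
    fftw_complex* b = fftw_alloc_complex(N);  // e.g. translation operator samples
    fftw_complex* c = fftw_alloc_complex(N);  // result of the circular convolution

    fftw_plan fa = fftw_plan_dft_3d(n, n, n, a, a, FFTW_FORWARD,  FFTW_ESTIMATE);
    fftw_plan fb = fftw_plan_dft_3d(n, n, n, b, b, FFTW_FORWARD,  FFTW_ESTIMATE);
    fftw_plan bc = fftw_plan_dft_3d(n, n, n, c, c, FFTW_BACKWARD, FFTW_ESTIMATE);

    // Sample data: a delta at the origin for 'a' and a simple ramp for 'b',
    // so the convolution should just reproduce 'b'.
    for (int i = 0; i < N; ++i) {
        a[i][0] = (i == 0) ? 1.0 : 0.0;   a[i][1] = 0.0;
        b[i][0] = static_cast<double>(i); b[i][1] = 0.0;
    }

    fftw_execute(fa);                  // forward FFTs: O(N log N)
    fftw_execute(fb);

    for (int i = 0; i < N; ++i) {      // pointwise complex multiply in Fourier space
        const double re = a[i][0] * b[i][0] - a[i][1] * b[i][1];
        const double im = a[i][0] * b[i][1] + a[i][1] * b[i][0];
        c[i][0] = re; c[i][1] = im;
    }

    fftw_execute(bc);                  // inverse FFT, then normalise by N
    for (int i = 0; i < N; ++i) { c[i][0] /= N; c[i][1] /= N; }

    std::printf("c[1] = %.3f (expect 1.0: delta convolved with the ramp)\n", c[1][0]);

    fftw_destroy_plan(fa); fftw_destroy_plan(fb); fftw_destroy_plan(bc);
    fftw_free(a); fftw_free(b); fftw_free(c);
    return 0;
}
```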


Parallel scalable PDE-constrained optimisation: antenna identification in hyperthermia treatment planning (slides)

O. Schenk et al, University of Basel (collaborators at Purdue)

  • Many different problems are instances of trying to solve large-scale non-linear optimization
  • Will be using BlueGene/L #3 system and also their own 64-node Intel Xeon cluster
  • Aiming to solve systems of 1M to 1B variables/unknowns
  • The hyperthermia treatment they're looking at uses heat applied to a tumour at around 41-45 degrees C
  • Typically formulated as a PDE-constrained optimisation problem (a generic formulation is sketched after this list)
  • Inequality constraints:
    • State variables: temperature distribution
    • Control variable - EM antenna placement
  • Electrical field used to induce the heat, blood flow diffuses this away
  • Used NLP optimiser called IPOPT (open source C++, >5000 users) with linear equation solver PSPIKE
  • They've been able to scale up to 512 cores for their biomedical PDE-constrained optimisation
  • There was a lot of heated debate about whether they'd done this in a practical way
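
For readers unfamiliar with the term, a PDE-constrained optimisation problem of the kind described above can be written in the following generic form (this is my sketch of the standard formulation, not the authors' exact model):

```latex
\min_{y,\,u} \; J(y, u)
\quad \text{subject to} \quad
e(y, u) = 0, \qquad
y_{\min} \le y \le y_{\max}, \qquad
u \in U_{\mathrm{ad}},
```

where y is the state (the temperature distribution in the tissue), u the controls (the antenna configuration), e(y,u) = 0 the discretised heat equation with the electromagnetic source term and blood-flow cooling, and the bounds on y encode the 41-45 degrees C target for the tumour while protecting healthy tissue. The NLP solver (here IPOPT) then handles the resulting large, sparse nonlinear problem.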


