SC09 Conference Day 1

 

Keynote address – Justin Rattner, Intel

 

o   22% paper selection rate this year

o   Exhibition space sold out, even in the current economic conditions

o   HPC business growth (CAGR) is projected to average around 3-4% over the next few years (very modest)

o   HPC doesn't support its own R&D (and never has)

o   Government funding (not healthy for the long term)

o   Must instead leverage high-volume markets – e.g. by using commodity processors etc.

o   HPC needs a "killer app"

o   Is the future of HPC the 3D web? (Huh?)

o   Continuously simulated

o   Multi-view 3D animated

o   Immersive and collaborative

o   Intel has around 40,000 servers it uses during its chip developments

o   www.sciencesim.com is one example of a virtual world for immersive science – based on OpenSim distributed simulator

o   Want objects in a 3D web to use real physics models and to generate true physical sounds – all highly computationally expensive, and far from possible in real time today

o   Showed off several systems using the newly announced Nehalem-EX 8 core CPUs

o   Mentioned the "battle" between CPUs and GPUs for extreme computing

o   Described Intel's Larrabee many-core processor as a "computational co-processor" – up to now Intel has been strict about calling it a GPU only

o   Showed a live demo running single precision matrix multiply (SGEMM) at around 380 GFLOPS (4Kx4K matrices, larger than the caches) using half the cores on the chip (why only half?), then at around 700 GFLOPS using all the Larrabee cores – see the quick flop-count sketch after this list

o   Still less than most other GPUs though

o   Then showed a sparse matrix computation – SpMVM on large (>50k x 50k), very sparse matrices (FEM & QCD) – hitting around 8 GFLOPS, which isn't that impressive either. Claimed it was very straightforward code though

o   Pushed Ct (C for throughput) as a parallel programming model

o   Can run Ct heterogeneously across multi-core CPUs and Larrabee GPUs

o   Showed a block diagram of a Xeon-Larrabee hybrid processor

o   Uses nested vectors

o   Lots of demos of Ct on the Intel booth apparently

o   Also talked about sharing memory spaces between CPUs and GPUs

o   Sharing at the page level

o   M-Y-O – the page is Mine, Yours or Ours

o   Showed a live demo of this that was underwhelming – it looked like something from the 80s

o   "Nothing more important to the long-term health of the HPC industry than the 3D web" – that's a big and controversial claim

o   All in all it felt like an uninspiring and commercially quite cynical keynote

o   Finished by showing an overclocked Larrabee hitting 1 TFLOPS on SGEMM
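
o   A quick sanity check on those SGEMM numbers (my own back-of-the-envelope arithmetic in Python, assuming the standard 2*N^3 flop count and that the quoted rates are sustained):

    # Work and implied runtime for the SGEMM demo figures quoted above.
    # Assumes the conventional 2*N^3 flop count for an N x N matrix multiply
    # and that the quoted GFLOPS rates are sustained for the whole run.
    N = 4096                      # 4K x 4K matrices, as in the demo
    gflop = 2 * N**3 / 1e9        # ~137 GFLOP of work per multiply

    for label, rate in [("half the cores", 380),
                        ("all cores", 700),
                        ("overclocked", 1000)]:
        ms = gflop / rate * 1000
        print(f"{label:>14}: {rate:4d} GFLOPS -> {ms:5.0f} ms per 4Kx4K SGEMM")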

 

Increasing Memory Miss Tolerance for SIMD Cores

 

David Tarjan, Jiayuan Meng and Kevin Skadron Department of Computer Science University of Virginia, Charlottesville

o   Neat idea to let PEs in a SIMD array run out of sync when some cores block but others can proceed (a toy sketch of the effect follows this list)

o   Claim: "Diverge on miss can either increase the performance of a given design by up to a factor of 3.14 for a single warp per core, or reduce the number of warps per core needed to sustain a given level of performance from 16 to 2 warps, reducing the area per core by 35%."

o   David Tarjan has now been hired by Nvidia!
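
o   A toy model of why this helps (my own illustration, not the authors' simulator): in lockstep SIMD every memory access costs the warp the latency of its slowest lane, whereas if lanes are allowed to slip past each other the stalls overlap. The version below only re-converges at the very end, so it is an upper bound on the benefit:

    import random
    random.seed(0)

    WARP, ACCESSES = 32, 1000      # lanes per warp, memory accesses per lane
    HIT, MISS = 1, 200             # cycles for a cache hit vs. a DRAM miss
    P_MISS = 0.02                  # per-access miss probability (made up)

    lat = [[MISS if random.random() < P_MISS else HIT for _ in range(ACCESSES)]
           for _ in range(WARP)]

    # Lockstep SIMD: each access costs the warp its slowest lane's latency.
    lockstep = sum(max(lat[l][a] for l in range(WARP)) for a in range(ACCESSES))
    # Fully decoupled lanes, re-converging only at the end (idealised bound).
    diverge = max(sum(lane) for lane in lat)

    print(f"lockstep: {lockstep} cycles, diverge-on-miss bound: {diverge} cycles, "
          f"ratio {lockstep / diverge:.2f}x")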

 

Triangular Matrix Inversion on Graphics Processing Unit

 

Florian Ries, Tommaso De Marco, Matteo Zivieri, Roberto Guerrieri ARCES – University of Bologna

o   First triangular matrix inversion on GPUs

o   Good description in the paper

o   Uses a recursive triangular matrix inversion technique (based on work by Strassen, and by Balle and Hansen) – a minimal sketch of the recursion follows this list

o   Highly parallel and recursive

o   SIMD friendly

o   Uses a fractal segmentation of the triangular matrices into square matrices with dimensions m·2^k

o   Have many sub-blocks that can then be processed in parallel

o   Tried this on an Nvidia GTX 280 (the latest and greatest available right now)

o   Made some clever use of tiling with pre-calculated LUTs for sub-addresses in the GPU constant cache

o   Achieved around a 13x speedup vs. the best performance on a fast Intel quad-core CPU (Core i7-975), even when using double precision (which is a lot slower than single precision on the GPUs they were using)

o   When the matrices are around 8192 in size, hit 54 GFLOPS in double precision, 149 GFLOPS in single precision (must be mostly bandwidth limited rather than compute limited)

o   This is 70% of double precision peak performance, very good

o   Can use two GPUs at once by simultaneously performing the L inversion on one GPU and the U inversion on the other GPU

o   Hit up to 90 GFLOPS in double precision for this
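
o   The heart of the recursive scheme is the standard 2x2 block identity for triangular inverses: for lower-triangular L = [[A, 0], [B, C]], L^-1 = [[A^-1, 0], [-C^-1 B A^-1, C^-1]]. A minimal numpy sketch of that recursion (my own illustration, not the authors' code – their GPU version maps the two block multiplies onto tiled SIMD kernels using the m·2^k segmentation):

    import numpy as np

    def tri_inv_lower(L, base=64):
        """Block-recursive inverse of a lower-triangular matrix."""
        n = L.shape[0]
        if n <= base:
            return np.linalg.inv(L)        # small base case handled directly
        k = n // 2
        A, B, C = L[:k, :k], L[k:, :k], L[k:, k:]
        Ai, Ci = tri_inv_lower(A, base), tri_inv_lower(C, base)
        out = np.zeros_like(L)
        out[:k, :k] = Ai
        out[k:, k:] = Ci
        out[k:, :k] = -Ci @ (B @ Ai)       # off-diagonal block: -C^-1 B A^-1
        return out

    # Quick check against numpy's general-purpose inverse
    L = np.tril(np.random.rand(512, 512)) + 512 * np.eye(512)
    assert np.allclose(tri_inv_lower(L), np.linalg.inv(L))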

 

Auto-Tuning 3-D FFT Library for CUDA GPUs

 

Akira Nukada, Satoshi Matsuoka, Tokyo Institute of Technology

o   Some very nice work – here's the abstract:

o   "Existing implementations of FFTs on GPUs are optimized for specific transform sizes like powers of two, and exhibit unstable and peaky performance, i.e., do not perform as well in other sizes that appear in practice. Our new auto-tuning 3-D FFT on CUDA generates high performance CUDA kernels for FFTs of varying transform sizes, alleviating this problem. Although auto-tuning has been implemented on GPUs for dense kernels such as DGEMM and stencils, this is the first instance that has been applied comprehensively to bandwidth intensive and complex kernels such as 3-D FFTs. Bandwidth intensive optimizations such as selecting the number of threads and inserting padding to avoid bank conflicts on shared memory are systematically applied. Our resulting auto-tuner is fast and results in performance that essentially beats all 3-D FFT implementations on a single processor to date, and moreover exhibits stable performance irrespective of problem sizes or the underlying GPU hardware."

o   Autotuning of kernels to hardware variants is an important trend in heterogeneous computing

o   Has to select radices and ordering of the FFT kernels

¤  E.g. do a 240-point FFT as steps with radices 4, 4, 3, 5 (the search over such factorisations is sketched after this list)

o   Select number of threads

o   Avoid DRAM memory bank conflicts by using appropriate padding

o   Have results on an Nvidia GTX285 and an Intel Core i7 (Nehalem)

o   Up to around 160 GFLOPS for 256^3 single precision 3D FFT (fast)

o   Better even than Nvidia's own FFT library for CUDA

o   Much faster than quad core host CPUs

o   The autotuning itself took up to a minute to run (not that long, I thought)

o   Using around 120 GBytes/s of bandwidth to achieve this performance

o   NukadaFFT library 1.0 beta release November 2009

o   Going to look at porting to OpenCL and Nvidia's next-generation architecture, Fermi
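
o   The radix/ordering choice above is essentially a search space that the auto-tuner enumerates and then times on the actual GPU. A small sketch of what that enumeration might look like (the radix set here is illustrative, not necessarily the one NukadaFFT supports):

    def radix_plans(n, radices=(8, 7, 5, 4, 3, 2)):
        """All ways to express an n-point FFT as a sequence of radix passes.

        Each plan is a tuple of radices whose product is n, e.g. 240 -> (4, 4, 3, 5).
        An auto-tuner would pair (a subset of) these plans with thread-count and
        shared-memory padding choices and time the generated kernels.
        """
        if n == 1:
            return [()]
        plans = []
        for r in radices:
            if n % r == 0:
                plans += [(r,) + rest for rest in radix_plans(n // r, radices)]
        return plans

    plans = radix_plans(240)
    print(len(plans), "candidate radix orderings for a 240-point FFT")
    assert (4, 4, 3, 5) in plans           # the decomposition mentioned above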

 

A massively parallel adaptive fast-multipole method on heterogeneous architectures

 

Ilya Lashuk, Aparna Chandramowlishwaran, Harper Langston, Tuan-Anh Nguyen, Rahul Sampath, Aashay Shringarpure, Richard Vuduc, Lexing Ying, Denis Zorin, and George Biros – Georgia Institute of Technology; University of Texas at Austin; New York University

o   Abstract: "We present new scalable algorithms and a new implementation of our kernel-independent fast multipole method (Ying et al. ACM/IEEE SC '03), in which we employ both distributed memory parallelism (via MPI) and shared memory/streaming parallelism (via GPU acceleration) to rapidly evaluate two-body non-oscillatory potentials. On traditional CPU-only systems, our implementation scales well up to 30 billion unknowns on 65K cores (AMD/CRAY-based Kraken system at NSF/NICS) for highly non-uniform point distributions. On GPU-enabled systems, we achieve 30× speedup for problems of up to 256 million points on 256 GPUs (Lincoln at NSF/NCSA) over comparable CPU-only based implementations."

 

Efficient Band Approximation of Gram Matrices for Large Scale Kernel Methods on GPUs

 

Mohamed Hussein, Wael Abd-Almageed

o   Abstract – "Kernel-based methods require O(N^2) time and space complexities to compute and store non-sparse Gram matrices, which is prohibitively expensive for large scale problems. We introduce a novel method to approximate a Gram matrix with a band matrix. Our method relies on the locality preserving properties of space filling curves, and the special structure of Gram matrices. Our approach has several important merits. First, it computes only those elements of the Gram matrix that lie within the projected band. Second, it is simple to parallelize. Third, using the special band matrix structure makes it space efficient and GPU-friendly. We developed GPU implementations for the Affinity Propagation (AP) clustering algorithm using both our method and the COO sparse representation. Our band approximation is about 5 times more space efficient and faster to construct than COO. AP gains up to 6x speedup using our method without any degradation in its clustering performance."

o   Couldn't really get to grips with this – the motivation was hard to follow; my rough reading of the band idea is sketched below anyway
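
o   For what it's worth, here is how I read the basic mechanism from the abstract (a toy of my own, not the authors' code): sort the points along a space-filling curve so that nearby points get nearby indices, then compute and store only the Gram-matrix entries inside a band around the diagonal:

    import numpy as np

    def morton_key(x, y, bits=16):
        """Interleave the bits of quantised (x, y) to get a Z-order index."""
        key = 0
        for b in range(bits):
            key |= ((x >> b) & 1) << (2 * b)
            key |= ((y >> b) & 1) << (2 * b + 1)
        return key

    def banded_gram(points, bandwidth, gamma=1.0):
        """Band approximation of an RBF Gram matrix after Z-order reordering."""
        q = np.floor(points * (2**16 - 1)).astype(int)        # quantise to [0, 2^16)
        order = np.argsort([morton_key(x, y) for x, y in q])  # space-filling order
        p = points[order]
        n = len(p)
        band = np.zeros((n, bandwidth))      # row i holds columns i..i+bandwidth-1
        for i in range(n):
            hi = min(n, i + bandwidth)
            d2 = np.sum((p[i] - p[i:hi]) ** 2, axis=1)
            band[i, :hi - i] = np.exp(-gamma * d2)            # RBF kernel entries
        return band, order

    pts = np.random.rand(1000, 2)            # random points in the unit square
    band, order = banded_gram(pts, bandwidth=32)
    print(band.shape)                        # (1000, 32) instead of a dense 1000x1000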

 

Memory-Efficient Optimization of Gyrokinetic Particle-to-Grid Interpolation for Multicore Processors

 

Kamesh Madduri, Samuel Williams, Stéphane Ethier, Leonid Oliker, John Shalf, Erich Strohmaier, Katherine Yelick – CRD/NERSC, Lawrence Berkeley National Laboratory; Princeton Plasma Physics Laboratory; EECS Department, University of California at Berkeley

o   Abstract – "We present multicore parallelization strategies for the particle-to-grid interpolation step in the Gyrokinetic Toroidal Code (GTC), a 3D particle-in-cell (PIC) application to study turbulent transport in magnetic-confinement fusion devices. Particle-grid interpolation is a known performance bottleneck in several PIC applications. In GTC, this step involves particles depositing charges to a 3D toroidal mesh, and multiple particles may contribute to the charge at a grid point. We design new parallel algorithms for the GTC charge deposition kernel, and analyze their performance on three leading multicore platforms. We implement thirteen different variants for this kernel and identify the best-performing ones given typical PIC parameters such as the grid size, number of particles per cell, and the GTC-specific particle Larmor radius variation. We find that our best strategies can be 2× faster than the reference optimized MPI implementation, and our analysis provides insight into desirable architectural features for high-performance PIC simulation codes."
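
o   The deposition step itself is easy to state – the hard part the paper addresses is doing the scatter in parallel when many particles hit the same grid point (atomics, per-thread grid replicas, particle binning, etc.). A serial 1-D toy of the interpolation (my own sketch with simple linear/cloud-in-cell weights, not GTC's actual gyro-averaged deposition):

    import numpy as np

    NG = 64                                    # number of grid points
    rng = np.random.default_rng(0)
    x = rng.uniform(0, NG - 1, size=100_000)   # particle positions in grid units
    q = np.ones_like(x)                        # unit charge per particle

    i = np.floor(x).astype(int)                # left-hand grid point of each particle
    w = x - i                                  # fractional distance to that point

    rho = np.zeros(NG)
    # np.add.at is an unbuffered scatter-add, so repeated indices accumulate
    # correctly – the serial stand-in for what atomics/reductions do in parallel.
    np.add.at(rho, i,     q * (1 - w))         # deposit onto the left neighbour
    np.add.at(rho, i + 1, q * w)               # deposit onto the right neighbour

    print(rho.sum())                           # total charge == number of particles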