# The resurgence of parallel programming languages

### Jamie Hanlon & Simon McIntosh-Smith

University of Bristol Microelectronics Research Group

hanlon@cs.bris.ac.uk







#### The Microelectronics Research Group at the University of Bristol

#### www.cs.bris.ac.uk/Research/Micro





#### The team







Simon McIntosh-Smith, Head of Group

Prof David May

Prof Dhiraj Pradhan









Dr Jose Nunez-Yanez

Dr Kerstin Eder

Dr Simon Hollis

Dr Dinesh Pamunuwa

7 tenured staff, 6 research assistants, 16 PhD students



# Group expertise

#### Energy Aware COmputing (EACO):

- Multi-core and many-core computer architectures
  - Inmos, XMOS, ClearSpeed, Pixelfusion, ...
- Algorithms for heterogeneous architectures (GPUs, OpenCL)
- Electronic and Optical Network on Chip (NoC)
- Reconfigurable architectures (FPGA)
- Design verification (formal and simulation-based), formal specification and analysis
- Silicon process variation
- Fault tolerant design (hardware and software)
- Design methodologies, modelling & simulation of MNT based structures and systems



### Overview

- Parallelism in computing
- Overview and discussion of current parallel languages
  - Chapel (HPC)
  - OpenCL (desktop/embedded)
- Moving forward
  - Heterogeneous System Architecture
  - Research into general purpose highly parallel architectures





#### Didn't parallel computing use to be a niche?





© Simon McIntosh-Smith, Jamie Hanlon

### A long history in HPC...





#### But now parallelism is mainstream



#### Quad-core ARM Cortex A9 CPU, quad-core SGX543MP4+ Imagination GPU




# HPC stronger than ever



- 705,024 SPARC64 processor cores delivering 10.51 petaflops (10 quadrillion calculations per second)
- No GPUs or accelerators
- 9.9 MW



#### 2<sup>nd</sup> fastest computer in the world



- Tianhe-1A in Tianjin, China
- 2.6 petaflops
- 14,336 Intel 2.93 GHz CPUs (57,334 cores)
- 7,168 NVIDIA Tesla M2050 GPUs (100,000 cores)
- 4 MW power consumption



#### Big computing is mainstream too






#### A renaissance in parallel programming

#### CSP

- Erlang
- Occam-pi
- XC

#### GPGPU

- OpenCL
- CUDA
- HMPP
- OpenACC

#### Message-passing

- MPI



#### **Multi-threaded**

- OpenMP
- Cilk
- Go

#### **Object-orientated**

- C++ AMP
- CHARM++

#### PGAS

- Co-array Fortran
- Chapel
- Unified Parallel C
- X10



# Chapel






# Chapel

- Cray development funded by DARPA as part of HPCS program
- Partitioned global address space (PGAS) language
  - Central abstraction is a global array partitioned across a system
- Programmer control of locality by allowing explicit affinity of both tasks and data to locales



### Arrays and distribution

- Several array types
- Can be distributed with a domain map
  - Standard maps are provided, and maps can also be user-defined
- Computation can remain the same regardless of a specific distribution
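
To make the idea of a distribution concrete, here is a minimal C sketch of one possible block distribution: it maps a global array index to the partition that owns it and to an offset within that partition. The names `Placement`, `block_size`, `block_start` and `block_place` are hypothetical, and the even-split rule is an assumption chosen for illustration; it is not taken from Chapel's standard domain maps.

```c
#include <assert.h>
#include <stddef.h>

/* Where a global array index lives under a block distribution. */
typedef struct {
    size_t locale;  /* which partition owns the element */
    size_t offset;  /* index within that partition's local storage */
} Placement;

/* Elements owned by `locale` when n elements are split over p locales.
   Assumption: the first (n % p) locales hold one extra element. */
size_t block_size(size_t n, size_t p, size_t locale) {
    assert(p > 0 && locale < p);
    return n / p + (locale < n % p ? 1 : 0);
}

/* Global index of the first element owned by `locale`. */
size_t block_start(size_t n, size_t p, size_t locale) {
    assert(p > 0 && locale < p);
    size_t extra = n % p;
    return locale * (n / p) + (locale < extra ? locale : extra);
}

/* Map a global index to its owner and local offset. */
Placement block_place(size_t n, size_t p, size_t i) {
    assert(p > 0 && i < n);
    size_t base = n / p, extra = n % p;
    size_t split = extra * (base + 1);  /* first index held by a smaller block */
    Placement pl;
    if (i < split) {
        pl.locale = i / (base + 1);
        pl.offset = i % (base + 1);
    } else {
        pl.locale = extra + (i - split) / base;
        pl.offset = (i - split) % base;
    }
    return pl;
}
```

Code that only ever uses global indices does not change if `block_place` is swapped for a different mapping, which is the property the last bullet describes.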





# Chapel's data parallelism

- Zippered forall:

```
forall (a, b, c) in (A, B, C) do
  a = b + alpha * c;
```

- The loop body sees the i-th element of each iterand
- Works over:
  - distributed arrays
  - arrays with different distributions
  - user-defined iterators: A, B, C could be trees or graphs



### MPI+OpenMP

```
#include <hpcc.h>
#ifdef _OPENMP
#include <omp.h>
#endif

static int VectorSize;
static double *a, *b, *c;

/* Run the kernel on every rank and sum the per-rank error counts on rank 0. */
int HPCC_StarStream(HPCC_Params *params) {
  int myRank, commSize;
  int rv, errCount;
  MPI_Comm comm = MPI_COMM_WORLD;

  MPI_Comm_size( comm, &commSize );
  MPI_Comm_rank( comm, &myRank );

  rv = HPCC_Stream( params, 0 == myRank);
  MPI_Reduce( &rv, &errCount, 1, MPI_INT, MPI_SUM,
    0, comm );

  return errCount;
}

/* Per-rank work: allocate three vectors, initialise b and c, compute a. */
int HPCC_Stream(HPCC_Params *params, int doIO) {
  register int j;
  double scalar;

  VectorSize = HPCC_LocalVectorSize( params, 3,
    sizeof(double), 0 );

  a = HPCC_XMALLOC( double, VectorSize );
  b = HPCC_XMALLOC( double, VectorSize );
  c = HPCC_XMALLOC( double, VectorSize );

  if (!a || !b || !c) {
    if (c) HPCC_free(c);
    if (b) HPCC_free(b);
    if (a) HPCC_free(a);
    if (doIO) {
      fprintf( outFile, "Failed to allocate memory (%d).\n", VectorSize );
      fclose( outFile );
    }
    return 1;
  }

#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (j=0; j<VectorSize; j++) {
    b[j] = 2.0;
    c[j] = 0.0;
  }

  scalar = 3.0;

  /* The triad itself: a = b + scalar * c */
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (j=0; j<VectorSize; j++)
    a[j] = b[j]+scalar*c[j];

  HPCC_free(c);
  HPCC_free(b);
  HPCC_free(a);

  return 0;
}
```

# Composition in Chapel

Data parallelism nested in task parallelism

```
cobegin {
  forall (a, b, c) in (A, B, C) do
    a = b + alpha * c;
  forall (d, e, f) in (D, E, F) do
    d = e + beta * f;
}
```
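
For comparison, a rough C analogue of the `cobegin` example can be sketched with OpenMP directives: two sections act as the concurrent tasks and each contains a data-parallel loop. The function names `triad` and `two_triads` are hypothetical. This is only an approximation of the Chapel semantics: it assumes a compiler with OpenMP enabled for any actual concurrency (without it the pragmas are ignored and the code runs serially with the same result), and many OpenMP runtimes run the inner loops on a single thread unless nested parallelism is switched on.

```c
#include <assert.h>
#include <stddef.h>

/* Data-parallel update x[i] = y[i] + s * z[i] over n elements. */
static void triad(double *x, const double *y, const double *z,
                  double s, size_t n) {
    assert(x && y && z);
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        x[i] = y[i] + s * z[i];
}

/* Two independent data-parallel loops started as concurrent tasks;
   control returns only when both have finished, as with cobegin. */
void two_triads(double *a, const double *b, const double *c, double alpha,
                double *d, const double *e, const double *f, double beta,
                size_t n) {
    #pragma omp parallel sections
    {
        #pragma omp section
        triad(a, b, c, alpha, n);
        #pragma omp section
        triad(d, e, f, beta, n);
    }
}
```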

Task parallelism nested in data parallelism

```
forall a in A {
    if a == 0 then
        begin a = f(a);
    else
        a = g(a);
}
```



# Issues with Chapel

- HPC-orientated: not suitable for general programming, e.g. embedded platforms
- Locales support only a single-level hierarchy
- No load balancing/dynamic resource management
- Too high level? Is it a good abstraction of a parallel machine?





# **OpenCL** (Open Computing Language)






# OpenCL

- Open standard for portable, parallel programming of heterogeneous systems
- Lets programmers write a single portable program that uses all resources in the heterogeneous platform

A modern system includes:

- One or more CPUs
- One or more GPUs
- DSP processors
- ...other devices?





# OpenCL platform model



- One <u>Host</u> + one or more <u>Compute Devices</u>
  - Each Compute Device is composed of one or more Compute Units
  - Each Compute Unit is further divided into one or more <u>Processing Elements</u>

# The BIG idea behind OpenCL

- Replace loops with functions (a <u>kernel</u>) executing at each point in a problem domain (index space).
- E.g., process a 1024 x 1024 image with one kernel invocation per pixel or 1024 x 1024 = 1,048,576 kernel executions

#### **Traditional loops**

```
void
trad_mul(const int n,
         const float *a,
         const float *b,
         float *c) {
  int i;
  for (i=0; i<n; i++)
    c[i] = a[i] * b[i];
}
```


#### Data parallel OpenCL
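
A plain C sketch of the same multiplication written in kernel form: the loop body becomes a function that handles one index, and a driver stands in for the runtime that launches it over the index space. The names `dp_mul_item` and `run_dp_mul` are hypothetical. In OpenCL C the kernel would obtain its index from `get_global_id(0)` rather than from a parameter, and the invocations would run in parallel on the device instead of in a sequential loop.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel body: computes one output element, identified by its index. */
static void dp_mul_item(size_t id, const float *a, const float *b, float *c) {
    c[id] = a[id] * b[id];
}

/* Stand-in for the runtime: invoke the kernel once per point of a
   1-D index space of size n. A real device would run these in parallel. */
void run_dp_mul(size_t n, const float *a, const float *b, float *c) {
    assert(a && b && c);
    for (size_t id = 0; id < n; id++)
        dp_mul_item(id, a, b, c);
}
```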



# OpenCL memory model

- Private Memory
  - Per Work-Item
- Local Memory
  - Shared within a Work-Group
- Global / Constant Memories
  - Visible to all Work-Groups
- Host Memory
  - On the CPU



Memory management is explicit: you must move data from host $\rightarrow$ global $\rightarrow$ local *and* back.
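
The staging pattern can be illustrated in plain C, with ordinary buffers standing in for the different memories. This is a sketch under stated assumptions, not OpenCL API code: `scale_on_device`, `scale_group` and `GROUP_SIZE` are hypothetical names, a heap allocation plays the role of a global device buffer, and a small stack array plays the role of a work-group's local memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GROUP_SIZE 4  /* work-items per work-group (illustrative value) */

/* One work-group: stage a tile into local memory, work on it, write it back. */
static void scale_group(float *global, size_t n, size_t group, float k) {
    float local[GROUP_SIZE];                     /* shared within the group */
    size_t base = group * GROUP_SIZE;
    size_t count = n - base < GROUP_SIZE ? n - base : GROUP_SIZE;
    memcpy(local, global + base, count * sizeof(float));  /* global -> local */
    for (size_t lid = 0; lid < count; lid++)     /* one iteration per work-item */
        local[lid] *= k;
    memcpy(global + base, local, count * sizeof(float));  /* local -> global */
}

/* Host side: returns 0 on success, -1 if the buffer cannot be allocated. */
int scale_on_device(float *host, size_t n, float k) {
    assert(host);
    if (n == 0) return 0;
    float *global = malloc(n * sizeof(float));   /* stand-in for a device buffer */
    if (!global) return -1;
    memcpy(global, host, n * sizeof(float));     /* host -> global */
    size_t groups = (n + GROUP_SIZE - 1) / GROUP_SIZE;
    for (size_t g = 0; g < groups; g++)
        scale_group(global, n, g, k);
    memcpy(host, global, n * sizeof(float));     /* global -> host */
    free(global);
    return 0;
}
```

Every copy in the sketch corresponds to a step the programmer has to write by hand, which is the cost the slide is pointing at.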



# Issues with OpenCL

- It does not compose
  - Disjoint memory address spaces (local/global)
  - Barriers
- It provides no resource management
  - Kernels are a statically allocated resource





#### Heterogeneous System Architecture (HSA)






### Overview

- Announced recently by AMD as a new open architecture specification
  - HSAIL virtual ISA
  - HSA memory model
  - HSA dispatch
- Provides an optimised platform architecture for OpenCL
- Already being adopted by other vendors starting with ARM



#### HSA features: simplifying programming

- Integration of CPU and GPU in silicon
  - Unified memory controller
- Unified address space for CPU and GPU
- Potentially even GPU context switching!
- HSA programming model introduces PGAS-style distributed arrays
  - Memory hierarchy abstraction to address function composition
- First class barrier objects


#### HSA Intermediate Layer (HSAIL)

- Virtual ISA for parallel programs
- Similar idea to LLVM IR: a good target for compilers
- Finalised to specific ISA by a JIT compiler

#### Features:

- Explicitly parallel
- Support for exceptions, virtual functions and other high-level features
- Syscall methods (I/O, printf etc.)
- Debugging support



# HSA memory model

- Compatible with C++11, Java and .NET memory models
- Relaxed consistency



### HSA dispatch

- HSA designed to enable heterogeneous task queuing
  - A work queue per core
  - Distribution of work into queues
  - Load balancing by work stealing
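
The queue-per-core scheme with work stealing can be sketched as a small single-threaded simulation in C. All names here (`WorkQueue`, `queue_push`, `run_with_stealing`) are hypothetical and nothing in it is HSA API: it assumes each core runs one task per round, takes its own newest task first, and otherwise steals the oldest task from the fullest queue. A real implementation would need concurrent, lock-free queues.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 64

/* A fixed-capacity double-ended queue of task ids. */
typedef struct {
    int tasks[MAX_TASKS];
    size_t head;  /* thieves take from here */
    size_t tail;  /* the owner pushes and pops here */
} WorkQueue;

static size_t queue_len(const WorkQueue *q) { return q->tail - q->head; }

void queue_push(WorkQueue *q, int task) {
    assert(q->tail < MAX_TASKS);
    q->tasks[q->tail++] = task;
}

/* Owner takes its newest task; returns 0 if the queue is empty. */
static int pop_own(WorkQueue *q, int *task) {
    if (queue_len(q) == 0) return 0;
    *task = q->tasks[--q->tail];
    return 1;
}

/* Thief takes the victim's oldest task; returns 0 if there is none. */
static int steal(WorkQueue *victim, int *task) {
    if (queue_len(victim) == 0) return 0;
    *task = victim->tasks[victim->head++];
    return 1;
}

/* Round-based simulation: every core runs at most one task per round.
   done[c] receives the number of tasks core c ran; returns the round count. */
size_t run_with_stealing(WorkQueue *queues, size_t cores, size_t *done) {
    size_t rounds = 0;
    for (size_t c = 0; c < cores; c++) done[c] = 0;
    for (;;) {
        size_t ran = 0;
        for (size_t c = 0; c < cores; c++) {
            int task;
            int got = pop_own(&queues[c], &task);
            if (!got) {
                /* Own queue is empty: pick the fullest queue as the victim. */
                size_t victim = c;
                for (size_t v = 0; v < cores; v++)
                    if (queue_len(&queues[v]) > queue_len(&queues[victim]))
                        victim = v;
                got = victim != c && steal(&queues[victim], &task);
            }
            if (got) { done[c]++; ran++; }  /* a real core would execute `task` here */
        }
        if (ran == 0) break;
        rounds++;
    }
    return rounds;
}
```

With four tasks placed on one of two cores, the simulation finishes in two rounds with two tasks per core, where the loaded core alone would need four.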





#### Research into highly parallel & general purpose architectures






# Composability

- Need for general purpose parallel processors
- Must support many algorithms, even within a single application
  - Task farms, pipelines, data parallelism, ...





### A scalable architecture

- 1-1000s of cores per chip
  - Potentially millions of cores in a system
- Regular tiled implementation on chips, modules and boards





# Interconnect performance

- Must provide low latency, high throughput communication
- This must scale well with the number of processors
- Clos & hypercube networks provide these properties but it is assumed they are prohibitively difficult to build
  - Low dimensional meshes seem to be the convention
- Potential in new technology: 3D stacking, silicon substrates, optical interconnections, ...
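
The scaling trade-off between topologies can be made concrete with the standard textbook formulas for diameter (worst-case hop count) and bisection width (links cut when the network is halved). The C sketch below covers only the hypercube and a square 2-D mesh without wraparound links; the function names are hypothetical, and Clos networks are left out because their figures depend on the chosen configuration.

```c
#include <assert.h>

/* Diameter of a hypercube with 2^dims nodes. */
unsigned hypercube_diameter(unsigned dims) {
    return dims;  /* one hop per differing address bit */
}

/* Diameter of a side x side 2-D mesh without wraparound links. */
unsigned mesh2d_diameter(unsigned side) {
    assert(side > 0);
    return 2 * (side - 1);  /* corner to opposite corner */
}

/* Bisection width of a hypercube with 2^dims nodes. */
unsigned long hypercube_bisection(unsigned dims) {
    assert(dims > 0 && dims < 32);
    return 1UL << (dims - 1);  /* every node has one link across the cut */
}

/* Bisection width of a side x side 2-D mesh (even side). */
unsigned long mesh2d_bisection(unsigned side) {
    assert(side > 0 && side % 2 == 0);
    return side;  /* one link per row crosses the cut */
}
```

For 1,024 nodes these give a diameter of 10 and a bisection width of 512 for the hypercube, against 62 and 32 for a 32 x 32 mesh. The price is that each hypercube node needs 10 links where an interior mesh node needs 4, which is where the construction difficulty comes from.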



## Summary

- Parallel languages are going through a renaissance
- Not just for the niche high-end any more
- No silver bullets, lots of "wheel reinventing"
- In HPC, GPUs being adopted quickly at the high-end
- In embedded computing, OpenCL gaining ground
- Movement towards high level general purpose models of parallelism



#### www.cs.bris.ac.uk/Research/Micro





