class: center, middle background-image: url(include/title-background.svg) # COMP35112 Chip Multiprocessors
# .white[Parallel Programming Using Shared Memory]
.white[Pierre Olivier]

---

class: inverse, center, middle

# Threads

---

name: proc

# Processes

- A program executing on the CPU runs as a **process**
- With virtual memory, the process gets the illusion it has access to 100% of the memory
- **Address space**

???

- When a program executes on the CPU, on top of the operating system, it runs as an entity named a process
- Using virtual memory, the OS gives to the process the illusion that it can access all the memory, so each process in the system is free to use any address in the lower half of the 64-bit address space, the upper half being reserved for the OS
- The area of memory that the process can address is named the address space

---

template: proc

- Two programs: 2 different processes, i.e. 2 different address spaces
???

- Two processes have two separate address spaces
- For example, here processes A and B both have some data at address 42; however, that data is different

---

template: proc

- Two programs: 2 different processes, i.e. 2 different address spaces
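A minimal C sketch of this point (not part of the original slides; the variable and printed values are purely illustrative): after `fork()`, parent and child run in separate address spaces, so the same global variable, at the same virtual address, ends up holding different values in the two processes.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int x = 42;                 // one copy per process after fork()

int main(void) {
    if (fork() == 0) {      // child process
        x = 100;            // modifies only the child's copy
        printf("child:  x = %d at %p\n", x, (void *)&x);
        return 0;
    }
    wait(NULL);             // parent waits for the child to finish
    printf("parent: x = %d at %p\n", x, (void *)&x);   // still 42
    return 0;
}
```

Both processes print the same virtual address for `x`, yet see different values: each address space maps that address to its own physical memory.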
???

- This is because the virtual address `0x42` is mapped, for each process, to a different physical location by that process's page table
- Indeed each process has its own page table.
- To put things simply, 1 process == 1 virtual address space == 1 page table.
- The OS also makes sure that processes cannot address anything outside of their address space
- To leverage parallelism, two processes, i.e. two programs, can run on two different cores of a multicore processor
- However, how can we use parallelism within the same program?

--

Can run 2 independent processes in parallel on a multicore, but **how to parallelise a single program**?

---

name: threads

# Threads

- A thread is a sequence of instructions executed on a CPU core
- A process consists of one or more threads
- Each process in the system has an address space
- **All threads in the same process share the same address space**

???

- A thread is a flow of control executing a program
- It is a sequence of instructions executed on a CPU core
- The nice thing with threads is that a process can consist of one thread, but also of multiple threads
- In other words, for a single program we can have several execution flows running on different cores and sharing a single address space

---

template: threads
--- template: threads
???

- In our example, process A has 3 threads, and they all see the green and orange data.
- Process B has 4 threads, and they see the same red and yellow data

---

template: threads

.center[**Threads communicate using shared memory**]

???

- Seeing the same address space is very convenient for communication, which can be achieved through global variables or pointers to anywhere in the address space

---

# Threads

- Can program with threads in a plethora of languages:
  - **C/C++**/Fortran **using the POSIX threads library** (Pthreads)
  - Java
  - Python
  - Go
  - Haskell
  - Rust
  - etc.

--

- Thread support in higher-level languages generally builds upon pthreads
  - Most of these languages are implemented in C/C++ or at least contain some C/C++ elements

???

- Threads can be created in many languages, including C/Fortran using the POSIX threads (Pthread) library
- But also Java, as we will see in detail in this course
- And many other languages

---

# Threads in C/C++ with Pthread

.leftcol[
- **POSIX thread library**
- Exposes a standardised C/C++ API to create and manage multiple threads within a process
- For a standard program, the OS automatically creates a single thread
]
.rightcol[
] --- # Threads in C/C++ with Pthread .leftcol[ - Use `pthread_create()` to create and launch a thread - Indicate as parameters: - Which function the thread should run - Optionally what should be passed as parameter to this function ] .rightcol[
] --- # Threads in C/C++ with Pthread .leftcol[ - Use `pthread_create()` to create and launch a thread - Indicate as parameters: - Which function the thread should run - Optionally what should be passed as parameter to this function - `pthread_exit()` to have the calling thread exit ] .rightcol[
] --- # Threads in C/C++ with Pthread .leftcol[ - Pthread -- POSIX thread library - Use `pthread_create()` to create and launch a thread - Indicate as parameters: - Which function the thread should run - Optionally what should be passed as parameter to this function - `pthread_exit()` to have the calling thread exit - `pthread_join()` to wait for another thread to finish ] .rightcol[
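For reference, the prototypes of these three calls, as declared in `<pthread.h>`:

```c
#include <pthread.h>

// create a new thread running start_routine(arg); attr may be NULL
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

// terminate the calling thread, making retval available to a joiner
void pthread_exit(void *retval);

// wait for thread to finish; its return value is stored in *retval if non-NULL
int pthread_join(pthread_t thread, void **retval);
```

`pthread_create()` and `pthread_join()` return 0 on success and an error number otherwise.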
] --- # Threads in C/C++ with Pthread - A good chunk of this course, including labs 1 and 2, will focus on shared memory programming in C/C++ with pthreads - `man pthread_*`
and Google “pthreads” for lots of documentation
- In particular see the Oracle Multithreaded Programming Guide: https://bit.ly/3FGt3k2
- To read them locally, first install the relevant development man pages:
```shell
sudo apt install manpages-dev manpages-posix manpages-posix-dev
```

???

---

# Threads in C/C++ with Pthread

.leftcol[
```c
// Compile with:
// gcc pthread.c -o pthread -lpthread
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
// pthread header

#define NOWORKERS 5

// Function executed by all threads
void *thread_fn(void *arg) {
  int id = (int)(long)arg;
  printf("Thread %d running\n", id);
  pthread_exit(NULL); // exit
  // never reached
}
```
]
.rightcol[
```c
int main(void) {
  // Each thread is controlled through a
  // pthread_t data structure
  pthread_t workers[NOWORKERS];

  // Create and launch the threads
  for(int i=0; i<NOWORKERS; i++)
    pthread_create(&workers[i], NULL, thread_fn,
                   (void *)(long)i);

  // Wait for all the threads to finish
  for(int i=0; i<NOWORKERS; i++)
    pthread_join(workers[i], NULL);

  printf("All done\n");
  return 0;
}
```
.codelink[
`03-shared-memory-programming/pthread.c`
] ] --- # Threads in C/C++ with Pthread
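A minimal sketch (not the course's `pthread.c`; all names here are illustrative) of two things the previous example does not show: passing several arguments to the thread function through a struct, and collecting a return value with `pthread_join()`.

```c
#include <stdio.h>
#include <pthread.h>

struct job { int id; int start; int len; };  // illustrative argument struct

void *worker(void *arg) {
    struct job *j = arg;                     // recover the typed arguments
    long sum = 0;
    for (int i = j->start; i < j->start + j->len; i++)
        sum += i;
    printf("Thread %d summed %d integers\n", j->id, j->len);
    return (void *)sum;                      // handed back to pthread_join()
}

int main(void) {
    pthread_t t;
    struct job j = { .id = 0, .start = 0, .len = 1000 };
    void *result;

    pthread_create(&t, NULL, worker, &j);    // &j must stay valid until the join
    pthread_join(t, &result);                // wait and fetch the return value
    printf("Result: %ld\n", (long)result);
    return 0;
}
```

Casting the result through `void *` works for small integers; larger results are usually returned through memory reachable from the argument struct.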
--- # Threads in C/C++ with Pthread
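Because all the threads of a process share its address space, they can exchange data simply by reading and writing shared variables. A minimal sketch, assuming one shared global array in which each thread fills its own slot (no synchronisation is needed here precisely because the slots are distinct; coordinating accesses to the *same* location is covered later in the course):

```c
#include <stdio.h>
#include <pthread.h>

#define NOWORKERS 5

int results[NOWORKERS];            // shared: visible to every thread

void *worker(void *arg) {
    int id = (int)(long)arg;
    results[id] = id * id;         // each thread writes a different element
    return NULL;
}

int main(void) {
    pthread_t workers[NOWORKERS];

    for (long i = 0; i < NOWORKERS; i++)
        pthread_create(&workers[i], NULL, worker, (void *)i);
    for (int i = 0; i < NOWORKERS; i++)
        pthread_join(workers[i], NULL);

    for (int i = 0; i < NOWORKERS; i++)      // main reads what the threads wrote
        printf("results[%d] = %d\n", i, results[i]);
    return 0;
}
```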
---

# Threads in Java

```java
class MyThread extends Thread { // create a class that inherits from Thread
  int id;
  MyThread(int id) { this.id = id; }

  // override the method run() to define the code executed by the thread:
  public void run() {
    System.out.println("Thread " + id + " running");
  }
}

class Main {
  public static void main(String[] args) {
    int NOWORKERS = 5;
    MyThread[] threads = new MyThread[NOWORKERS];

    for (int i = 0; i < NOWORKERS; i++)
      threads[i] = new MyThread(i);

    for (int i = 0; i < NOWORKERS; i++)
      threads[i].start(); // start executing the threads

    for (int i = 0; i < NOWORKERS; i++)
      try {
        threads[i].join();
      } catch (InterruptedException e) { /* do nothing */ }

    System.out.println("All done");
  }
}
// compile and launch with:
// javac java-thread.java && java Main
```
.codelink[
`03-shared-memory-programming/java-thread.java`
]

???

- Let's see an example
- We have the MyThread class inheriting from Thread
- In the constructor we just initialized an identifier integer member variable
- And we implement the run method, it just prints the fact that the thread is running
- In the main function we create an array of objects from this class
- Each is given a different id
- Then we launch them all in a loop with start, and we wait for each to finish with join

---

# Threads in Python

```python
#!/usr/bin/python3
import threading

NOWORKERS=5

def print_hello(id):
    print("Thread " + str(id) + " running")

threads = []
for i in range(NOWORKERS):
    thread = threading.Thread(target=print_hello, args=(i,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

print("All done.")
```
.codelink[
`03-shared-memory-programming/python.py`
]

---

# Threads in Rust

```rust
use std::thread;

const NOWORKERS: u32 = 5;

fn print_hello(id: u32) {
    println!("Thread {} running", id);
}

fn main() {
    let mut handles = vec![];
    for i in 0..NOWORKERS {
        let handle = thread::spawn(move || { // `move ||` transfers
            print_hello(i);                  // ownership of i inside
        });                                  // the closure
        handles.push(handle);
    }

    for handle in handles {
        // unwrap panics if the thread's execution returns an error
        handle.join().unwrap();
    }

    println!("All done.");
}
// With Rust toolchain installed, compile with:
// rustc
``` .codelink[
`03-shared-memory-programming/rust.rs`
]

---

# Examples Output

.leftcol[
For all examples:
```bash
Thread 1 running
Thread 0 running
Thread 2 running
Thread 4 running
Thread 3 running
All done
```
- **No control over the order of execution!**
- The OS scheduler decides, it's nondeterministic
]

--

.rightcol[
A possible scheduling scenario:
] -- .rightcol[ Another one on 1 core:
]

???

- This is an example of the output we can get, for both the Java and the C examples
- If we run the program multiple times we get different orders of execution
- This is normal, we have no control over this as it is the OS scheduler that decides
- Of course this can be problematic in some situations where we really need to set up a proper order of execution, for example when a thread needs to accomplish a certain task before another thread can start doing its job
- Later in the course we will see how to use synchronisation mechanisms to manage this

---

class: inverse, center, middle

# Data Parallelism
#### Dividing Work between Threads

---

# Data Parallelism

- Simple form of parallelism commonly found in many applications
- Common in computational science applications
- Divide computation into (nearly) equal sized chunks (sketch below)
- Works best when there are **no data dependencies** between chunks

???

- As we saw previously, data parallelism is an easy form of parallelism found in many applications such as computational science
- There is data parallelism when we have some data that is structured in such a way that the operations to be performed on it can easily be parallelised
- This is very suitable for parallelism, the goal is to divide the computations into chunks, computed in parallel

--
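For example, the element-wise sum of two arrays, C[i] = A[i] + B[i], has no dependency between iterations, so the index range can simply be split between threads. A minimal pthread sketch, assuming an even split (the array size, thread count and chunking scheme are illustrative):

```c
#include <pthread.h>

#define N 1024
#define NTHREADS 4                 // assumes NTHREADS divides N evenly

int A[N], B[N], C[N];

void *add_chunk(void *arg) {
    int t = (int)(long)arg;
    int chunk = N / NTHREADS;
    for (int i = t * chunk; i < (t + 1) * chunk; i++)
        C[i] = A[i] + B[i];        // iterations are fully independent
    return NULL;
}

int main(void) {
    pthread_t th[NTHREADS];
    /* A and B would be initialised here */
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, add_chunk, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);
    return 0;
}
```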
- Exploited in multithreading but also at the instruction level in vector/array (SIMD) processors: CPUs (SSE, AVX), and in GPGPUs

???

- You can see an example on the slide in the form of a sum between two arrays
- We sum each element of the same index in A and B and put the result in C
- For example we can have a different thread realise each addition, or fewer threads each taking care of a subset of the additions
- This kind of parallelism is exploited in vector and array processors
- General purpose CPU architectures have vector instruction extensions allowing things like applying the same operation on all elements of an array
- For example Intel x86-64 used to have SSE and now has AVX
- From a very high level perspective, GPUs also work in that way

---

name: mat

# Data Parallelism Example

- **Matrix multiplication** is a good example
- No dependencies between the computation of each member of the result matrix

???

- Let's see a second example in more detail, matrix-matrix multiplication
- We'll use a square matrix for the sake of simplicity

---

template: mat
???

- Remember that with matrix multiplication, each element of the result matrix is computed as the sum of the products of the elements of one row of the first matrix with the elements of one column of the second matrix
- So how can we parallelise this operation?

---

template: mat

- *n*<sup>2</sup> parallel threads (1 per result element; sketch below)
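A minimal pthread sketch of this first strategy (illustrative only, and shown with a tiny *n*, since creating *n*<sup>2</sup> threads is rarely practical): one thread per result element, with the element's indices passed through a heap-allocated struct so that every thread gets its own copy.

```c
#include <stdlib.h>
#include <pthread.h>

#define N 4                        // one thread per element: N*N threads
int A[N][N], B[N][N], C[N][N];

struct cell { int i, j; };         // which result element this thread computes

void *one_cell(void *arg) {
    struct cell *c = arg;
    for (int k = 0; k < N; k++)
        C[c->i][c->j] += A[c->i][k] * B[k][c->j];
    free(c);
    return NULL;
}

int main(void) {
    pthread_t th[N][N];
    /* A and B would be initialised here */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            struct cell *c = malloc(sizeof *c);
            c->i = i; c->j = j;
            pthread_create(&th[i][j], NULL, one_cell, c);
        }
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            pthread_join(th[i][j], NULL);
    return 0;
}
```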
???

- First we can have 1 thread per element of the result matrix, each thread computing the value of the element in question
- With an n x n matrix that gives us n squared threads

---

template: mat

- *n*<sup>2</sup> parallel threads (1 per result element)
- *n* parallel threads (1 per row/column of result)
???

- If we don't have a lot of cores we may want to create fewer threads: it's generally not very efficient to have more threads than cores
- So another strategy is to have a thread per row or per column of the result matrix

---

template: mat

- *n*<sup>2</sup> parallel threads (1 per result element)
- *n* parallel threads (1 per row/column of result)
- *p* parallel threads, each computing *q* rows/columns of the result, where *pq = n* (sketch below)
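A minimal pthread sketch of this last strategy (illustrative; `P` and the contiguous row-block partitioning are choices, with *q* = N/P rows per thread):

```c
#include <pthread.h>

#define N 1000
#define P 4                                  // number of threads, q = N/P rows each
int A[N][N], B[N][N], C[N][N];

void *rows(void *arg) {
    int t = (int)(long)arg;
    int lo = t * (N / P);
    int hi = (t == P - 1) ? N : lo + N / P;  // last thread picks up any remainder
    for (int i = lo; i < hi; i++)            // this thread's block of rows
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
    return NULL;
}

int main(void) {
    pthread_t th[P];
    /* A and B would be initialised here */
    for (long t = 0; t < P; t++)
        pthread_create(&th[t], NULL, rows, (void *)t);
    for (int t = 0; t < P; t++)
        pthread_join(th[t], NULL);
    return 0;
}
```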
??? - Finally, we can also create an arbitrary number of threads by dividing the number of elements of the result matrix by the number of threads that we want and have each thread take care of a subset of the elements --- class: inverse, center, middle # Implicit vs. Explicit Parallelism --- # Implicit vs. Explicit Parallelism - When parallelising a program an important consideration is that of the programmer's **effort**: - **What is the best strategy according to the situation?** - Does the programmer need to be an expert to make that choice? - **How to indicate the chosen strategy in the code?** - If we already have a sequential version of the program, how much code refactoring & new code implementation is needed? --- # NxN Parallel Matrix Multiplication .leftcol[ ```c #define N 1000 int A[N][N]; int B[N][N]; int C[N][N]; int main(int argc, char **argv) { /* init matrices here */ for(int i=0; i
<N; i++)
    for(int j=0; j<N; j++)
      for(int k=0; k<N; k++)
        C[i][j] += A[i][k] * B[k][j];

  return 0;
}
```
.codelink[
`03-shared-memory-programming/matmult.c`
] ] .rightcol[
- Basic sequential code for matrix multiplication
]

---

# NxN Parallel Matrix Multiplication

.leftcol[
```c
*#include <omp.h>
#define N 1000
int A[N][N];
int B[N][N];
int C[N][N];

int main(int argc, char **argv) {
  /* init matrices here */

*#pragma omp parallel
  { /* First loop parallelised */
*#pragma omp for
    for(int i=0; i<N; i++)
      for(int j=0; j<N; j++)
        for(int k=0; k<N; k++)
          C[i][j] += A[i][k] * B[k][j];
  }

  return 0;
}
```
.codelink[
`03-shared-memory-programming/openmp.c`
] ] .rightcol[
- Automatic parallelisation using OpenMP
- Very low programmer effort
- More on OpenMP later
- Compile and run: `gcc openmp.c -o prog -fopenmp && OMP_NUM_THREADS=8 ./prog`

**Parallelising with pthread would require much more code rewrite**
]

???

- So still on our matrix-matrix multiplication example, here is an example of a cool thing we can do in C with a library called OpenMP
- We can parallelise the matrix multiplication operation in a very simple way by adding a simple annotation in the code
- All the iterations of the first outer loop will be done in parallel by different threads
- The framework takes care of managing the threads, and also of spawning the ideal number of threads according to the machine executing the program

---

# Explicit vs. Implicit Parallelism

- **Explicit parallelism**
  - The programmer explicitly spells out what should be done in parallel/sequence
  - Code modifications needed if sequential program already available
  - Examples: reworking a program to use threads or processes for parallel execution

???

- This notion of programmer's effort leads us to the concepts of explicit vs. implicit parallelism
- With explicit parallelism, the programmer has to write more or less code to indicate what should be done in parallel and what should be done in sequence
- This ranges from writing all the thread code manually to just putting some annotations in a sequential program (OpenMP)

--

- **Implicit parallelism**
  - No effort from the programmer, system works out parallelism by itself
  - No code modification over an already existing sequential program
  - Done for example by some languages able to make strong assumptions about data sharing (e.g. pure functions), or with ILP

???

- And with implicit parallelism, the system works out the parallelism fully automatically
- Some languages (e.g. functional languages) have different views of computation which allow a default assumption of parallelism (no updatable shared state between functions)

---

# Example Code for Implicit Parallelism

Fortran 77 sequential array addition:
```fortran
DO I = 1, N
  DO J = 1, N
    C(J, I) = A(J, I) + B(J, I)
  END DO
END DO
```

In Fortran 90 with implicit parallelism:
```fortran
C = A + B
```

---

# Automatic Parallelisation

- In an ideal world, all parallelism would be implicit: the compiler and/or the hardware would **take an ordinary sequential program and derive the parallelism automatically**

???

- So of course implicit parallelism is the best type of parallelism because the programmer does not have to do anything
- If we have a sequential program it would be great if the compiler could automatically extract all the parallelism!

--

- Manufacturers of pre-multicore parallel machines invested considerably in such technology
- Can do quite well if the programs are simple enough but **dependency analysis can be very hard**
- Must be conservative
  - If you cannot be certain that parallel version computes **correct** result, can't parallelise

???

- There was a lot of effort invested in such technologies at the time of the first parallel machines, before the multicore era
- It works well on small programs but in the general case, analysing dependencies in order to define what can be done in parallel and what needs to be done sequentially becomes very hard
- And of course the compiler needs to be conservative not to break the program, i.e.
if it is not 100% sure that two steps can be run in parallel they need to run sequentially
- So overall performance gains through implicit parallelism are quite limited, and to get major gains one needs to go the explicit way

---

# Example Problems for Parallelisation

- Can the compiler automatically parallelise the execution of these loops' iterations?
- I.e. run all or some iterations in parallel

```c
for (int i = 0 ; i < n-3 ; i++) {
  a[i] = a[i+3] + b[i] ;  // at iteration i, read dependency with index i+3
}

for (int i = 5 ; i < n ; i++) {
  a[i] += a[i-5] * 2 ;    // at iteration i, read dependency with index i-5
}

for (int i = 0 ; i < n ; i++) {
  a[i] = a[i + j] + 1 ;   // at iteration i, read dependency with index ???
}
```

???

- Now here are a few examples of dependencies that a compiler may face when trying to extract parallelism automatically
- They all regard parallelising all or some iterations of a loop
- In the first loop, at each iteration we have a read dependency on the element 3 slots after the one being computed
- In the second loop we have another read dependency, this time on the element 5 slots before the one computed at each iteration
- Then in the last case, still a read dependency, we don't know if the slot read is before or after the one being computed
- This is quite important as we are going to see

---

# Automatic Parallelisation

.leftcol[
```c
for (int i=0; i<n-3; i++) {
  a[i] = a[i+3] + b[i] ;
}
```
]
.rightcol[
]

--

Can parallelise all iterations by making a new version of array `a`

```c
parallel_for (int i=0; i<n-3; i++) {
  new_a[i] = a[i+3] + b[i] ;  // reads use the original a, writes go to the new version
}
```
.rightcol[
] --- # Automatic Parallelisation .leftcol[ ```c for (int i = 5 ; i < n ; i++) { a[i] += a[i-5] * 2 ; } ``` - Previous trick does not work here: **this time we read values computed by the loop itself** - At each iteration i we read what was computed by the loop at iteration i-5 - **Solution: limit parallelism to 5** ] .rightcol[
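One way to express this limit concretely (a sketch, not the course's code, using OpenMP as shown earlier rather than the illustrative `parallel_for` notation): process the iterations in sequential blocks of 5, running the 5 iterations of each block in parallel, so that `a[i-5]` has always been produced by an already-completed block.

```c
#define BLK 5   /* dependency distance: at most 5-way parallelism */

void limited(int *a, int n) {
    for (int b = 1; BLK * b < n; b++) {        /* blocks run one after the other  */
        #pragma omp parallel for
        for (int k = 0; k < BLK; k++) {        /* <= 5 iterations run in parallel */
            int i = BLK * b + k;
            if (i < n)
                a[i] += a[i-BLK] * 2;          /* a[i-BLK] comes from block b-1,  */
        }                                      /* or is an original value (b==1)  */
    }
}
```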
]

???

- Now with a dependency such as this one, the trick from the previous slide won't work!
- Because what we read at iteration i is supposed to have been written 5 iterations before, we can't rely on a read-only copy
- Also, parallelising all iterations will break the program
- We observe that at a given time, there is no dependency between sets of 5 iterations, such as the five first ones.
- The solution here is to limit the parallelism to 5

---

# Shared Memory

- For this lecture we assumed that threads **share memory**
  - I.e. they all have access to a common address space
- Multiple threads can read and write in the same memory location (variable, buffer, etc.) through
  - Global variables
  - Pointers and references

--

- No shared memory?
  - Need to communicate via **message passing**
  - Close to a distributed system
  - We'll talk briefly about MPI (Message Passing Interface) later in the course unit

???

- One last thing
- Everything in this lecture has been on the basis that threads share memory
- In other words they can all access the same memory, and they will all see the same data at a given address
- Language scoping still applies, so threads can declare local variables that are private, but shared stuff is accessible from every thread
- If you don’t have shared memory, threads have to communicate via messages – more like a distributed system. We will talk about MPI in another lecture

---

# Summary and Next Lecture

- **In shared memory systems, a parallel program can use threads**
  - Threads communicate implicitly by reading and writing to a common address space
- **A simple form of parallelism: data parallelism**
  - Apply similar operations on chunks of a data set
  - Efficient when there is no data dependency
- **Automatic parallelisation can be limited by such dependencies**
- Shared memory is a practical form of implicit communication
  - It's great that multicores today share memory, right?
  - Next lecture will discuss why this isn’t as simple as it sounds!