class: center, middle background-image: url(include/title-background.svg) # COMP35112 Chip Multiprocessors
# .white[Synchronisation in Parallel Programming - Locks and Barriers] .white[Pierre Olivier] ??? - So in the past lectures we have seen how to program a shared memory processor using threads, execution flows that share a common address space, which is very practical for communication - However the examples we have seen were extremely simple, and this is also true for the first lab exercise you are currently working on - The reason I say they are simple is that, first, the threads do not need to sync up at any point, apart from the final join - Furthermore, there isn't much shared data as threads work on subsets of the arrays that are accessed only by one thread, so things are easy. - Today we are going to see mechanisms that make synchronisation and data sharing (especially in write mode) possible --- # Synchronisation Mechanisms - In a multithreaded program, threads rarely execute completely independently: - In many scenarios they need to wait for each other and to communicate by accessing shared data -- - Brings the need for **synchronisation mechanisms** -- - In this lecture we will cover: - **Barriers**: simple mechanism letting threads wait for each other at a given point in their execution - **Locks** allowing threads to safely access shared data - In the next video you will see: - **Condition variables** for event signalling between threads ??? - So as I said we have seen very simplistic examples of multithreaded programs - But in reality, in a traditional program the threads are rarely completely independent: they need to wait for each other at particular points during the computations, they need to communicate and share data, in particular in write mode - To allow all of this we use synchronisation mechanisms, maybe you have already heard of things like barriers, locks, semaphores, etc. 
- These are software constructs that allow things like threads waiting for each other, data sharing and communications - In this lecture we'll cover barriers and locks --- class: inverse, center, middle # Barriers ??? - Let's start with barriers --- name: barrier # Barriers
??? - A barrier allows selected threads to meet up at a certain point in the program execution - Upon reaching the barrier, all threads wait until the last one reaches it - In this example we have the time flowing horizontally - Thread 2 reaches the barrier first, and it starts waiting for all the other threads to reach that point - Threads 1 and 3 then do the same - When thread 4 reaches the barrier, it's the last one so all threads resume execution --- template: barrier - **A barrier is where a number of threads "meet up"** - When all threads have reached it, they can all proceed --- # Barriers
- Very natural when threads are used to implement data parallelism - Want the whole answer from this step before proceeding to the next step --- # Barriers
- Very natural when threads are used to implement data parallelism - Want the whole answer from this step before proceeding to the next step - Would also be used when data dependence limits loop parallelisation - Many barrier primitives allow multiple uses ??? - Barriers are useful in many scenarios - For example with data parallelism, assuming an application is composed of multiple phases or steps - An example could be a first step in which threads first filter the input data, based on some rule, and then in a second step the threads perform some computation on the filtered data - We may want to have a barrier to make sure that the filtering step is done before starting the computing step - Another use case is when, because of data dependencies, we can parallelise only a subset of a loop's iterations at a time - Remember the lecture from 2 weeks ago - We can put a barrier in a loop to ensure that all the parallel iterations in one step are computed before going to the next step - The pthread barrier we will use can be reused, so it's OK to have it within a loop --- # Barrier Example We'll write a small program with 2 threads executing as follows:
- Force thread 2 to execute for a longer time than thread 1 at each iteration --- # Barrier Example We'll write a small program with 2 threads executing as follows:
- Force thread 2 to execute for a longer time than thread 1 at each iteration --- # Barrier Example We'll write a small program with 2 threads executing as follows:
- Force thread 2 to execute for a much longer time than thread 1 at each iteration ??? - So let's have a look at an example of barrier - We'll create 2 threads - What they do is that they perform some kind of computations (green part) - Then they reach the barrier, and they print the fact that they have reached the barrier - We'll make sure that the amount of computations in one thread is much larger than the amount in the other thread, so we should see one thread printing "I have reached the barrier first", then there should be some time before the other thread prints the fact that it reached the barrier. At that point both threads should resume execution and both will print out the fact that they are past the barrier - And we rinse and repeat that in a loop --- # Barrier Example .leftcol[ ```c #define ITERATIONS 10 typedef struct { int id; int spin_amount; pthread_barrier_t *barrier; } worker; ``` ] .rightcol[ ```c void *thread_fn(void *data) { worker *arg = (worker *)data; int id = arg->id; int iteration = 0; while(iteration != ITERATIONS) { /* busy loop to simulate activity */ for(int i=0; i
< arg->spin_amount; i++); printf("Thread %d reached barrier\n", id); int r = pthread_barrier_wait(arg->barrier); if(r!=PTHREAD_BARRIER_SERIAL_THREAD && r) { perror("pthread_barrier_wait"); exit(-1); } printf("Thread %d passed barrier\n", id); iteration++; } pthread_exit(NULL); } ``` .codelink[
`08a-locks-barriers/barrier.c`
] ] --- # Barrier Example ```c #include
<err.h> // for errx /* ... */ #define T1_SPIN_AMOUNT 200000000 #define T2_SPIN_AMOUNT (10 * T1_SPIN_AMOUNT) int main(int argc, char **argv) { pthread_t t1, t2; pthread_barrier_t barrier; worker w1 = {1, T1_SPIN_AMOUNT, &barrier}; worker w2 = {2, T2_SPIN_AMOUNT, &barrier}; if(pthread_barrier_init(&barrier, NULL, 2)) errx(-1, "pthread_barrier_init"); // prints the message and calls exit(-1) if(pthread_create(&t1, NULL, thread_fn, (void *)&w1) || pthread_create(&t2, NULL, thread_fn, (void *)&w2)) errx(-1, "pthread_create"); if(pthread_join(t1, NULL) || pthread_join(t2, NULL)) errx(-1, "pthread_join"); return 0; } ``` .codelink[
`08a-locks-barriers/barrier.c`
] ??? - This is the main function - We define the spinning time of thread 2 as 10 times the spinning time of thread 1 - We create the barrier, indicating that it will concern 2 threads - We create the threads and launch them, and join --- class: inverse, middle, center # Locks ??? - Now let's talk about locks --- # Locks: Motivational Example 1 - **Locks protect shared data from concurrent access** - Why is it needed? - Motivational example: withdrawing cash at a cash machine -- Cash machine pseudo code for the withdrawal operation: ```c int withdrawal = get_withdrawal_amount(); /* amount the user is asking to withdraw */ int total = get_total_from_account(); /* total funds in user account */ /* check whether the user has enough funds in her account */ if(total < withdrawal) abort("Not enough money!"); /* The user has enough money, deduct the withdrawal amount from her total */ total = total - withdrawal; update_total_funds(total); /* give the money to the user */ spit_out_money(withdrawal); ``` ??? 
- So we need locks to protect data that is shared and that can be accessed in write mode by at least 1 thread - Let's motivate why this is very important - Let's assume that we have a cash machine, which supports various operations, among them cash withdrawal by a client - This is the code for the withdrawal function - First the cash machine queries the bank account to get the amount of money in the account - It also gets the amount the user wants to withdraw from some input - The machine then checks that the user has enough money to satisfy the requested withdrawal amount - If not it returns an error - If the check passes, the machine computes the new value for the account balance and updates it, then spits out the money --- name:atm # Locks: Motivational Example 1 .leftcol[ ```c int withdrawal = get_withdrawal_amount(); int total = get_total_from_account(); if(total < withdrawal) abort("Not enough money!"); total = total - withdrawal; update_total_funds(total); spit_out_money(withdrawal); ``` ] ??? - Now this all seems fine when there is only one machine - Let's see what happens when concurrency comes into play - I put a summarised version of the code on the left of the slide, it's exactly the same as on the previous slide --- template:atm .rightcol[ - Assume 2 transactions are happening nearly at the same time - E.g. shared credit card account - Assume: `total == 105`, `withdrawal1 == 100`, `withdrawal2 == 10` - At least one should fail as `(100+10) > 105` ] ??? 
- Let's assume we now have 2 transactions happening at the same time from 2 different cash machines - This could happen in the case of a shared bank account for example - We assume that there is 105 pounds on the account at first, and that the first transaction is a 100 pounds withdrawal while the second is a 10 pounds withdrawal - One of these transactions should fail because there is not enough money to satisfy both --- template:atm .rightcol[.medium[ - `total == 105`, `withdrawal1 == 100`, `withdrawal2 == 10` - A possible scenario: 1. Threads get total in local variable, both get `105` 2. Threads check that `100 < 105` and `10 < 105` - All good 3. Thread 1 updates: - `total = 105 - 100 = 5` - `update_total_funds(5)` 4. Slightly later thread 2 updates: - `total = 105 - 10 = 95` - `update_total_funds(95)` ]] ??? - So we have 105 pounds in the account, and two withdrawals, one of 100 the other of 10 - This is a possible scenario - Both threads get the total in their local variable, both get 105 - Both threads perform the balance check against the withdrawal amount, both pass because 100 < 105 and 10 < 105 - Thread 1 then updates the account balance with 105-100 = 5 and spits out the money - Then thread 2 updates the account, with 105 - 10 = 95 and spits out the money -- **Total withdrawal: 110, and there is 95 left on the account!**
Free money 🤑 - **Race condition:** concurrent operations on shared state should not happen at the same time ??? - So in total not only have we withdrawn more money than there was in the account, but there is also still 95 pounds in the account, this is free money - Of course this behaviour is incorrect - It is called a race condition: shared data (here the account balance) is accessed concurrently, with at least 1 thread writing it (here it is accessed in write mode by both cash machines, i.e. threads) - We need locks to solve that issue, to protect the shared data against races --- name:lock2 # Locks: Motivational Example 2 - Consider the `i++` statement - Could translate into machine code as: ```bash 1. load the current value of i from memory into a register 2. add one to the value stored into the register 3. store from the register to memory the new value of i ``` ??? - Now let's take a second, more low-level example - Consider the i++ statement in a language like Java or C - Let's assume that the compiler or the JVM is transforming this statement into the following machine instructions - First load i into a register - Then increment that register - And then store the new value back in memory --- template:lock2 .leftcol[ **When 2 threads execute `i++`, we expect `i` to end up being incremented twice**. A possible scenario: ] .rightcol[
] ??? - If we have 2 threads doing the ++ operation on a shared variable we can have the following scenario - Thread 1 loads i in a register, it's 7, then increments it, it becomes 8, and then stores 8 back in memory - Next, thread 2 loads i, it's 8, increment it to 9, and stores back 9 - So far so good --- template:lock2 .rightcol[
] .leftcol[ Another possible scenario: **race condition 🔥** - Need **locks** to **serialise** access to shared data, i.e. to ensure that **critical sections** are executed atomically ] ??? - But here is another possible scenario - Both threads load 7 at the same time - Then they increment the local register, it becomes 8 in both cores - And then they both store 8 back - This is incorrect behaviour, once again a race condition because they are accessing a shared variable in write mode without proper synchronisation - Note that this can also happen on a single core, according to how the scheduler order the operations between both threads - Once again we need a lock --- # Critical Sections .leftcol[ - Bits of code in our program where shared data is accessed/updated are called **critical sections** ] .rightcol[ ```c int withdrawal = get_withdrawal_amount(); *int total = get_total_from_account(); *if(total < withdrawal) * abort("Not enough money!"); *total -= withdrawal; *update_total_funds(total); spit_out_money(withdrawal); ``` ] -- - **Lock: synchronisation primitive enforcing limits on the execution of a critical section:** - **Serialisation**: amount of threads that can concurrently execute it (generally 1) - **Atomicity**: when a thread T starts to run a critical section S, T must first finish S before another thread can enter S ??? 
- So the bits of code in a parallel program that access shared data are called critical sections - A lock is a synchronisation mechanism that makes it possible to avoid race conditions, by enforcing some rules for the execution of critical sections - First, it defines how many threads can run a critical section concurrently, generally it is one, this is called serialisation: we prevent parallelism for the critical section - Second, locks make sure that critical sections are executed atomically, in other words once a thread starts to run the critical section, it must finish it before another thread can enter it --- # Locks - Each piece of shared data is protected by a dedicated lock - Critical sections relevant to this piece of shared data are surrounded by `lock` and `unlock` operations on that lock: ```c int withdrawal = get_withdrawal_amount(); lock(account_lock); *int total = get_total_from_account(); *if(total < withdrawal) * abort("Not enough money!"); *total -= withdrawal; *update_total_funds(total); unlock(account_lock); spit_out_money(withdrawal); ``` --- name: locks # Locks - Critical sections can be protected with **locks**: - **Only one thread can hold a given lock at a time** (serialisation) - **Holding the lock lets the thread enter and execute the corresponding critical section in its entirety** before another thread can get the lock and run the critical section (atomicity) -- - A thread wishing to enter the critical section **tries** to take the lock: - A thread attempting to take a free lock will get it - If a lock is not free the thread needs to wait until the lock is released by its holder
--- template: locks
--- template: locks
--- template: locks
--- # Pthreads Mutexes - **Mutexes: mutual exclusion locks** ```c #include
<pthread.h> pthread_mutex_t mutex; void my_thread_function() { pthread_mutex_lock(&mutex); /* critical section here */ pthread_mutex_unlock(&mutex); } ``` -- **Important: in the next examples we omit checking the return codes of pthread functions for the sake of brevity, but note that almost all of them can fail!** --- # Lock Usage Example - **Bounded buffer**: circular fixed-size FIFO producer-consumer buffer
??? - We are going to define the following data structure named bounded buffer - It's a fixed size buffer that can be accessed concurrently by multiple threads - It's a FIFO producer consumer buffer - Threads can deposit data in the buffer - And threads can also extract data - This is done in a first in first out fashion --- # Lock Usage Example ```c typedef struct { int *buffer; // the buffer int max_elements; // size of the buffer int in_index; // index of the next free slot int out_index; // index of the next message to extract int count; // number of used slots pthread_mutex_t lock; // lock protecting the buffer } bounded_buffer; int init_bounded_buffer(bounded_buffer *b, int size) { b->buffer = malloc(size * sizeof(int)); if(!b->buffer) return -1; b->max_elements = size; b->in_index = 0; b->out_index = 0; b->count = 0; pthread_mutex_init(&b->lock, NULL); // mutex initialisation return 0; } void destroy_bounded_buffer(bounded_buffer *b) { pthread_mutex_destroy(&b->lock); // release mutex resources free(b->buffer); } ``` .codelink[
`08a-locks-barriers/lock.c`
] ??? --- # Lock Usage Example (continued) ```c void deposit(bounded_buffer *b, int message) { pthread_mutex_lock(&b->lock); int full = (b->count == b->max_elements); while(full) { // buffer is full, can't deposit! Release the lock and wait a bit // to give another thread a chance to extract an element pthread_mutex_unlock(&b->lock); usleep(100); pthread_mutex_lock(&b->lock); // is the buffer still full? full = (b->count == b->max_elements); } // perform deposit b->buffer[b->in_index] = message; b->in_index = (b->in_index + 1) % b->max_elements; b->count++; pthread_mutex_unlock(&b->lock); } ``` .codelink[
`08a-locks-barriers/lock.c`
] --- # Lock Usage Example (continued) ```c int extract(bounded_buffer *b) { pthread_mutex_lock(&b->lock); int empty = !(b->count); while(empty) { // buffer is empty, nothing to extract! Release the lock and wait a bit // to give another thread a chance to deposit an element pthread_mutex_unlock(&b->lock); usleep(100); pthread_mutex_lock(&b->lock); // is the buffer still empty? empty = !(b->count); } // perform extract int message = b->buffer[b->out_index]; b->out_index = (b->out_index + 1) % b->max_elements; b->count--; pthread_mutex_unlock(&b->lock); return message; } ``` .codelink[
`08a-locks-barriers/lock.c`
] ??? --- # Lock Usage Example (continued) .leftcol[ ```c typedef struct { int iterations; bounded_buffer *bb; } worker; void *deposit_thread_fn(void *data) { worker *w = (worker *)data; for(int i=0; i
< w->iterations; i++) { deposit(w->bb, i); printf("[deposit thread] put %d\n", i); } pthread_exit(NULL); } void *extract_thread_fn(void *data) { worker *w = (worker *)data; for(int i=0; i
< w->iterations; i++) { int x = extract(w->bb); printf("[extract thread] got %d\n", x); } pthread_exit(NULL); } ``` ] .rightcol[ ```c #define BUFFER_SIZE 100 int main(int argc, char **argv) { bounded_buffer b; pthread_t t1, t2; if(init_bounded_buffer(&b, BUFFER_SIZE)) { /* error */ } worker w1 = {BUFFER_SIZE*2, &b}; worker w2 = {BUFFER_SIZE*2, &b}; pthread_create(&t1, NULL, deposit_thread_fn, (void *)&w1); pthread_create(&t2, NULL, extract_thread_fn, (void *)&w2); pthread_join(t1, NULL); pthread_join(t2, NULL); destroy_bounded_buffer(&b); return 0; } ``` .codelink[
`08a-locks-barriers/lock.c`
] ] --- name: race ## Omit the Locks and Madness Ensues 🙃 .leftcol[ - Could have two threads in `deposit()` both writing to the same element of buffer - One value is lost - Could then either increment `in_index` - Once: whole call of `deposit()` lost - Twice: spurious (old) value apparently deposited ] --- template: race .rightcol[
] --- template: race .rightcol[
] --- template: race .rightcol[
] --- template: race .rightcol[
] --- template: race .rightcol[
] --- template: race - Similarly for two calls of `extract()` - Even problems between a call of `deposit()` and one of `extract()`, e.g. both change count -- .center[**Concurrency issues can be *extremely* hard to debug in medium/large scale programs**] ??? - Now if you do not protect critical sections a lot of bad things can happen - First the program may seem to behave normally on small examples, especially when the number of threads is low or when the frequency of access to shared data is low - This is bad because it's hiding race conditions - If two threads call deposit at the same time, they may write to the same slot in the buffer, one value being lost - When they increment the index for the next insertion, it can either be incremented only once, assuming a similar scenario as for the buffer content - in that case we lose one of the inserted values - They can also increment the index twice and, assuming we got the content overwrite problem, we end up with a slot containing a stale value - We can have similar issues in case of unprotected concurrent calls to extract - And additional issues in case of unprotected concurrent calls to deposit and extract at the same time, for example as they both update the number of used slots we may lose consistency for that value - This may manifest in a number of ways in the program behaviour. Sometimes the program can even seem to work fine. - As a result concurrency issues can be extremely hard to reproduce and debug in large programs and it's important to get your locking strategy right from the start --- # Omit the Locks and ... 
```c /* BUGGY version of deposit, without locks */ void deposit(bounded_buffer *b, int message) { while (b->count == b->max_elements); b->buffer[b->in_index] = message; b->in_index = (b->in_index + 1) % b->max_elements; b->count++; } /* BUGGY version of extract, without locks */ int extract(bounded_buffer *b) { while (!(b->count)); int message = b->buffer[b->out_index]; b->out_index = (b->out_index + 1) % b->max_elements; b->count--; return message; } ``` .codelink[
`08a-locks-barriers/lock-buggy.c`
] ??? - Let's see what happens when we omit the locks - We remove the lock and unlock operations, and replace the blocking checks with spinning loops --- # Summary - Threads need to sync up and access shared data - Need for **synchronisation mechanisms such as locks and barriers** - Parts of the program where a shared resource may be accessed by multiple threads at the same time, including at least a write access, are **critical sections** - Locks enforce serial and atomic execution of these critical sections to avoid **race conditions** - Next lecture: **condition variables**, another mechanism useful for event signalling