class: center, middle
background-image: url(include/title-background.svg)

# COMP35112 Chip Multiprocessors
# .white[Hardware Support for Synchronisation]

.white[Pierre Olivier]

???
- Hi everyone
- Previously in the course we have seen how synchronisation is crucial to parallel programming
- In this video we are going to see the hardware mechanisms on modern CPUs that make synchronisation possible

---

# Implementing Synchronisation

- Shared-memory programming requires synchronisation mechanisms to protect shared data

???
- We have seen that when programming a multithreaded application that relies on shared memory for communication between threads, synchronisation mechanisms such as locks are required to protect critical sections, so that we can properly reason about the program's behaviour

--

- Mechanisms can take several forms
  - But all are closely related and most can be built from the others

???
- There are multiple mechanisms, but they are all closely related and most can be built on top of the others
- For example, we can easily build a barrier with a lock
- We have also seen that a source code statement such as `i++` can be translated by the compiler into several instructions, so it is not executed atomically; on its own it is unfit for critical sections
- How can we be sure that the lock and unlock operations themselves will be executed atomically?
- After all, the lock operation can involve 2 memory accesses: checking if the lock is available (reading a memory slot) and taking the lock (writing to the memory slot)

--

- **Their implementation usually requires hardware support**

???
- The implementation of the lock and unlock operations requires hardware support so that they can be atomic

--

- We'll have a look at one of the simplest constructs
  - A lock called **binary semaphore**, in a processor with a snoopy cache
  - Can be held by at most 1 thread
  - Waiting threads use busy-waiting

???
- We will illustrate this with the simplest lock, a binary semaphore
- Only a single thread can hold it
- The waiting threads will spin
- And we'll do that in a CPU with a snoopy cache

---

# Example: Binary Semaphore

- It's a single shared "boolean" variable **`S`** whose value is used to protect a shared resource
  - `S == 0` ➡ resource is free
  - `S == 1` ➡ resource is in use

???
- The binary semaphore is a single shared boolean variable in memory
- When it is 0 the protected resource is free, i.e. the lock is ready to be taken
- When it is 1 the resource is busy, i.e. the lock is taken

--

- Semaphore operations (should be **atomic**):
  - **`wait(S)`**: wait until `S != 1` then set `S = 1` (i.e. take the lock)
  - **`signal(S)`**: set `S = 0` (i.e. release the lock)

???
- We have the lock and unlock operations
- Lock can also be called wait
- It consists in waiting for the lock to be free, i.e. for the variable to become different from 1, then setting it to 1 to indicate that we take the lock
- Note that this should be atomic: if two threads see that the lock is free and set the variable at the same time, things won't work
- Then we have the unlock operation, signal: we release the lock by setting the variable back to 0
- This one should also be atomic; a priori that's easier, as it can be translated to a simple store, but even then we need to be careful: what if the variable is 64 bits wide and we are on a 32-bit CPU? The compiler may translate `S = 0` into two store operations!

---

# Semaphore Usage to Protect Critical Sections

- **Critical sections** are the code sections where shared resources are manipulated
- `S` should be initialised as `0`

???
- Remember that critical sections are the parts of your code that access shared data, and you want them to be executed in a serial fashion, i.e. by 1 thread at a time, and atomically, i.e. when 1 thread starts to execute the critical section, it should finish before another thread can enter that critical section
- We can achieve that with the lock
- We initialise the lock to 0, i.e. ready to be taken
- We take the lock before the critical section; we know that only one thread will atomically succeed and set it to 1, and the others will wait on the lock for it to go back to 0
- Which is the case when the first thread releases the lock
- So we have the critical section executed serially and atomically

---

# Atomicity Needed

- How to implement `wait(S)`?

???
- So how can we implement the lock-taking operation, wait?

--

.leftcol[
```c
// naive implementation in C:
while(S == 1);
S = 1;
```
]

???
- Intuitively, we would do something like that in the source code
- While the semaphore value is 1 we spin, waiting
- Once it reaches 0 we set it to 1 to indicate that we got the lock

--

.rightcol[
```c
// address of `S` in `%r2`
*loop: ldr %r1, %r2
*      cmp %r1, $1
*      beq loop
       str $1, %r2
```
]

???
- Well, that does not work, because once this is compiled it's not atomic at the machine code level
- It is compiled into something like that
- We assume that the address of the lock variable is in r2
- We load its value into a register, r1, with a load operation
- Then we compare it to 1, and if it is equal to 1 we loop
- Otherwise we store 1 in the variable

--

.leftcol[
- What if another thread changes the value of `S`?
]

???
- In that situation, atomicity is not guaranteed
- A scenario like this can happen
- First, thread 1 does the load and thread 2 does the same right after
- Both do the comparison, see that the value is not 1 so they don't branch, and then they both do the store

--

.rightcol[
**Both threads got the lock!**

The lock itself is a shared data structure...
]

???
- Both have taken the lock! That's bad

---

# Atomic Instructions

- We need to ensure that the execution of `wait()` is "indivisible"
  - I.e. it should be **atomic**: once a thread starts to execute `wait()` it should first finish it before any other thread can start it

???
- So it is clear that wait must be atomic, i.e. the entire lock-taking operation must be done at once

--

- Requires special instructions to be supported in hardware: **atomic instructions**
  - Special CPU instructions that realise a few operations atomically
  - Operations are generally a memory load, a comparison, and possibly a memory write

???
- This requires special instructions supported by the hardware, called atomic instructions

--

- Implementing synchronisation primitives like `wait()` with these instructions involves a compromise between complexity and performance

???
- There are various ways for the hardware to implement these operations, with different performance/complexity tradeoffs

--

- Also note that variable `S` when accessed will be cached, possibly in several core caches, and the desired atomic behaviour might require coherence operations in the cache

???
- Also note that the implementation of these operations needs to take into account the fact that variable S may be cached in different caches

---

# Atomic Test-And-Set Instruction

- Simple solution in older CPUs, e.g. Motorola 68K

```c
tas %r2
```

- If the memory location addressed by `%r2` contains `0`, switch its content to `1` and set the CPU "zero" flag, otherwise clear the zero flag

???
- A classic example of atomic instruction is test-and-set
- It is present in many CPUs
- It takes a register containing an address as a parameter and performs, atomically, the following things
- It checks if the memory slot addressed by the register contains 0, and if it is the case, it sets the content of that memory slot to 1, as well as the CPU zero flag
- If the memory slot was not 0, the zero flag is cleared
- Here is an example of usage in pseudo-assembly: we just do a test-and-set (`tas`) on a memory slot whose address is present in the register r2

--

- **Instruction-level behaviour is atomic**
  - Cannot be interrupted
  - No other core can modify what is pointed by `%r2` in memory while the `tas` runs

???
- This is atomic because the test-and-set guarantees that only one thread will be able to read and possibly modify the lock variable at a time

---

# Atomic Test-And-Set Instruction
???
- Let's illustrate with an example
- Say we want to do a test-and-set on address 0x42
- We place that address in a register, say `%r2`
- And we do `tas %r2`
- Now assume address 0x42 contains 0
- Test-and-set will then switch it to 1, and will also set the zero flag of the core to 1

--
???
- Now, on the contrary, if before the test-and-set address 0x42 contains something other than 0, for example 1
- Test-and-set won't touch it and will clear the zero flag on the core

---

# Our Semaphore with `tas`

- Remember that for our semaphore:
  - Lock is free when `S == 0`
  - Lock is taken when `S == 1`
- How to implement `wait()` and `signal()` with test-and-set?

--

.leftcol[
Wait operation (taking the lock):
```c
// Address of S in %r2
// Loops (i.e. wait) while [%r2] != 0
loop: tas %r2
      bnz loop // branch if zero flag not set
```
]

.rightcol[
Signal operation (releasing the lock):
```c
// We assume that basic store operations
// are atomic
// Address of S in %r2
str $0, %r2
```
]

???
- Now how can we implement our semaphore with the test-and-set atomic instruction?
- We have our lock S in memory at a given address
- The operation of attempting to take the lock, wait, is on the left
- We have the address of our lock in `%r2`
- We try a test-and-set
- If the lock's value is 1, we consider that the lock is not free
- In that case the test-and-set will not modify the lock's value and will clear the zero flag
- So the branch-if-not-zero will go back to the loop label and we try again, i.e. we spin until the lock is free
- When the lock is free its value will be 0, so test-and-set will set it to 1, we won't branch, and we'll continue with the lock held
- The lock release operation is on the right
- We assume here that the store assembly instruction is atomic
- Releasing the lock simply consists in setting the value of the lock to 0, to indicate that it is free to take
- Note that this matches the meaning of the lock's values that we gave previously: 0 means the lock is free, and 1 means it is taken
- The lock also needs to be initialised to 0

---

# What About the Cache?
- Semaphore operation with test-and-set is reasonably obvious if S is a single shared variable **in memory**
  - Just `tas` for `wait()` and `str` for `signal()` on that single variable

???
- If we assume a system with no cache, the semaphore operations with test-and-set are pretty straightforward
- S is a single shared variable in memory, and it is locked with test-and-set and released with a simple store

--

- `tas` is an atomic **read-modify-write** (RMW) instruction and as such it is expensive:
  - May involve 2 memory accesses (R & W)
  - Locks the access to memory from other processors to ensure atomicity

???
- One thing to note is that operations such as test-and-set are classified as read-modify-write, because these are the three operations they perform atomically, and for these to work the CPU prevents access to memory from the other cores during the execution of the RMW operation by a core

--

- By definition `S` is shared: this is the fundamental purpose of a semaphore
- **Processors are therefore likely to end up with a copy of `S` in their cache**

???
- Now, in a real CPU with a cache, because of the shared nature of S, it is likely that S will be cached
- So how does that work?

---

# Test-and-Set and the Cache

- Assume shared variable `S` is in the cache
- Only when a `tas` succeeds (reads 0) must it then write a 1

???
- So if the shared variable is in the cache

--

- When `tas` starts, we don't know if a write will be needed or not...
- ... if it is, we need to send an invalidate message to other cores

???
- `tas` reads from the cache, and if it sees the variable is 0 it must then perform a write, so it needs to send an invalidate message to other cores

--

- **So the processor must 'lock' the snoopy bus for every multiprocessor `tas` operation**
  - Cannot let any other core do a write

???
- So for all of this to be atomic, the core needs to lock the snoopy bus for the duration of the test-and-set
- This basically prevents all the other cores from doing a write, in case they would try to write to the shared variable
- Of course, this affects performance

--

- But if it ends up reading a '1' (lock not available), this locking of the bus was wasted because the `tas` was read-only...

???
- Also note that in the case the lock is taken, i.e. if the `tas` reads a 1, locking the bus was wasted

---

# Test-and-Set and the Cache

- Assume one thread has the lock: `S` is busy
- Another wanting the semaphore will read this busy value and cache it
- It will then sit in a loop continually executing a `tas` until `S` becomes free

???
- Assume one thread has the lock and another thread tries to take it: the other core is going to cache the value, and it's going to spin in a loop, continually executing the test-and-set until the lock becomes free

--

- **All this time it will be wasting bus cycles**
  - Slowing down cache coherence traffic from other cores

???
- Given that test-and-set locks the snoopy bus and prevents writes from all cores, this seriously impacts performance

--

- Can address that issue with a simple re-formulation of the wait operation: **test-and-test-and-set**
  - Tries to minimise the number of costly test-and-set operations

???
- We can avoid this issue with a trick, a particular way to use test-and-set, called test-and-test-and-set

---

# Test-and-test-and-set

- How to implement `wait()` with test-and-test-and-set?

.leftcol[
- In pseudo-code:
```c
do {
  while(test(S) == 1);     // traditional ldr
} while (test-and-set(S)); // tas
```
]

???
- It works like that, first in pseudo-code
- We have our loop, and in the loop body we spin as long as the lock seems to be taken
- For this we use a normal test, one that does not lock the snoopy bus
- And only when the lock seems to be available, i.e.
when the test sees that the variable is 0, we do a test-and-set
- This test-and-set has good chances to succeed
- And if it does not, we go back to spinning with a simple test

--

.rightcol[
- In assembly:
```asm
loop: ldr %r1, %r2 /* address of S in %r2 */
      cmp %r1, $1
      beq loop
      tas %r2
      bnz loop /* branch if zero flag not set */
```
]

???
- In assembly it looks like that
- We load the value of the semaphore variable into r1 (its address is in r2)
- We do the simple test: if the value is 1 the lock is not available and we loop
- If it is not 1, we can try the test-and-set
- If the test-and-set does not succeed, we loop

--

- Key idea:
  - Most of the time we busy-wait with a standard `ldr`
  - **Only once S is seen to be free, a (costly) `tas` is made**

???
- The inner loop spins while the S variable is seen to be busy
- Only once S is seen to be free, a (costly) `tas` is made
- Hopefully, most times it will succeed
- It will only fail if another core manages to get in between the `cmp` and `tas` instructions
- The waiting loop is only executing a normal `ldr` most of the time, which is internal to the core and its cache, so no bus cycles are wasted

---

# Other Synchronisation Primitives

- Other machine-level atomic instructions:

???
- There are more atomic operations implemented in modern CPUs

--

- **fetch-and-add**: returns the value of a memory location and increments it

```c
// in pseudo-code
fetch_and_add(addr, incr) {
  old_val = *addr;
  *addr += incr;
  return old_val;
}
```

???
- Fetch-and-add atomically returns the value of a memory location before incrementing it

---

# Other Synchronisation Primitives

- Other machine-level atomic instructions:
  - **fetch-and-add**: returns the value of a memory location and increments it
  - **compare-and-swap**: compares the value of a memory location with a value (in a register) and swaps in another value (in a register) if they are equal

```c
// in pseudo-code
compare_and_swap(addr, comp, new_val) {
  if(*addr != comp)
    return false;
  *addr = new_val;
  return true;
}
```

???
- Compare-and-swap compares the value of a memory location with a given value and, if they are equal, swaps in another value present in a register

---

# Other Synchronisation Primitives

- Other machine-level atomic instructions:
  - **fetch-and-add**: returns the value of a memory location and increments it
  - **compare-and-swap**: compares the value of a memory location with a value (in a register) and swaps in another value (in a register) if they are equal
- **All these instructions are RMW with the need to lock the snoopy bus during their execution**

???
- These are all read-modify-write instructions that lock the snoopy bus during their execution

--

- Not really desirable with all CPU designs:
  - Doesn't fit well with simple RISC pipelines, where RMW is really a CISC instruction requiring a memory load operation, a comparison, and possibly a store operation

???
- These operations are not very practical to implement on RISC processors, because each instruction is supposed to be simple, and RMW operations are quite complex by construction

---

# Lock-Free Data Structures

- Atomic instructions can be used for purposes other than implementing locks
- **Lock-free data structures**: data structures that can be accessed concurrently without locks, through the use of atomic instructions
  - Lists, stacks, etc.
- They are generally **hard to implement**:
  - Updating their state requires more than the single memory store operation done by RMW instructions
  - Hard to know when a member of the data structure can be freed in languages without a garbage collector (e.g.
C/C++)
- Benefits: they can be **faster than lock-based data structures**

---

# Lock-Free Data Structures

- Lock-free queue implementation examples:
  - In Java: https://github.com/olivierpierre/comp35112-devcontainer/tree/main/10-hardware-synchronisation/lock-free-queue-java
    - Not too hard because Java has a GC, still not entirely trivial
  - In C: https://github.com/olivierpierre/comp35112-devcontainer/tree/main/10-hardware-synchronisation/lock-free-queue-c
    - A bit convoluted/hacky
- More info: https://www.baeldung.com/lock-free-programming and "The Art of Multiprocessor Programming" chapters 10 and 11

---

# Summary

- Synchronisation requires support from the hardware to ensure that critical code sections are executed atomically
- Atomic read-modify-write instructions can be used, but they are costly and hard to support on RISC CPUs
- Next lecture: how to address these issues by breaking an atomic RMW operation into two instructions working together: *load-linked* and *store-conditional*

???
- Let's wrap up
- To ensure that critical code sections are protected, we need lock operations to be performed atomically
- Atomic read-modify-write instructions can be used for that purpose
- However, they can be inefficient in some situations, and hard to implement on some processors
- In the next lecture, we'll see how these issues can be addressed with a couple of special instructions implemented in RISC processors: load-linked and store-conditional