class: center, middle
background-image: url(include/title-background.svg)

# COMP35112 Chip Multiprocessors
# .white[Load-Linked and Store-Conditional]

.white[Pierre Olivier]

???

- Hello everyone
- In the previous lecture we saw that read-modify-write atomic instructions can implement synchronisation, but that their complexity makes them inefficient in certain situations
- In this video we are going to see a couple of simple hardware instructions that work together to implement a synchronisation mechanism: load-linked and store-conditional

---

# Load-Linked and Store-Conditional

- LL/SC used in RISC processors: ARM, IBM Power, etc.
- 2 instructions:
  - **Load-linked**
  - **Store-conditional**

--

- Slightly different from traditional loads and stores
  - Additional effects on CPU state

--

- **Can act as a pair atomically without holding the bus until completion**

???

- LL/SC is a synchronisation mechanism used in modern RISC processors such as ARM, PowerPC, RISC-V, and so on
- It relies on 2 separate instructions that slightly differ from traditional loads and stores: **load-linked** and **store-conditional**
- They have additional effects on processor state
- This allows them to act atomically as a pair without holding the bus until completion
- We are going to see how that works

---

# Load-Linked

```asm
ldl %r1, %r2
```

- `%r1` ← `*(%r2)`
- In addition, **the core keeps some state for the `ldl`**:
  - Sets the *load-linked flag*
  - Saves the address (from `%r2`) in the *locked address register*

???

- Let's first look at load-linked
- It takes two registers as parameters
- Like a traditional load instruction, load-linked loads register `r1` with the value from memory at the address in register `r2`
- It also sets a special load-linked flag on the core that executes it
- And it records the address taken from `r2` in a register on the core called the locked address register
- So the core holds some state related to this load-linked instruction

---

# Store-Conditional

```asm
stc %r1, %r2
```

- *Tries* to do `*(%r2)` ← `%r1`
- **Only succeeds if the load-linked flag is set**
- Afterwards the value of the load-linked flag is returned in `%r1`
- And the load-linked flag is cleared

???

- Now for store-conditional
- It *tries* to store the value in `%r1` into memory at the address in register `%r2`
- Unlike a traditional store, it only succeeds if the "load-linked flag" is set
- After the store-conditional executes, the value of the "load-linked flag" is returned in `%r1`; this value indicates whether or not the store succeeded
- And if the flag was set, it is cleared

---

name:flag

# The Load-Linked Flag

- **Load-linked flag is cleared if another core writes to the locked address**

???

- The last piece of the puzzle you need to understand is the behaviour of the load-linked flag
- If a core has the load-linked flag set, that flag is cleared when another core writes to the locked address
- This is detected by comparison with the "locked address register"
- The processor must snoop the memory address bus to detect this

---

template: flag
???

- Let's take an example: we have the memory and two cores, each with the LL/SC flag

---

template: flag
???

- Core 1 does a load-linked from address 12 and its flag is set

---

template: flag
???

- But before core 1 does its store-conditional, core 2 writes to address 12
- This automatically clears the flag on core 1

---

template: flag
???

- The subsequent store-conditional from core 1 will fail

--

- Other events clearing the flag: context switches, interrupts
- **Allows us to be sure that an `ldl`/`stc` pair executed atomically with respect to the locked address**

???

- There are also other events that clear the flag: these are control-flow-breaking events not intended by the programmer, such as context switches or exceptions

---

# Our Semaphore with LL/SC

```asm
/* Address of the lock in %r2 */
loop: ldl %r1, %r2   /* load-linked: read the lock value */
      cmp $0, %r1    /* S == 0 (semaphore already taken)? */
      beq loop       /* if so, try again */
      mov $0, %r1    /* looks like it's free, prepare to take the semaphore */
      stc %r1, %r2   /* Try to take it */
      cmp $1, %r1    /* Did the write succeed? */
      bne loop       /* If not, someone beat us to it... try again */
      /* critical section here... */
      st $1, %r2     /* release the semaphore with a simple store */
```

???

- Now let's have a look at how we can implement a semaphore with LL/SC in assembly
- The lock is a simple byte in memory: when it is 1 the semaphore is free, when it is 0 the semaphore is taken
- We assume the address of the lock is stored in `r2`
- We start by doing a load-linked from this address into `r1`
- If the value is 0 the semaphore is taken, so we branch back to the beginning of the loop
- If it's 1 the semaphore seems to be free, so the core must now try to take it
- We put the constant 0 into `r1` and then try to store this value into the byte representing the lock, with a store-conditional to the address that is in `r2`
- Now remember that the store-conditional may fail if someone else wrote to the lock between the moment we read its value with the load-linked and the moment we try the store
- So we check if our store-conditional succeeded by checking if `r1` contains 1
- If not, someone beat us to the lock and we have to try again
- If we got the lock, we execute the critical section and release the lock with a simple store

---

# Our Semaphore with LL/SC

```asm
/* Address of the lock in %r2 */
loop: ldl %r1, %r2   /* load-linked: read the lock value */
*     cmp $0, %r1    /* S == 0 (semaphore already taken)? */
*     beq loop       /* if so, try again */
*     mov $0, %r1    /* looks like it's free, prepare to take the semaphore */
      stc %r1, %r2   /* Try to take it */
      cmp $1, %r1    /* Did the write succeed? */
      bne loop       /* If not, someone beat us to it... try again */
      /* critical section here... */
      st $1, %r2     /* release the semaphore with a simple store */
```

- Highlighted code executes atomically with respect to the memory location pointed to by `%r2`
- Any write to that location, interrupt, or context switch during the atomic part will lead to the `stc` failing

???
- Now let me repeat that a bit just to be 100% clear
- We have this yellow part of the code, which is the part we would like to execute atomically: between the moment the thread thinks that the lock is free and the moment it actually takes the lock
- If any other code manages to take the lock (in other words, writes to the address in `r2`) during the highlighted section, or if something breaks the execution flow like an interrupt or a context switch (the thread is descheduled), the `stc` executed as part of the lock-taking operation will fail and the loop will repeat
- Otherwise, everything between `ldl` and `stc` must have executed atomically, so the core has the lock

---

# The Power of LL/SC

- Single atomic RMW instructions such as `tas` atomically combine load (R) and store (W) operations, and a potential in-register update (M)

--

- The code starting with LL and ending with SC is *not atomic in absolute terms*
  - Even for an LL directly followed by an SC

--

- However LL/SC allows us to determine **whether anything between the LL and the SC has executed atomically or not, with respect to the corresponding address**

???

- With an instruction like `tas`, the load and the store are guaranteed to be atomic by the RMW behaviour
- When using LL/SC, we know whether anything **between** the `ldl` and the `stc` has executed atomically *with respect to the synchronisation variable*
- This can be more powerful than `tas` for certain forms of usage:
  - For example LL/SC are an easy and efficient way to implement fetch-and-add and friends (a small sketch is included as a bonus slide at the end of this deck)
  - They also reduce the number of special instructions that need to be supported by the architecture

---

# Spinlock

- All our implementations of `wait(S)` use a busy loop
  - Called **busy waiting** or **spinning**
  - **Spinlocks**

--

- Hurt performance when contended

--

- **Sleeping would be more efficient**
  - Such higher-level locks can be called *mutexes*
  - Requires OS support

???

- One last thing before we wrap up
- All the implementations of the wait operation we developed use a busy loop
- Threads never go to sleep and keep trying to get the lock over and over again
- This is called busy waiting or spinning, and such locks are named spinlocks
- They monopolise the processor and can hurt performance when the lock is contended
- Putting waiting threads to sleep, in other words scheduling them out of the CPU, would be much more efficient from the resource usage point of view
- There are various more sophisticated forms of locking to address this
- They are implemented at the OS level and the basic hardware support is the same

---

# Summary

- Atomic RMW instructions: inefficient/hard to implement in some situations
- LL/SC address these problems
  - Atomic execution with respect to a particular memory location
  - Without locking the cache coherence bus
- Next lecture: can we use such atomic instructions, **instead of locks**?

???

- So in summary
- Atomic read-modify-write instructions can be used but they are inefficient in some situations
- Atomic load-linked and store-conditional instructions address that issue
- The next lecture investigates whether we can use such atomic instructions, **instead of locks**, to implement so-called *lock-free data structures*
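
---

# Bonus: Fetch-and-Add with LL/SC

- A minimal sketch of how fetch-and-add could be built from an `ldl`/`stc` pair, written in the same illustrative pseudo-assembly as the semaphore example (not any particular real ISA):

```asm
/* Address of a shared counter in %r2 */
retry: ldl %r1, %r2   /* load-linked: read the current counter value */
       add $1, %r1    /* increment the value in a register */
       stc %r1, %r2   /* try to store it back; success flag returned in %r1 */
       cmp $1, %r1    /* did the store-conditional succeed? */
       bne retry      /* another core wrote to the counter: try again */
```

- If another core writes to the counter between the `ldl` and the `stc`, the `stc` fails and the loop simply retries with the freshly loaded value

???

- This is only an illustrative sketch following the conventions of this lecture's pseudo-assembly; real ISAs (for example `ldxr`/`stxr` on Arm or `lr`/`sc` on RISC-V) have their own mnemonics and constraints
- The point is that no dedicated fetch-and-add instruction is needed: the `ldl`/`stc` retry loop achieves the same effect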