class: center, middle
background-image: url(include/title-background.svg)

# COMP35112 Chip Multiprocessors
# .white[Hardware Support for Synchronisation]

.white[Pierre Olivier]

???
- Hi everyone
- Previously in the course we have seen how synchronisation is crucial to parallel programming
- In this video we are going to see the hardware mechanisms on modern CPUs that make synchronisation possible

---

# Implementing Synchronisation

- Shared-memory programming requires synchronisation mechanisms to protect shared data

???
- We have seen that when programming a multithreaded application that relies on shared memory for communication between threads, synchronisation mechanisms such as locks are required to protect critical sections, so that we can properly reason about the program's behaviour

--

- Mechanisms can take several forms
  - But all are closely related and most can be built from the others

???
- There are multiple mechanisms, but they are all closely related and most can be built on top of the others
- For example, we can easily build a barrier with a lock
- We have also seen that a source code statement such as `i++` can be translated by the compiler into several instructions, so it is not executed atomically; on its own it is unfit for critical sections
- How can we be sure that the lock and unlock operations themselves will be executed atomically?
- After all, the lock operation can involve 2 memory accesses: checking if the lock is available (reading a memory slot) and taking the lock (writing to the memory slot)

--

- **Their implementation usually requires hardware support**

???
- The implementation of the lock and unlock operations requires hardware support so that they can be atomic

--

- We'll have a look at one of the simplest constructs
  - A lock called **binary semaphore**, in a processor with a snoopy cache
  - Can be held by at most 1 thread
  - Waiting threads use busy-waiting

???
- We will illustrate this with the simplest lock, a binary semaphore
- Only a single thread can hold it
- The waiting threads will spin
- And we'll do that in a CPU with a snoopy cache

---

# Example: Binary Semaphore

- It's a single shared "boolean" variable **`S`** whose value is used to protect a shared resource
  - `S == 0` ➡ resource is free
  - `S == 1` ➡ resource is in use

???
- The binary semaphore is a single shared boolean variable in memory
- When it is 0 the protected resource is free, i.e. the lock is ready to be taken
- When it is 1 the resource is busy, i.e. the lock is taken

--

- Semaphore operations (should be **atomic**):
  - **`wait(S)`**: wait until `S != 1` then set `S = 1` (i.e. take the lock)
  - **`signal(S)`**: set `S = 0` (i.e. release the lock)

???
- We have the lock and unlock operations
- Lock can also be called wait
- It consists in waiting for the lock to be free, i.e. for the variable to become different from 1, then setting it to 1 to indicate that we take the lock
- Note that this should be atomic: if two threads see that the lock is free and set the variable at the same time, things won't work
- Then we have the unlock operation, signal: we release the lock by setting the variable back to 0
- This one should also be atomic; a priori that's easier, as it can be translated to a simple store, but even then we need to be careful: what if the variable is 64 bits wide and we are on a 32-bit CPU? The compiler may translate `S = 0` into two store operations!

---

# Semaphore Usage to Protect Critical Sections

- **Critical sections** are the code sections where shared resources are manipulated
- `S` should be initialised as `0`

???
- Remember that critical sections are the parts of your code that access shared data, and you want them to be executed in a serial fashion, i.e. by 1 thread at a time, and atomically, i.e. when 1 thread starts to execute the critical section, it should finish before another thread can enter that critical section
- We can achieve that with the lock
- We initialise the lock to 0, i.e. ready to be taken
- We take the lock before the critical section; we know that only one thread will atomically succeed and set it to 1, and the others will wait on the lock for it to go back to 0
- Which is the case when the first thread releases the lock
- So we have the critical section executed serially and atomically

---

# Atomicity Needed

- How to implement `wait(S)`?

???
- So how can we implement the lock-taking operation, wait?

--

.leftcol[
```c
// naive implementation in C:
while(S == 1);
S = 1;
```
]

???
- Intuitively, we would do something like that in the source code
- While the semaphore value is 1 we spin, waiting
- Once it reaches 0 we set it to 1 to indicate that we got the lock

--

.rightcol[
```c
// address of `S` in `%r2`
*loop: ldr %r1, %r2
*      cmp %r1, $1
*      beq loop
       str $1, %r2
```
]

???
- Well, that does not work, because once this is compiled it's not atomic at the machine code level
- It is compiled into something like that
- We assume that the address of the lock variable is in r2
- We load its value into a register, r1, with a load operation
- Then we compare it to 1, and if it is equal to 1 we loop
- Otherwise we store 1 in the variable

--

.leftcol[
- What if another thread changes the value of `S`?
]

???
- In that situation, atomicity is not guaranteed
- A scenario like this can happen
- First, thread 1 does the load and thread 2 does the same right after
- Both do the comparison, see that the value is not 1 so they don't branch, and then they both do the store

--

.rightcol[
**Both threads got the lock!**

The lock itself is a shared data structure...
]

???
- Both have taken the lock! That's bad

---

# Atomic Instructions

- We need to ensure that the execution of `wait()` is "indivisible"
  - I.e. it should be **atomic**: once a thread starts to execute `wait()` it should first finish it before any other thread can start it

???
- So it is clear that wait must be atomic, i.e. the entire lock-taking operation must be done at once

--

- Requires special instructions to be supported in hardware: **atomic instructions**
  - Special CPU instructions that realise a few operations atomically
  - Operations are generally a memory load, a comparison, and possibly a memory write

???
- This requires special instructions supported by the hardware, called atomic instructions

--

- Implementing synchronisation primitives like `wait()` with these instructions involves a compromise between complexity and performance

???
- There are various ways for the hardware to implement these operations, with different performance/complexity tradeoffs

--

- Also note that variable `S` when accessed will be cached, possibly in several core caches, and the desired atomic behaviour might require coherence operations in the cache

???
- Also note that the implementation of these operations needs to take into account the fact that variable S may be cached in different caches

---

# Atomic Test-And-Set Instruction

- Simple solution in older CPUs, e.g. Motorola 68K

```c
tas %r2
```

- If the memory location addressed by `%r2` contains `0`, switch its content to `1` and set the CPU "zero" flag, otherwise clear the zero flag

???
- A classic example of atomic instruction is test-and-set
- It is present in many CPUs
- It takes a register containing an address as a parameter and performs, atomically, the following things
- It checks if the memory slot addressed by the register contains 0, and if it is the case, it sets the content of that memory slot to 1, as well as the CPU zero flag
- If the memory slot was not 0, the zero flag is cleared
- Here is an example of usage in pseudo-assembly: we just do a test-and-set (`tas`) on a memory slot whose address is present in the register r2

--

- **Instruction-level behaviour is atomic**
  - Cannot be interrupted
  - No other core can modify what is pointed by `%r2` in memory while the `tas` runs

???
- This is atomic because the test-and-set guarantees that only one thread will be able to read and possibly modify the lock variable at a time

---

# Atomic Test-And-Set Instruction
???
- Let's illustrate with an example
- Say we want to do a test-and-set on address 0x42
- We place that address in a register, say `%r2`
- And we do `tas %r2`
- Now assume address 0x42 contains 0
- Test-and-set will then switch it to 1, and will also set the zero flag of the core to 1

--
???
- Now, on the contrary, if before the test-and-set address 0x42 contains something other than 0, for example 1
- Test-and-set won't touch it and will clear the zero flag on the core

---

# Our Semaphore with `tas`

- Remember that for our semaphore:
  - Lock is free when `S == 0`
  - Lock is taken when `S == 1`
- How to implement `wait()` and `signal()` with test-and-set?

--

.leftcol[
Wait operation (taking the lock):
```c
// Address of S in %r2
// Loops (i.e. wait) while [%r2] != 0
loop: tas %r2
      bnz loop // branch if zero flag not set
```
]

.rightcol[
Signal operation (releasing the lock):
```c
// We assume that basic store operations
// are atomic
// Address of S in %r2
str $0, %r2
```
]

???
- Now how can we implement our semaphore with the test-and-set atomic instruction?
- We have our lock S in memory at a given address
- The operation of attempting to take the lock, wait, is on the left
- We have the address of our lock in `%r2`
- We try a test-and-set
- If the lock's value is 1, we consider that the lock is not free
- In that case the test-and-set will not modify the lock's value and will clear the zero flag
- So the branch-if-not-zero will go back to the loop label and we try again, i.e. we spin until the lock is free
- When the lock is free its value will be 0, so test-and-set will set it to 1, we won't branch, and we'll continue with the lock held
- The lock release operation is on the right
- We assume here that the store assembly instruction is atomic
- Releasing the lock simply consists in setting the value of the lock to 0, to indicate that it is free to take
- Note that this matches the meaning of the lock's values that we gave previously: 0 means the lock is free, and 1 means it is taken
- The lock also needs to be initialised to 0

---

# What About the Cache?
- Semaphore operation with test-and-set is reasonably obvious if S is a single shared variable **in memory**
  - Just `tas` for `wait()` and `str` for `signal()` on that single variable

???
- If we assume a system with no cache, the semaphore operations with test-and-set are pretty straightforward
- S is a single shared variable in memory, and it is locked with test-and-set and released with a simple store

--

- `tas` is an atomic **read-modify-write** (RMW) instruction and as such it is expensive:
  - May involve 2 memory accesses (R & W)
  - Locks the access to memory from other processors to ensure atomicity

???
- One thing to note is that operations such as test-and-set are classified as read-modify-write, because these are the three operations they perform atomically, and for these to work the CPU prevents access to memory from the other cores during the execution of the RMW operation by a core

--

- By definition `S` is shared: this is the fundamental purpose of a semaphore
- **Processors are therefore likely to end up with a copy of `S` in their cache**

???
- Now, in a real CPU with a cache, because of the shared nature of S, it is likely that S will be cached
- So how does that work?

---

# Test-and-Set and the Cache

- Assume shared variable `S` is in the cache
- Only when a `tas` succeeds (reads 0) must it then write a 1

???
- So if the shared variable is in the cache

--

- When `tas` starts, we don't know if a write will be needed or not...
- ... if it is, we need to send an invalidate message to other cores

???
- `tas` reads from the cache, and if it sees the variable is 0 it must then perform a write, so it needs to send an invalidate message to other cores

--

- **So the processor must 'lock' the snoopy bus for every multiprocessor `tas` operation**
  - Cannot let any other core do a write

???
- So for all of this to be atomic, the core needs to lock the snoopy bus for the duration of the test-and-set
- This basically prevents all the other cores from doing a write, in case they would try to write to the shared variable
- Of course, this affects performance

--

- But if it ends up reading a '1' (lock not available), this locking of the bus was wasted because the `tas` was read-only...

???
- Also note that in the case the lock is taken, i.e. if the `tas` reads a 1, locking the bus was wasted

---

# Test-and-Set and the Cache

- Assume one thread has the lock: `S` is busy
- Another wanting the semaphore will read this busy value and cache it
- It will then sit in a loop continually executing a `tas` until `S` becomes free

???
- Assume one thread has the lock and another thread tries to take it: the other core is going to cache the value, and it's going to spin in a loop, continually executing the test-and-set until the lock becomes free

--

- **All this time it will be wasting bus cycles**
  - Slowing down cache coherence traffic from other cores

???
- Given that test-and-set locks the snoopy bus and prevents writes from all cores, this seriously impacts performance

--

- Can address that issue with a simple re-formulation of the wait operation: **test-and-test-and-set**
  - Tries to minimise the number of costly test-and-set operations

???
- We can avoid this issue with a trick, a particular way to use test-and-set, called test-and-test-and-set

---

# Test-and-test-and-set

- How to implement `wait()` with test-and-test-and-set?

.leftcol[
- In pseudo-code:
```c
do {
  while(test(S) == 1);     // traditional ldr
} while (test-and-set(S)); // tas
```
]

???
- It works like that, first in pseudo-code
- We have our loop, and in the loop body we spin as long as the lock seems to be taken
- For this we use a normal test, one that does not lock the snoopy bus
- And only when the lock seems to be available, i.e.
when the test sees that the variable is 0, we do a test-and-set
- This test-and-set has good chances to succeed
- And if it does not, we go back to spinning with a simple test

--

.rightcol[
- In assembly:
```asm
loop: ldr %r1, %r2 /* address of S in %r2 */
      cmp %r1, $1
      beq loop
      tas %r2
      bnz loop /* branch if zero flag not set */
```
]

???
- In assembly it looks like that
- We load the value of the semaphore variable into r1 (its address is in r2)
- We do the simple test: if the value is 1 the lock is not available and we loop
- If it is not 1, we can try the test-and-set
- If the test-and-set does not succeed, we loop

--

- Key idea:
  - Most of the time we busy-wait with a standard `ldr`
  - **Only once S is seen to be free, a (costly) `tas` is made**

???
- The inner loop spins while the S variable is seen to be busy
- Only once S is seen to be free, a (costly) `tas` is made
- Hopefully, most times it will succeed
- It will only fail if another core manages to get in between the `cmp` and `tas` instructions
- The waiting loop is only executing a normal `ldr` most of the time, which is internal to the core and its cache, so no bus cycles are wasted

---

# Other Synchronisation Primitives

- Other machine-level atomic instructions:

???
- There are more atomic operations implemented in modern CPUs

--

- **fetch-and-add**: returns the value of a memory location and increments it

```c
// in pseudo-code
fetch_and_add(addr, incr) {
  old_val = *addr;
  *addr += incr;
  return old_val;
}
```

???
- Fetch-and-add atomically returns the value of a memory location before incrementing it

---

# Other Synchronisation Primitives

- Other machine-level atomic instructions:
  - **fetch-and-add**: returns the value of a memory location and increments it
  - **compare-and-swap**: compares the value of a memory location with a value (in a register) and swaps in another value (in a register) if they are equal

```c
// in pseudo-code
compare_and_swap(addr, comp, new_val) {
  if(*addr != comp)
    return false;
  *addr = new_val;
  return true;
}
```

???
- Compare-and-swap compares the value of a memory location with a given value and, if they are equal, swaps in another value present in a register

---

# Other Synchronisation Primitives

- Other machine-level atomic instructions:
  - **fetch-and-add**: returns the value of a memory location and increments it
  - **compare-and-swap**: compares the value of a memory location with a value (in a register) and swaps in another value (in a register) if they are equal
- **All these instructions are RMW with the need to lock the snoopy bus during their execution**

???
- These are all read-modify-write instructions that lock the snoopy bus during their execution

--

- Not really desirable with all CPU designs:
  - Doesn't fit well with simple RISC pipelines, where RMW is really a CISC instruction requiring a memory load operation, a comparison, and possibly a store operation

???
- These operations are not very practical to implement on RISC processors, because each instruction is supposed to be simple, and RMW operations are quite complex by construction

---

# Lock-Free Data Structures

- Atomic instructions can be used for purposes other than implementing locks
- **Lock-free data structures**: data structures that can be accessed concurrently without locks, through the use of atomic instructions
  - Lists, stacks, etc.
- They are generally **hard to implement**:
  - Updating their state requires more than the single memory store operation done by RMW instructions
  - Hard to know when a member of the data structure can be freed in languages without a garbage collector (e.g.
C/C++)
- Benefits: they can be **faster than lock-based data structures**

---

# Lock-Free Data Structures

- Lock-free queue implementation examples:
  - In Java: https://github.com/olivierpierre/comp35112-devcontainer/tree/main/10-hardware-synchronisation/lock-free-queue-java
    - Not too hard because Java has a GC, still not entirely trivial
  - In C: https://github.com/olivierpierre/comp35112-devcontainer/tree/main/10-hardware-synchronisation/lock-free-queue-c
    - A bit convoluted/hacky
- More info: https://www.baeldung.com/lock-free-programming and "The Art of Multiprocessor Programming" chapters 10 and 11

---

# Summary

- Synchronisation requires support from the hardware to ensure that critical code sections are executed atomically
- Atomic read-modify-write instructions can be used, but they are costly and hard to support on RISC CPUs
- Next lecture: how to address these issues by breaking an atomic RMW operation into two instructions working together: *load-linked* and *store-conditional*

???
- Let's wrap up
- To ensure that critical code sections are protected, we need lock operations to be performed atomically
- Atomic read-modify-write instructions can be used for that purpose
- However, they can be inefficient in some situations, and hard to implement on some processors
- In the next lecture, we'll see how these issues can be addressed with a couple of special instructions implemented in RISC processors: load-linked and store-conditional