class: center, middle background-image: url(include/title-background.svg) # COMP35112 Chip Multiprocessors
# .white[Synchronisation in Parallel Programming - Locks and Barriers] .white[Pierre Olivier] ??? - So in the past lectures we have seen how to program a shared memory processor using threads, execution flows that share a common address space, which is very practical for communication - However the examples we have seen were extremely simple, and this is also true for the first lab exercise you are currently working on - The reason I say they are simple is that, first, the threads do not need to sync up at any point, apart from the final join - Furthermore, there isn't much shared data as threads work on subsets of the arrays that are accessed only by one thread, so things are easy. - Today we are going to see mechanisms that make synchronisation and data sharing (especially in write mode) possible --- # Synchronisation Mechanisms - In a multithreaded program, threads rarely execute completely independently: - In many scenarios they need to wait for each other and to communicate by accessing shared data -- - Brings the need for **synchronisation mechanisms** -- - In this lecture we will cover: - **Barriers**: simple mechanism letting threads wait for each other at a given point in their execution - **Locks** allowing threads to safely access shared data - In the next video you will see: - **Condition variables** for event signalling between threads ??? - So as I said we have seen very simplistic examples of multithreaded programs - But in reality, in a traditional program the threads are rarely completely independent: they need to wait for each other at particular points during the computations, they need to communicate and share data, in particular in write mode - To allow all of this we use synchronisation mechanisms, maybe you have already heard of things like barriers, locks, semaphores, etc. 
- These are software constructs that allow things like threads waiting for each other, data sharing and communications - In this lecture we'll cover barriers and locks --- class: inverse, center, middle # Barriers ??? - Let's start with barriers --- name: barrier # Barriers
??? - A barrier allows selected threads to meet up at a certain point in the program execution - Upon reaching the barrier, all threads wait until the last one reaches it - In this example we have the time flowing horizontally - Thread 2 reaches the barrier first, and it starts waiting for all the other threads to reach that point - Threads 1 and 3 then do the same - When thread 4 reaches the barrier, it's the last one so all threads resume execution --- template: barrier - **A barrier is where a number of threads "meet up"** - When all threads have reached it, they can all proceed --- # Barriers
- Very natural when threads are used to implement data parallelism - Want the whole answer from this step before proceeding to the next step --- # Barriers
- Very natural when threads are used to implement data parallelism - Want the whole answer from this step before proceeding to the next step - Would also be used when data dependence limits loop parallelisation - Many barrier primitives allow multiple uses ??? - Barriers are useful in many scenarios - For example with data parallelism, assuming an application is composed of multiple phases or steps - An example could be a first step in which threads first filter the input data, based on some rule, and then in a second step the threads perform some computation on the filtered data - We may want to have a barrier to make sure that the filtering step is done before starting the computing step - Another use case is when, because of data dependencies, we can parallelise only a subset of a loop's iterations at a time - Remember the lecture from 2 weeks ago - We can put a barrier in a loop to ensure that all the parallel iterations in one step are computed before going to the next step - The pthread barrier we will use can be reused, so it's OK to have it within a loop --- # Barrier Example We'll write a small program with 2 threads executing as follows:
- Force thread 2 to execute for a longer time than thread 1 at each iteration --- # Barrier Example We'll write a small program with 2 threads executing as follows:
- Force thread 2 to execute for a longer time than thread 1 at each iteration --- # Barrier Example We'll write a small program with 2 threads executing as follows:
- Force thread 2 to execute for a much longer time than thread 1 at each iteration ??? - So let's have a look at an example of barrier - We'll create 2 threads - What they do is that they perform some kind of computations (green part) - Then they reach the barrier, and they print the fact that they have reached the barrier - We'll make sure that the amount of computations in one thread is much larger than the amount in the other thread, so we should see one thread printing "I have reached the barrier first", then there should be some time before the other thread prints the fact that it reached the barrier. At that point both threads should resume execution and both will print out the fact that they are past the barrier - And we rinse and repeat that in a loop --- # Barrier Example .leftcol[ ```c #define ITERATIONS 10 typedef struct { int id; int spin_amount; pthread_barrier_t *barrier; } worker; ``` ] .rightcol[ ```c void *thread_fn(void *data) { worker *arg = (worker *)data; int id = arg->id; int iteration = 0; while(iteration != ITERATIONS) { /* busy loop to simulate activity */ for(int i=0; i
< arg->spin_amount; i++); printf("Thread %d reached barrier\n", id); int r = pthread_barrier_wait(arg->barrier); if(r!=PTHREAD_BARRIER_SERIAL_THREAD && r) { perror("pthread_barrier_wait"); exit(-1); } printf("Thread %d passed barrier\n", id); iteration++; } pthread_exit(NULL); } ``` .codelink[
`08a-locks-barriers/barrier.c`
] ] --- # Barrier Example ```c #include
<err.h> // for errx /* ... */ #define T1_SPIN_AMOUNT 200000000 #define T2_SPIN_AMOUNT (10 * T1_SPIN_AMOUNT) int main(int argc, char **argv) { pthread_t t1, t2; pthread_barrier_t barrier; worker w1 = {1, T1_SPIN_AMOUNT, &barrier}; worker w2 = {2, T2_SPIN_AMOUNT, &barrier}; if(pthread_barrier_init(&barrier, NULL, 2)) errx(-1, "pthread_barrier_init"); // prints the message and calls exit(-1) if(pthread_create(&t1, NULL, thread_fn, (void *)&w1) || pthread_create(&t2, NULL, thread_fn, (void *)&w2)) errx(-1, "pthread_create"); if(pthread_join(t1, NULL) || pthread_join(t2, NULL)) errx(-1, "pthread_join"); return 0; } ``` .codelink[
`08a-locks-barriers/barrier.c`
] ??? - This is the main function - We define the spinning time of thread 2 as 10 times the spinning time of thread 1 - We create the barrier, indicating that it will concern 2 threads - We create the threads and launch them, and join --- class: inverse, middle, center # Locks ??? - Now let's talk about locks --- # Locks: Motivational Example 1 - **Locks protect shared data from concurrent access** - Why is it needed? - Motivational example: withdrawing cash at a cash machine -- Cash machine pseudo code for the withdrawal operation: ```c int withdrawal = get_withdrawal_amount(); /* amount the user is asking to withdraw */ int total = get_total_from_account(); /* total funds in user account */ /* check whether the user has enough funds in her account */ if(total < withdrawal) abort("Not enough money!"); /* The user has enough money, deduct the withdrawal amount from her total */ total = total - withdrawal; update_total_funds(total); /* give the money to the user */ spit_out_money(withdrawal); ``` ??? 
- So we need locks to protect data that is shared and that can be accessed in write mode by at least 1 thread - Let's motivate why this is very important - Let's assume that we have a cash machine, which supports various operations, among them cash withdrawal by a client - This is the code for the withdrawal function - First the cash machine queries the bank account to get the amount of money in the account - It also gets the amount the user wants to withdraw from some input - The machine then checks that the user has enough money to satisfy the requested withdrawal amount - If not it returns an error - If the check passes, the machine computes the new value for the account balance and updates it, then spits out the money --- name:atm # Locks: Motivational Example 1 .leftcol[ ```c int withdrawal = get_withdrawal_amount(); int total = get_total_from_account(); if(total < withdrawal) abort("Not enough money!"); total = total - withdrawal; update_total_funds(total); spit_out_money(withdrawal); ``` ] ??? - Now this all seems fine when there is only one machine - Let's see what happens when concurrency comes into play - I put a summarised version of the code on the left of the slide, it's exactly the same as on the previous slide --- template:atm .rightcol[ - Assume 2 transactions are happening nearly at the same time - E.g. shared credit card account - Assume: `total == 105`, `withdrawal1 == 100`, `withdrawal2 == 10` - At least one should fail as `(100+10) > 105` ] ??? 
- Let's assume we now have 2 transactions happening at the same time from 2 different cash machines - This could happen in the case of a shared bank account for example - We assume that there is 105 pounds on the account at first, and that the first transaction is a 100 pounds withdrawal while the second is a 10 pounds withdrawal - One of these transactions should fail because there is not enough money to satisfy both --- template:atm .rightcol[.medium[ - `total == 105`, `withdrawal1 == 100`, `withdrawal2 == 10` - A possible scenario: 1. Threads get total in local variable, both get `105` 2. Threads check that `100 < 105` and `10 < 105` - All good 3. Thread 1 updates: - `total = 105 - 100 = 5` - `update_total_funds(5)` 4. Slightly later thread 2 updates: - `total = 105 - 10 = 95` - `update_total_funds(95)` ]] ??? - So we have 105 pounds in the account, and two withdrawals, one of 100 the other of 10 - This is a possible scenario - Both threads get the total in their local variable, both get 105 - Both threads perform the balance check against the withdrawal amount, both pass because 100 < 105 and 10 < 105 - Thread 1 then updates the account balance with 105-100 = 5 and spits out the money - Then thread 2 updates the account, with 105 - 10 = 95 and spits out the money -- **Total withdrawal: 110, and there is 95 left on the account!**
Free money 🤑 - **Race condition:** concurrent operations on shared state should not happen at the same time ??? - So in total not only have we withdrawn more money than there was in the account, but there is also still 95 pounds in the account, this is free money - Of course this behaviour is incorrect - It is called a race condition: shared data (here the account balance) is accessed concurrently, with at least 1 thread writing it (here it is accessed in write mode by both cash machines, i.e. threads) - We need locks to solve that issue, to protect the shared data against races --- name:lock2 # Locks: Motivational Example 2 - Consider the `i++` statement - Could translate into machine code as: ```bash 1. load the current value of i from memory into a register 2. add one to the value stored into the register 3. store from the register to memory the new value of i ``` ??? - Now let's take a second, more low-level example - Consider the i++ statement in a language like Java or C - Let's assume that the compiler or the JVM is transforming this statement into the following machine instructions - First load i into a register - Then increment that register - And then store the new value back in memory --- template:lock2 .leftcol[ **When 2 threads execute `i++`, we expect `i` to end up being incremented twice**. A possible scenario: ] .rightcol[
] ??? - If we have 2 threads doing the ++ operation on a shared variable we can have the following scenario - Thread 1 loads i in a register, it's 7, then increments it, it becomes 8, and then stores 8 back in memory - Next, thread 2 loads i, it's 8, increment it to 9, and stores back 9 - So far so good --- template:lock2 .rightcol[
] .leftcol[ Another possible scenario: **race condition 🔥** - Need **locks** to **serialise** access to shared data, i.e. to ensure that **critical sections** are executed atomically ] ??? - But here is another possible scenario - Both threads load 7 at the same time - Then they increment the local register, it becomes 8 in both cores - And then they both store 8 back - This is incorrect behaviour, once again a race condition because they are accessing a shared variable in write mode without proper synchronisation - Note that this can also happen on a single core, according to how the scheduler order the operations between both threads - Once again we need a lock --- # Critical Sections .leftcol[ - Bits of code in our program where shared data is accessed/updated are called **critical sections** ] .rightcol[ ```c int withdrawal = get_withdrawal_amount(); *int total = get_total_from_account(); *if(total < withdrawal) * abort("Not enough money!"); *total -= withdrawal; *update_total_funds(total); spit_out_money(withdrawal); ``` ] -- - **Lock: synchronisation primitive enforcing limits on the execution of a critical section:** - **Serialisation**: amount of threads that can concurrently execute it (generally 1) - **Atomicity**: when a thread T starts to run a critical section S, T must first finish S before another thread can enter S ??? 
- So the bits of code in a parallel program that access shared data are called critical sections - A lock is a synchronisation mechanism that makes it possible to avoid race conditions, by enforcing some rules for the execution of critical sections - First, it defines how many threads can run a critical section concurrently, generally it is one, this is called serialisation: we prevent parallelism for the critical section - Second, locks make sure that critical sections are executed atomically, in other words once a thread starts to run the critical section, it must finish it before another thread can enter it --- # Locks - Each piece of shared data is protected by a dedicated lock - Critical sections relevant to this piece of shared data are surrounded by `lock` and `unlock` operations on that lock: ```c int withdrawal = get_withdrawal_amount(); lock(account_lock); *int total = get_total_from_account(); *if(total < withdrawal) * abort("Not enough money!"); *total -= withdrawal; *update_total_funds(total); unlock(account_lock); spit_out_money(withdrawal); ``` --- name: locks # Locks - Critical sections can be protected with **locks**: - **Only one thread can hold a given lock at a time** (serialisation) - **Holding the lock lets the thread enter and execute the corresponding critical section in its entirety** before another thread can get the lock and run the critical section (atomicity) -- - A thread wishing to enter the critical section **tries** to take the lock: - A thread attempting to take a free lock will get it - If a lock is not free the thread needs to wait until the lock is released by its holder
--- template: locks
--- template: locks
--- template: locks
--- # Pthreads Mutexes - **Mutexes: mutual exclusion locks** ```c #include
<pthread.h> pthread_mutex_t mutex; void my_thread_function() { pthread_mutex_lock(&mutex); /* critical section here */ pthread_mutex_unlock(&mutex); } ``` -- **Important: in the next examples we omit checking the return codes of pthread functions for the sake of brevity, but note that almost all of them can fail!** --- # Lock Usage Example - **Bounded buffer**: circular fixed-size FIFO producer-consumer buffer
??? - We are going to define the following data structure named bounded buffer - It's a fixed size buffer that can be accessed concurrently by multiple threads - It's a FIFO producer consumer buffer - Threads can deposit data in the buffer - And threads can also extract data - This is done in a first in first out fashion --- # Lock Usage Example ```c typedef struct { int *buffer; // the buffer int max_elements; // size of the buffer int in_index; // index of the next free slot int out_index; // index of the next message to extract int count; // number of used slots pthread_mutex_t lock; // lock protecting the buffer } bounded_buffer; int init_bounded_buffer(bounded_buffer *b, int size) { b->buffer = malloc(size * sizeof(int)); if(!b->buffer) return -1; b->max_elements = size; b->in_index = 0; b->out_index = 0; b->count = 0; pthread_mutex_init(&b->lock, NULL); // mutex initialisation return 0; } void destroy_bounded_buffer(bounded_buffer *b) { pthread_mutex_destroy(&b->lock); // release mutex resources free(b->buffer); } ``` .codelink[
`08a-locks-barriers/lock.c`
] ??? --- # Lock Usage Example (continued) ```c void deposit(bounded_buffer *b, int message) { pthread_mutex_lock(&b->lock); int full = (b->count == b->max_elements); while(full) { // buffer is full, can't deposit! Release the lock and wait a bit // to give another thread a chance to extract an element pthread_mutex_unlock(&b->lock); usleep(100); pthread_mutex_lock(&b->lock); // is the buffer still full? full = (b->count == b->max_elements); } // perform deposit b->buffer[b->in_index] = message; b->in_index = (b->in_index + 1) % b->max_elements; b->count++; pthread_mutex_unlock(&b->lock); } ``` .codelink[
`08a-locks-barriers/lock.c`
] --- # Lock Usage Example (continued) ```c int extract(bounded_buffer *b) { pthread_mutex_lock(&b->lock); int empty = !(b->count); while(empty) { // buffer is empty, nothing to extract! Release the lock and wait a bit // to give another thread a chance to deposit an element pthread_mutex_unlock(&b->lock); usleep(100); pthread_mutex_lock(&b->lock); // is the buffer still empty? empty = !(b->count); } // perform extract int message = b->buffer[b->out_index]; b->out_index = (b->out_index + 1) % b->max_elements; b->count--; pthread_mutex_unlock(&b->lock); return message; } ``` .codelink[
`08a-locks-barriers/lock.c`
] ??? --- # Lock Usage Example (continued) .leftcol[ ```c typedef struct { int iterations; bounded_buffer *bb; } worker; void *deposit_thread_fn(void *data) { worker *w = (worker *)data; for(int i=0; i
< w->iterations; i++) { deposit(w->bb, i); printf("[deposit thread] put %d\n", i); } pthread_exit(NULL); } void *extract_thread_fn(void *data) { worker *w = (worker *)data; for(int i=0; i
< w->iterations; i++) { int x = extract(w->bb); printf("[extract thread] got %d\n", x); } pthread_exit(NULL); } ``` ] .rightcol[ ```c #define BUFFER_SIZE 100 int main(int argc, char **argv) { bounded_buffer b; pthread_t t1, t2; if(init_bounded_buffer(&b, BUFFER_SIZE)) { /* error */ } worker w1 = {BUFFER_SIZE*2, &b}; worker w2 = {BUFFER_SIZE*2, &b}; pthread_create(&t1, NULL, deposit_thread_fn, (void *)&w1); pthread_create(&t2, NULL, extract_thread_fn, (void *)&w2); pthread_join(t1, NULL); pthread_join(t2, NULL); destroy_bounded_buffer(&b); return 0; } ``` .codelink[
`08a-locks-barriers/lock.c`
] ] --- name: race ## Omit the Locks and Madness Ensues 🙃 .leftcol[ - Could have two threads in `deposit()` both writing to the same element of buffer - One value is lost - Could then either increment `in_index` - Once: whole call of `deposit()` lost - Twice: spurious (old) value apparently deposited ] --- template: race .rightcol[
] --- template: race .rightcol[
] --- template: race .rightcol[
] --- template: race .rightcol[
] --- template: race .rightcol[
] --- template: race - Similarly for two calls of `extract()` - Even problems between a call of `deposit()` and one of `extract()`, e.g. both change count -- .center[**Concurrency issues can be *extremely* hard to debug in medium/large scale programs**] ??? - Now if you do not protect critical sections a lot of bad things can happen - First the program may seem to behave normally on small examples, especially when the number of threads is low or when the frequency of access to shared data is low - This is bad because it's hiding race conditions - If two threads call deposit at the same time, they may write to the same slot in the buffer, one value being lost - When they increment the index for the next insertion, it can either be incremented only once, assuming a similar scenario as for the buffer content - in that case we lose one of the inserted values - They can also increment the index twice and, assuming we got the content overwrite problem, we end up with a slot containing a stale value - We can have similar issues in case of unprotected concurrent calls to extract - And additional issues in case of unprotected concurrent calls to deposit and extract at the same time, for example as they both update the number of used slots we may lose consistency for that value - This may manifest in a number of ways in the program behaviour. Sometimes the program can even seem to work fine. - As a result concurrency issues can be extremely hard to reproduce and debug in large programs and it's important to get your locking strategy right from the start --- # Omit the Locks and ... 
```c /* BUGGY version of deposit, without locks */ void deposit(bounded_buffer *b, int message) { while (b->count == b->max_elements); b->buffer[b->in_index] = message; b->in_index = (b->in_index + 1) % b->max_elements; b->count++; } /* BUGGY version of extract, without locks */ int extract(bounded_buffer *b) { while (!(b->count)); int message = b->buffer[b->out_index]; b->out_index = (b->out_index + 1) % b->max_elements; b->count--; return message; } ``` .codelink[
`08a-locks-barriers/lock-buggy.c`
] ??? - Let's see what happens when we omit the locks - We remove the lock and unlock operations, and replace the blocking checks with spinning loops --- # Summary - Threads need to sync up and access shared data - Need for **synchronisation mechanisms such as locks and barriers** - Parts of the program where a shared resource may be accessed by multiple threads at the same time, including at least a write access, are **critical sections** - Locks enforce serial and atomic execution of these critical sections to avoid **race conditions** - Next lecture: **condition variables**, another mechanism useful for event signalling