class: center, middle background-image: url(include/title-background.svg) # COMP35112 Chip Multiprocessors
# .white[Parallel Programming Using Shared Memory]
.white[Pierre Olivier]

---

class: inverse, center, middle

# Threads

---

name: proc

# Processes

- A program executing on the CPU runs as a **process**
- With virtual memory, the process gets the illusion it has access to 100% of the memory
- **Address space**

???

- When a program executes on the CPU, on top of the operating system, it runs as an entity named a process
- Using virtual memory, the OS gives to the process the illusion that it can access all the memory, so each process in the system is free to use any address in the lower half of the 64-bit address space, the upper half being reserved for the OS
- The area of memory that the process can address is named the address space

---

template: proc

- Two programs: 2 different processes, i.e. 2 different address spaces
???

- Two processes have two separate address spaces
- For example, here processes A and B both have some data at address 42; however, that data is different

---

template: proc

- Two programs: 2 different processes, i.e. 2 different address spaces
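A minimal C sketch of this point (not part of the original slides; the variable and printed values are purely illustrative): after `fork()`, parent and child run in separate address spaces, so the same global variable, at the same virtual address, ends up holding different values in the two processes.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int x = 42;                 // one copy per process after fork()

int main(void) {
    if (fork() == 0) {      // child process
        x = 100;            // modifies only the child's copy
        printf("child:  x = %d at %p\n", x, (void *)&x);
        return 0;
    }
    wait(NULL);             // parent waits for the child to finish
    printf("parent: x = %d at %p\n", x, (void *)&x);   // still 42
    return 0;
}
```

Both processes print the same virtual address for `x`, yet see different values: each address space maps that address to its own physical memory.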
???

- This is because the virtual address `0x42` is mapped, for each process, to a different physical location by that process's page table
- Indeed each process has its own page table.
- To put things simply, 1 process == 1 virtual address space == 1 page table.
- The OS also makes sure that processes cannot address anything outside of their address space
- To leverage parallelism, two processes, i.e. two programs, can run on two different cores of a multicore processor
- However, how can we use parallelism within the same program?

--

Can run 2 independent processes in parallel on a multicore, but **how to parallelise a single program**?

---

name: threads

# Threads

- A thread is a sequence of instructions executed on a CPU core
- A process consists of one or more threads
- Each process in the system has an address space
- **All threads in the same process share the same address space**

???

- A thread is a flow of control executing a program
- It is a sequence of instructions executed on a CPU core
- The nice thing with threads is that a process can consist of one thread, but also of multiple threads
- In other words, for a single program we can have several execution flows running on different cores and sharing a single address space

---

template: threads
--- template: threads
???

- In our example, process A has 3 threads, and they all see the green and orange data.
- Process B has 4 threads, and they see the same red and yellow data

---

template: threads

.center[**Threads communicate using shared memory**]

???

- Seeing the same address space is very convenient for communication, which can be achieved through global variables or pointers to anywhere in the address space

---

# Threads

- Can program with threads in a plethora of languages:
  - **C/C++**/Fortran **using the POSIX threads library** (Pthreads)
  - Java
  - Python
  - Go
  - Haskell
  - Rust
  - etc.

--

- Thread support in higher-level languages generally builds upon pthreads
  - Most of these languages are implemented in C/C++ or at least contain some C/C++ elements

???

- Threads can be created in many languages, including C/Fortran using the POSIX threads (Pthread) library
- But also Java, as we will see in detail in this course
- And many other languages

---

# Threads in C/C++ with Pthread

.leftcol[
- **POSIX thread library**
- Exposes a standardised C/C++ API to create and manage multiple threads within a process
- For a standard program, the OS automatically creates a single thread
]
.rightcol[
] --- # Threads in C/C++ with Pthread .leftcol[ - Use `pthread_create()` to create and launch a thread - Indicate as parameters: - Which function the thread should run - Optionally what should be passed as parameter to this function ] .rightcol[
] --- # Threads in C/C++ with Pthread .leftcol[ - Use `pthread_create()` to create and launch a thread - Indicate as parameters: - Which function the thread should run - Optionally what should be passed as parameter to this function - `pthread_exit()` to have the calling thread exit ] .rightcol[
] --- # Threads in C/C++ with Pthread .leftcol[ - Pthread -- POSIX thread library - Use `pthread_create()` to create and launch a thread - Indicate as parameters: - Which function the thread should run - Optionally what should be passed as parameter to this function - `pthread_exit()` to have the calling thread exit - `pthread_join()` to wait for another thread to finish ] .rightcol[
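For reference, the prototypes of these three calls, as declared in `<pthread.h>`:

```c
#include <pthread.h>

// create a new thread running start_routine(arg); attr may be NULL
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

// terminate the calling thread, making retval available to a joiner
void pthread_exit(void *retval);

// wait for thread to finish; its return value is stored in *retval if non-NULL
int pthread_join(pthread_t thread, void **retval);
```

`pthread_create()` and `pthread_join()` return 0 on success and an error number otherwise.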
] --- # Threads in C/C++ with Pthread - A good chunk of this course, including labs 1 and 2, will focus on shared memory programming in C/C++ with pthreads - `man pthread_*`
and Google “pthreads” for lots of documentation
- In particular see the Oracle Multithreaded Programming Guide: https://bit.ly/3FGt3k2
- To read them locally, first install the relevant development man pages:
```shell
sudo apt install manpages-dev manpages-posix manpages-posix-dev
```

???

---

# Threads in C/C++ with Pthread

.leftcol[
```c
// Compile with:
// gcc pthread.c -o pthread -lpthread
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
// pthread header

#define NOWORKERS 5

// Function executed by all threads
void *thread_fn(void *arg) {
  int id = (int)(long)arg;
  printf("Thread %d running\n", id);
  pthread_exit(NULL); // exit
  // never reached
}
```
]
.rightcol[
```c
int main(void) {
  // Each thread is controlled through a
  // pthread_t data structure
  pthread_t workers[NOWORKERS];

  // Create and launch the threads
  for(int i=0; i<NOWORKERS; i++)
    pthread_create(&workers[i], NULL, thread_fn,
                   (void *)(long)i);

  // Wait for all the threads to finish
  for(int i=0; i<NOWORKERS; i++)
    pthread_join(workers[i], NULL);

  printf("All done\n");
  return 0;
}
```
.codelink[
`03-shared-memory-programming/pthread.c`
] ] --- # Threads in C/C++ with Pthread
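A minimal sketch (not the course's `pthread.c`; all names here are illustrative) of two things the previous example does not show: passing several arguments to the thread function through a struct, and collecting a return value with `pthread_join()`.

```c
#include <stdio.h>
#include <pthread.h>

struct job { int id; int start; int len; };  // illustrative argument struct

void *worker(void *arg) {
    struct job *j = arg;                     // recover the typed arguments
    long sum = 0;
    for (int i = j->start; i < j->start + j->len; i++)
        sum += i;
    printf("Thread %d summed %d integers\n", j->id, j->len);
    return (void *)sum;                      // handed back to pthread_join()
}

int main(void) {
    pthread_t t;
    struct job j = { .id = 0, .start = 0, .len = 1000 };
    void *result;

    pthread_create(&t, NULL, worker, &j);    // &j must stay valid until the join
    pthread_join(t, &result);                // wait and fetch the return value
    printf("Result: %ld\n", (long)result);
    return 0;
}
```

Casting the result through `void *` works for small integers; larger results are usually returned through memory reachable from the argument struct.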
--- # Threads in C/C++ with Pthread
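Because all the threads of a process share its address space, they can exchange data simply by reading and writing shared variables. A minimal sketch, assuming one shared global array in which each thread fills its own slot (no synchronisation is needed here precisely because the slots are distinct; coordinating accesses to the *same* location is covered later in the course):

```c
#include <stdio.h>
#include <pthread.h>

#define NOWORKERS 5

int results[NOWORKERS];            // shared: visible to every thread

void *worker(void *arg) {
    int id = (int)(long)arg;
    results[id] = id * id;         // each thread writes a different element
    return NULL;
}

int main(void) {
    pthread_t workers[NOWORKERS];

    for (long i = 0; i < NOWORKERS; i++)
        pthread_create(&workers[i], NULL, worker, (void *)i);
    for (int i = 0; i < NOWORKERS; i++)
        pthread_join(workers[i], NULL);

    for (int i = 0; i < NOWORKERS; i++)      // main reads what the threads wrote
        printf("results[%d] = %d\n", i, results[i]);
    return 0;
}
```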
---

# Threads in Java

```java
class MyThread extends Thread { // create a class that inherits from Thread
  int id;
  MyThread(int id) { this.id = id; }

  // override the method run() to define the code executed by the thread:
  public void run() {
    System.out.println("Thread " + id + " running");
  }
}

class Main {
  public static void main(String[] args) {
    int NOWORKERS = 5;
    MyThread[] threads = new MyThread[NOWORKERS];

    for (int i = 0; i < NOWORKERS; i++)
      threads[i] = new MyThread(i);

    for (int i = 0; i < NOWORKERS; i++)
      threads[i].start(); // start executing the threads

    for (int i = 0; i < NOWORKERS; i++)
      try {
        threads[i].join();
      } catch (InterruptedException e) { /* do nothing */ }

    System.out.println("All done");
  }
}
// compile and launch with:
// javac java-thread.java && java Main
```
.codelink[
`03-shared-memory-programming/java-thread.java`
]

???

- Let's see an example
- We have the MyThread class inheriting from Thread
- In the constructor we just initialized an identifier integer member variable
- And we implement the run method, it just prints the fact that the thread is running
- In the main function we create an array of objects from this class
- Each is given a different id
- Then we launch them all in a loop with start, and we wait for each to finish with join

---

# Threads in Python

```python
#!/usr/bin/python3
import threading

NOWORKERS=5

def print_hello(id):
    print("Thread " + str(id) + " running")

threads = []
for i in range(NOWORKERS):
    thread = threading.Thread(target=print_hello, args=(i,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

print("All done.")
```
.codelink[
`03-shared-memory-programming/python.py`
]

---

# Threads in Rust

```rust
use std::thread;

const NOWORKERS: u32 = 5;

fn print_hello(id: u32) {
    println!("Thread {} running", id);
}

fn main() {
    let mut handles = vec![];
    for i in 0..NOWORKERS {
        let handle = thread::spawn(move || { // `move ||` transfers
            print_hello(i);                  // ownership of i inside
        });                                  // the closure
        handles.push(handle);
    }

    for handle in handles {
        // unwrap panics if the thread's execution returns an error
        handle.join().unwrap();
    }

    println!("All done.");
}
// With Rust toolchain installed, compile with:
// rustc
``` .codelink[
`03-shared-memory-programming/rust.rs`
]

---

# Examples Output

.leftcol[
For all examples:
```bash
Thread 1 running
Thread 0 running
Thread 2 running
Thread 4 running
Thread 3 running
All done
```
- **No control over the order of execution!**
- The OS scheduler decides, it's nondeterministic
]

--

.rightcol[
A possible scheduling scenario:
] -- .rightcol[ Another one on 1 core:
]

???

- This is an example of the output we can get, for both the Java and the C examples
- If we run the program multiple times we get different orders of execution
- This is normal, we have no control over this as it is the OS scheduler that decides
- Of course this can be problematic in some situations where we really need to set up a proper order of execution, for example when a thread needs to accomplish a certain task before another thread can start doing its job
- Later in the course we will see how to use synchronisation mechanisms to manage this

---

class: inverse, center, middle

# Data Parallelism
#### Dividing Work between Threads

---

# Data Parallelism

- Simple form of parallelism commonly found in many applications
- Common in computational science applications
- Divide computation into (nearly) equal sized chunks (sketch below)
- Works best when there are **no data dependencies** between chunks

???

- As we saw previously, data parallelism is an easy form of parallelism found in many applications such as computational science
- There is data parallelism when we have some data that is structured in such a way that the operations to be performed on it can easily be parallelised
- This is very suitable for parallelism, the goal is to divide the computations into chunks, computed in parallel

--
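For example, the element-wise sum of two arrays, C[i] = A[i] + B[i], has no dependency between iterations, so the index range can simply be split between threads. A minimal pthread sketch, assuming an even split (the array size, thread count and chunking scheme are illustrative):

```c
#include <pthread.h>

#define N 1024
#define NTHREADS 4                 // assumes NTHREADS divides N evenly

int A[N], B[N], C[N];

void *add_chunk(void *arg) {
    int t = (int)(long)arg;
    int chunk = N / NTHREADS;
    for (int i = t * chunk; i < (t + 1) * chunk; i++)
        C[i] = A[i] + B[i];        // iterations are fully independent
    return NULL;
}

int main(void) {
    pthread_t th[NTHREADS];
    /* A and B would be initialised here */
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, add_chunk, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);
    return 0;
}
```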
- Exploited in multithreading but also at the instruction level in vector/array (SIMD) processors: CPUs (SSE, AVX), and in GPGPUs

???

- You can see an example on the slide in the form of a sum between two arrays
- We sum each element of the same index in A and B and put the result in C
- For example we can have a different thread realise each addition, or fewer threads each taking care of a subset of the additions
- This kind of parallelism is exploited in vector and array processors
- General purpose CPU architectures have vector instruction extensions allowing things like applying the same operation on all elements of an array
- For example Intel x86-64 used to have SSE and now has AVX
- From a very high level perspective, GPUs also work in that way

---

name: mat

# Data Parallelism Example

- **Matrix multiplication** is a good example
- No dependencies between the computation of each member of the result matrix

???

- Let's see a second example in more detail, matrix-matrix multiplication
- We'll use a square matrix for the sake of simplicity

---

template: mat
???

- Remember that with matrix multiplication, each element of the result matrix is computed as the sum of the products of the elements of one row of the first matrix with the elements of one column of the second matrix
- So how can we parallelise this operation?

---

template: mat

- *n*<sup>2</sup> parallel threads (1 per result element; sketch below)
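A minimal pthread sketch of this first strategy (illustrative only, and shown with a tiny *n*, since creating *n*<sup>2</sup> threads is rarely practical): one thread per result element, with the element's indices passed through a heap-allocated struct so that every thread gets its own copy.

```c
#include <stdlib.h>
#include <pthread.h>

#define N 4                        // one thread per element: N*N threads
int A[N][N], B[N][N], C[N][N];

struct cell { int i, j; };         // which result element this thread computes

void *one_cell(void *arg) {
    struct cell *c = arg;
    for (int k = 0; k < N; k++)
        C[c->i][c->j] += A[c->i][k] * B[k][c->j];
    free(c);
    return NULL;
}

int main(void) {
    pthread_t th[N][N];
    /* A and B would be initialised here */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            struct cell *c = malloc(sizeof *c);
            c->i = i; c->j = j;
            pthread_create(&th[i][j], NULL, one_cell, c);
        }
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            pthread_join(th[i][j], NULL);
    return 0;
}
```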
???

- First we can have 1 thread per element of the result matrix, each thread computing the value of the element in question
- With an n x n matrix that gives us n squared threads

---

template: mat

- *n*<sup>2</sup> parallel threads (1 per result element)
- *n* parallel threads (1 per row/column of result)
???

- If we don't have a lot of cores we may want to create fewer threads: it's generally not very efficient to have more threads than cores
- So another strategy is to have a thread per row or per column of the result matrix

---

template: mat

- *n*<sup>2</sup> parallel threads (1 per result element)
- *n* parallel threads (1 per row/column of result)
- *p* parallel threads, each computing *q* rows/columns of the result, where *pq = n* (sketch below)
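A minimal pthread sketch of this last strategy (illustrative; `P` and the contiguous row-block partitioning are choices, with *q* = N/P rows per thread):

```c
#include <pthread.h>

#define N 1000
#define P 4                                  // number of threads, q = N/P rows each
int A[N][N], B[N][N], C[N][N];

void *rows(void *arg) {
    int t = (int)(long)arg;
    int lo = t * (N / P);
    int hi = (t == P - 1) ? N : lo + N / P;  // last thread picks up any remainder
    for (int i = lo; i < hi; i++)            // this thread's block of rows
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
    return NULL;
}

int main(void) {
    pthread_t th[P];
    /* A and B would be initialised here */
    for (long t = 0; t < P; t++)
        pthread_create(&th[t], NULL, rows, (void *)t);
    for (int t = 0; t < P; t++)
        pthread_join(th[t], NULL);
    return 0;
}
```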
??? - Finally, we can also create an arbitrary number of threads by dividing the number of elements of the result matrix by the number of threads that we want and have each thread take care of a subset of the elements --- class: inverse, center, middle # Implicit vs. Explicit Parallelism --- # Implicit vs. Explicit Parallelism - When parallelising a program an important consideration is that of the programmer's **effort**: - **What is the best strategy according to the situation?** - Does the programmer need to be an expert to make that choice? - **How to indicate the chosen strategy in the code?** - If we already have a sequential version of the program, how much code refactoring & new code implementation is needed? --- # NxN Parallel Matrix Multiplication .leftcol[ ```c #define N 1000 int A[N][N]; int B[N][N]; int C[N][N]; int main(int argc, char **argv) { /* init matrices here */ for(int i=0; i
<N; i++)
    for(int j=0; j<N; j++)
      for(int k=0; k<N; k++)
        C[i][j] += A[i][k] * B[k][j];

  return 0;
}
```
.codelink[
`03-shared-memory-programming/matmult.c`
] ] .rightcol[
- Basic sequential code for matrix multiplication
]

---

# NxN Parallel Matrix Multiplication

.leftcol[
```c
*#include <omp.h>
#define N 1000
int A[N][N];
int B[N][N];
int C[N][N];

int main(int argc, char **argv) {
  /* init matrices here */

*#pragma omp parallel
  { /* First loop parallelised */
*#pragma omp for
    for(int i=0; i<N; i++)
      for(int j=0; j<N; j++)
        for(int k=0; k<N; k++)
          C[i][j] += A[i][k] * B[k][j];
  }

  return 0;
}
```
.codelink[
`03-shared-memory-programming/openmp.c`
] ] .rightcol[
- Automatic parallelisation using OpenMP
- Very low programmer effort
- More on OpenMP later
- Compile and run: `gcc openmp.c -o prog -fopenmp && OMP_NUM_THREADS=8 ./prog`

**Parallelising with pthread would require much more code rewrite**
]

???

- So still on our matrix-matrix multiplication example, here is an example of a cool thing we can do in C with a library called OpenMP
- We can parallelise the matrix multiplication operation in a very simple way by adding a simple annotation in the code
- All the iterations of the first outer loop will be done in parallel by different threads
- The framework takes care of managing the threads, and also of spawning the ideal number of threads according to the machine executing the program

---

# Explicit vs. Implicit Parallelism

- **Explicit parallelism**
  - The programmer explicitly spells out what should be done in parallel/sequence
  - Code modifications needed if sequential program already available
  - Examples: reworking a program to use threads or processes for parallel execution

???

- This notion of programmer's effort leads us to the concepts of explicit vs. implicit parallelism
- With explicit parallelism, the programmer has to write more or less code to indicate what should be done in parallel and what should be done in sequence
- This ranges from writing all the thread code manually to just putting some annotations in a sequential program (OpenMP)

--

- **Implicit parallelism**
  - No effort from the programmer, system works out parallelism by itself
  - No code modification over an already existing sequential program
  - Done for example by some languages able to make strong assumptions about data sharing (e.g. pure functions), or with ILP

???

- And with implicit parallelism, the system works out the parallelism fully automatically
- Some languages (e.g. functional languages) have different views of computation which allow a default assumption of parallelism (no updatable shared state between functions)

---

# Example Code for Implicit Parallelism

Fortran 77 sequential array addition:
```fortran
DO I = 1, N
  DO J = 1, N
    C(J, I) = A(J, I) + B(J, I)
  END DO
END DO
```

In Fortran 90 with implicit parallelism:
```fortran
C = A + B
```

---

# Automatic Parallelisation

- In an ideal world, all parallelism would be implicit: the compiler and/or the hardware would **take an ordinary sequential program and derive the parallelism automatically**

???

- So of course implicit parallelism is the best type of parallelism because the programmer does not have to do anything
- If we have a sequential program it would be great if the compiler could automatically extract all the parallelism!

--

- Manufacturers of pre-multicore parallel machines invested considerably in such technology
- Can do quite well if the programs are simple enough but **dependency analysis can be very hard**
- Must be conservative
  - If you cannot be certain that parallel version computes **correct** result, can't parallelise

???

- There was a lot of effort invested in such technologies at the time of the first parallel machines, before the multicore era
- It works well on small programs but in the general case, analysing dependencies in order to define what can be done in parallel and what needs to be done sequentially becomes very hard
- And of course the compiler needs to be conservative not to break the program, i.e.
if it is not 100% sure that two steps can be run in parallel they need to run sequentially
- So overall performance gains through implicit parallelism are quite limited, and to get major gains one needs to go the explicit way

---

# Example Problems for Parallelisation

- Can the compiler automatically parallelise the execution of these loops' iterations?
- I.e. run all or some iterations in parallel

```c
for (int i = 0 ; i < n-3 ; i++) {
  a[i] = a[i+3] + b[i] ;  // at iteration i, read dependency with index i+3
}

for (int i = 5 ; i < n ; i++) {
  a[i] += a[i-5] * 2 ;    // at iteration i, read dependency with index i-5
}

for (int i = 0 ; i < n ; i++) {
  a[i] = a[i + j] + 1 ;   // at iteration i, read dependency with index ???
}
```

???

- Now here are a few examples of dependencies that a compiler may face when trying to extract parallelism automatically
- They all regard parallelising all or some iterations of a loop
- In the first loop, at each iteration we have a read dependency on the element 3 slots after the one being computed
- In the second loop we have another read dependency, this time on the element 5 slots before the one computed at each iteration
- Then in the last case, still a read dependency, we don't know if the slot read is before or after the one being computed
- This is quite important as we are going to see

---

# Automatic Parallelisation

.leftcol[
```c
for (int i=0; i<n-3; i++) {
  a[i] = a[i+3] + b[i] ;
}
```
]
.rightcol[
]

--

Can parallelise all iterations by making a new version of array `a`

```c
parallel_for (int i=0; i<n-3; i++) {
  new_a[i] = a[i+3] + b[i] ;  // reads use the original a, writes go to the new version
}
```
.rightcol[
] --- # Automatic Parallelisation .leftcol[ ```c for (int i = 5 ; i < n ; i++) { a[i] += a[i-5] * 2 ; } ``` - Previous trick does not work here: **this time we read values computed by the loop itself** - At each iteration i we read what was computed by the loop at iteration i-5 - **Solution: limit parallelism to 5** ] .rightcol[
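One way to express this limit concretely (a sketch, not the course's code, using OpenMP as shown earlier rather than the illustrative `parallel_for` notation): process the iterations in sequential blocks of 5, running the 5 iterations of each block in parallel, so that `a[i-5]` has always been produced by an already-completed block.

```c
#define BLK 5   /* dependency distance: at most 5-way parallelism */

void limited(int *a, int n) {
    for (int b = 1; BLK * b < n; b++) {        /* blocks run one after the other  */
        #pragma omp parallel for
        for (int k = 0; k < BLK; k++) {        /* <= 5 iterations run in parallel */
            int i = BLK * b + k;
            if (i < n)
                a[i] += a[i-BLK] * 2;          /* a[i-BLK] comes from block b-1,  */
        }                                      /* or is an original value (b==1)  */
    }
}
```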
]

???

- Now with a dependency such as this one, the trick from the previous slide won't work!
- Because what we read at iteration i is supposed to have been written 5 iterations before, we can't rely on a read-only copy
- Also, parallelising all iterations will break the program
- We observe that at a given time, there is no dependency between sets of 5 iterations, such as the five first ones.
- The solution here is to limit the parallelism to 5

---

# Shared Memory

- For this lecture we assumed that threads **share memory**
  - I.e. they all have access to a common address space
- Multiple threads can read and write in the same memory location (variable, buffer, etc.) through
  - Global variables
  - Pointers and references

--

- No shared memory?
  - Need to communicate via **message passing**
  - Close to a distributed system
  - We'll talk briefly about MPI (Message Passing Interface) later in the course unit

???

- One last thing
- Everything in this lecture has been on the basis that threads share memory
- In other words they can all access the same memory, and they will all see the same data at a given address
- Language scoping still applies, so threads can declare local variables that are private, but shared stuff is accessible from every thread
- If you don’t have shared memory, threads have to communicate via messages – more like a distributed system. We will talk about MPI in another lecture

---

# Summary and Next Lecture

- **In shared memory systems, a parallel program can use threads**
  - Threads communicate implicitly by reading and writing to a common address space
- **A simple form of parallelism: data parallelism**
  - Apply similar operations on chunks of a data set
  - Efficient when there is no data dependency
- **Automatic parallelisation can be limited by such dependencies**
- Shared memory is a practical form of implicit communication
  - It's great that multicores today share memory, right?
  - Next lecture will discuss why this isn’t as simple as it sounds!