class: center, middle
background-image: url(include/title-background.svg)

# COMP35112 Chip Multiprocessors
# .white[Message Passing Interface]

.white[Pierre Olivier]

???

- Hello everyone
- In this video I will talk briefly about MPI
- It's a standard defining a set of operations for writing distributed applications that can be massively parallel

---

# Threads vs. OpenMP and MPI

- Multithreading as we saw it is not a panacea
  1. Shared-memory limits us to the machine boundary
  2. Lots of effort to divide parallel work
- To address these problems, two standard parallel programming frameworks:
  - **MPI: Message Passing Interface** (addresses 1., this lecture)
  - **OpenMP: Open MultiProcessing** (addresses 2., seen in a future lecture)

???

- Before jumping into MPI, let's speak a bit about multithreading
- Until now, most of the programming work we have been doing involves threads
- A first issue with the shared memory model adopted by multithreading is that the number of cores in your machine is a hard limit: you cannot use the computing resources of several machines in parallel
- A second problem when developing with POSIX threads is that dividing the work to be done in parallel between threads can be a lot of effort, especially if you have a lot of parallel regions
- In many cases you have an already existing sequential application and you want to parallelise it; it can take a lot of work to do so with threads
- To address these problems, we'll see two standard parallel programming frameworks that are in widespread use:
- MPI, which stands for Message Passing Interface
- It is a technology addressing the first problem, letting you run a parallel application over multiple machines
- And OpenMP, Open MultiProcessing
- It addresses the second issue, by reducing the amount of engineering effort needed to parallelise a shared memory application
- We'll cover MPI here, and in a future lecture we'll talk about OpenMP

---

# MPI

- **Message Passing Interface**:
  - A **standard**: set of basic functions used to create portable parallel applications
- Used to program a wide range of parallel architectures:
  - Shared memory machines, **supercomputers**, **clusters**
- Relies on **message passing** for communication
- **It is the most widely used API for parallel programming**
- Downside: an existing application needs to be redesigned with MPI in mind to be ported

???

- So what is MPI?
- It is a standard that describes a set of basic functions that can be used to create parallel applications
- These functions are available in libraries for C and Fortran
- Such applications are portable between the various parallel systems that support the standard
- So MPI is used to program many types of systems, not only shared memory machines as we have seen until now
- But also custom massively parallel machines, also called manycores or supercomputers
- MPI can be used to run a single application over multiple machines, a cluster of computers
- An important thing to note is that, for communication between parallel execution units, MPI relies on message passing and not shared memory
- This allows an MPI application to exploit parallelism beyond a single machine boundary
- For these reasons MPI is the most widely used API for parallel programming
- Still, programming in MPI has a cost: your application must be designed from scratch with message passing in mind, so porting existing applications is costly

---

# Messages, Processes, Communicators

.leftcol[
]
.rightcol[
- Messages: rank of sending/receiving processes + tag for type
- Process groups + communicators (e.g. `MPI_COMM_WORLD`) to restrict communications
]

???

- In MPI each parallel unit of execution is called a process
- Each process is identified by its **rank**, which is an integer
- Processes exchange **messages**, characterised by:
  - the rank of the sending and receiving processes
  - as well as an integer tag that is used to define message "types"
- Processes are grouped; they are initially within a single group, which can be split later
- Messages are sent over what is called a **communicator**, linked to a process group, in order to restrict the communication to a subset of the processes
- This enables code modularity: for example, when you are using a library that has its own set of communicating processes, you don't want your application's messages to be sent to or received from these particular processes
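- To make the idea of communicators concrete, here is a small sketch (it is not one of the course's `13-mpi` examples) that splits `MPI_COMM_WORLD` into two sub-communicators with `MPI_Comm_split`, one grouping the even ranks and one grouping the odd ranks

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Processes passing the same "colour" end up in the same sub-communicator;
       here even and odd ranks are separated into two groups */
    MPI_Comm sub_comm;
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);

    int sub_rank, sub_size;
    MPI_Comm_rank(sub_comm, &sub_rank);
    MPI_Comm_size(sub_comm, &sub_size);

    /* Messages sent over sub_comm only involve processes of the same group */
    printf("world rank %d is rank %d out of %d in its sub-communicator\n",
           world_rank, sub_rank, sub_size);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
}
```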
---

# MPI Hello World

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL); /* Initialise MPI */

    /* Retrieve the world size i.e. the total number of processes */
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Retrieve the current process rank (id) */
    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Retrieve the machine name */
    char machine_name[MPI_MAX_PROCESSOR_NAME];
    int nlen;
    MPI_Get_processor_name(machine_name, &nlen);

    printf("Hello world from process of rank %d, out of %d processes on machine"
           " %s\n", my_rank, world_size, machine_name);

    /* Exit */
    MPI_Finalize();
}
```

.codelink[
`13-mpi/mpi-hello.c`
]

???

- Here is a minimal MPI application in C
- This code will be executed by each process of our application
- We start by initialising the MPI library
- Then we get the total number of processes running for this application with MPI_Comm_size
- And we get the identifier of the current process, its rank
- Then we get the name of the machine into a character array, and finally we print all this information before exiting

---

# MPI Hello World

- To install the MPICH implementation on Ubuntu/Debian:

```bash
sudo apt install mpich
```

- To compile the aforementioned example, use the provided gcc wrapper:

```bash
mpicc listing1.c -o listing1
```

- To run:

```bash
mpirun listing1      # Default: will run a single process
mpirun -n 2 listing1 # Manually specify the number of processes to run
```

???

- To compile this application you need the MPI library installed on your system
- You have some instructions on the slide to do this for Debian or Ubuntu systems
- You can compile using mpicc
- And to execute, use mpirun
- By default, depending on the version of the library, it will run only a single process, or a number of processes equal to your number of cores, but you can also specify that manually

---

# Blocking `send` and `recv`

```c
MPI_Send(void* data, int cnt, MPI_Datatype t, int dst, int tag, MPI_Comm com);

MPI_Recv(void* data, int cnt, MPI_Datatype t, int src, int tag, MPI_Comm com,
         MPI_Status* st);
```

- `data`: pointer to the data to send/receive
- `cnt`: number of items to send/receive
- `t`: type of data
  - e.g. `MPI_INT`, `MPI_DOUBLE`, etc.
- `dst`: rank of the receiving process
- `src`: rank of the sending process
- `tag`: integer, in `recv` can be `MPI_ANY_TAG`
- `com`: the communicator, e.g. `MPI_COMM_WORLD`
- `st`: status information

???

- To send and receive messages we have the MPI_Send and MPI_Recv functions
- They take the following parameters
- data is a pointer to a buffer containing the data to send or receive
- As you can see it's a void star, so it can point to a data structure of any type
- cnt is the number of items to send or receive, in case you want to transmit an array
- t is the type of the data exchanged; there are macros for integers, floating point numbers, and so on
- dst is the rank of the receiving process for send
- And src is the rank of the sending process for receive
- Note that there are macros to specify receiving from any process or with any tag
- com is the communicator describing the subset of processes that can participate in this communication
- And st is some status information that can be checked after the receive completes
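- As an illustration of these two calls (a sketch, separate from the course's example files), in the program below rank 0 sends an integer to rank 1, which inspects the status object to find the sender's rank and tag before replying; it assumes it is run with at least 2 processes

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if(rank == 0) {
        int value = 42;
        /* Send one MPI_INT to rank 1, with tag 7 */
        MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
        /* Wait for the reply, ignoring the status information */
        MPI_Recv(&value, 1, MPI_INT, 1, MPI_ANY_TAG, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("[0] got back %d\n", value);
    } else if(rank == 1) {
        int value;
        MPI_Status st;
        /* Accept a message from any source with any tag, then inspect the status */
        MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        printf("[1] received %d from rank %d with tag %d\n",
               value, st.MPI_SOURCE, st.MPI_TAG);
        value++;
        MPI_Send(&value, 1, MPI_INT, st.MPI_SOURCE, 7, MPI_COMM_WORLD);
    }

    MPI_Finalize();
}
```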
---

# MPI `send`/`recv` Example

- A counter "bounces" randomly between processes, gets incremented at each hop

???

- So let's study a simple example of an application that creates N processes
- At the beginning, the first process, of rank 0, creates a counter and initialises it to 0
- Then it sends this value to another random process, let's say 2
- 2 receives the value, increments it by one, and sends the result to another random process
- And rinse and repeat

---

# MPI `send`/`recv` Example

```c
#define BOUNCE_LIMIT 20

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);

    int my_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    if (world_size < 2) {
        fprintf(stderr, "World needs to be >= 2 processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if(my_rank == 0) {
        int payload = 0;
        int target_rank = 0;
        while(target_rank == my_rank)
            target_rank = rand()%world_size;

        printf("[%d] sent int payload %d to %d\n", my_rank, payload, target_rank);
        MPI_Send(&payload, 1, MPI_INT, target_rank, 0, MPI_COMM_WORLD);
    }
```

.codelink[
`13-mpi/send-recv.c`
]

The rest of the code follows on the next slide

???

- This is the first part of the code
- Remember, this code will be executed by all processes of our application
- We start by getting the total number of processes and our rank
- Then the following block of code is only executed by one process, the one with rank 0
- We initialise the counter and pick a target process at random
- The while loop makes sure that 0 does not send the message to itself
- Then we send the message: we pass a pointer to the payload, specify that we send one element, that it is an integer, give the rank of the target process, and specify the MPI_COMM_WORLD communicator which corresponds to all the processes in the system

---

```c
    while(1) {
        int payload;
        MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        if(payload == -1) {
            printf("[%d] received stop signal, exiting\n", my_rank);
            break;
        }

        if(payload == BOUNCE_LIMIT) {
            int stop_payload = -1;
            printf("[%d] payload reached %d, broadcasting stop signal\n",
                   my_rank, payload);
            for(int i=0; i<world_size; i++)
                if(i != my_rank)
                    MPI_Send(&stop_payload, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
            break;
        }

        payload++;
        int target_rank = my_rank;
        while(target_rank == my_rank)
            target_rank = rand()%world_size;

        printf("[%d] sent int payload %d to %d\n", my_rank, payload, target_rank);
        MPI_Send(&payload, 1, MPI_INT, target_rank, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
}
```

.codelink[
`13-mpi/send-recv.c`
]

???

- The rest of the code is executed by all processes
- We enter an infinite loop
- We start by waiting for a message reception with `MPI_Recv`
- We specify the address of a buffer that will receive the payload value, and that it will be a single int element, coming from any process
- When the payload reaches a certain limit we want to exit the application, so we broadcast a special payload of -1 to all processes
- Upon reception of -1 as the payload, each process exits the infinite loop
- In normal operation, a process receiving the payload increments it and chooses a random process to send it to

---

# Collective Operations

- `MPI_Barrier`: barrier synchronisation
- `MPI_Bcast`: broadcast
- `MPI_Scatter`: broadcast parts of a single piece of data to different processes
- `MPI_Gather`: inverse of ‘scatter’
- `MPI_Reduce`: a reduction operation
- etc.

???

- In addition to send and receive you have various collective operations
- They allow you to easily create barriers
- Or to broadcast a message to multiple processes
- While broadcast sends the same piece of data to multiple processes, scatter breaks a piece of data into parts and sends each part to a different process
- Gather is the inverse of scatter and aggregates elements sent from multiple processes into a single piece of data
- We also have Reduce, which combines values sent by different processes with a reduction operation, for example an addition
- And so on
- Using such operations, many programs make little use of the more primitive Send and Recv operations
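- As a rough sketch of how these look in practice (again, not one of the course's example files), the program below has rank 0 broadcast a value to every process, then sums the ranks of all processes into a single result on rank 0 with a reduction

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rank 0 chooses a value, MPI_Bcast copies it to every other process */
    int value = (rank == 0) ? 100 : 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Every process contributes its own rank, rank 0 receives the sum */
    int contribution = rank, sum = 0;
    MPI_Reduce(&contribution, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if(rank == 0)
        printf("broadcast value: %d, sum of ranks 0..%d: %d\n",
               value, size - 1, sum);

    MPI_Finalize();
}
```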
---

# Other MPI Features

- Asynchronous (non-blocking) communication: `MPI_Isend()` and `MPI_Irecv()`
  - Combined with `MPI_Wait()` and/or `MPI_Test()`
- Persistent communication
- Modes of communication
  - Standard
  - Synchronous
  - Buffered
- “One-sided” messaging
  - `MPI_Put()`, `MPI_Get()`

???

- MPI also has other features, including non-blocking communications with Isend and Irecv
- These functions do not wait for the request to complete and return directly, so in general they are used in combination with MPI_Test, to check the completion of a request, and MPI_Wait, to wait for its completion
- You also have a feature called persistent communication, which is useful to define messages that are sent repeatedly with the same arguments
- Furthermore, there are 3 communication modes:
- With the standard one, a call to send returns when the message has been sent but not necessarily yet received
- With the synchronous one, a call to send returns when the message has been received
- And with the buffered one, the send returns even before the message is sent, something that will be done later
- Finally, MPI also allows a process to directly read and write the memory of another process using get and put
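- To give an idea of the non-blocking interface, in the sketch below (not taken from the course's example files) each process posts a receive and a send that return immediately, could do other work while the messages are in flight, and then waits for both operations to complete

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process sends its rank to the next one and receives from the
       previous one, forming a ring */
    int next = (rank + 1) % size, prev = (rank + size - 1) % size;
    int out = rank, in = -1;
    MPI_Request reqs[2];

    MPI_Irecv(&in, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&out, 1, MPI_INT, next, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... useful computation could overlap with the communication here ... */

    /* Block until both operations have completed */
    MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);
    MPI_Wait(&reqs[1], MPI_STATUS_IGNORE);

    printf("[%d] received %d from rank %d\n", rank, in, prev);

    MPI_Finalize();
}
```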
---

# Summary

- MPI: standard to develop parallel applications
  - Exploits parallelism **beyond the boundaries of a single physical machine**
  - Widely used
  - Cost: significant effort to rewrite legacy serial applications with MPI
- Guide to run applications on a cluster: https://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/
- Coming up: **OpenMP**
  - A framework to parallelise shared memory applications with as little effort as possible

???

- To conclude, MPI is a widely used standard to develop parallel applications
- Contrary to shared memory programming, it allows such applications to exploit the parallelism offered by clusters, in other words to go beyond the core limit of a single physical machine
- However, this comes at a cost: when porting legacy applications, there is a significant engineering effort needed to redesign them with MPI in mind
- Also note that the examples we saw here ran multiple processes on a single machine
- If you want a basic guide on how to run an MPI application on a cluster of machines, check out the link on the slide
- Soon, we'll cover OpenMP
- It's a framework for shared memory programming, and one of its goals is to make parallelising existing applications as simple as possible