COMP26020 Programming Languages and Paradigms Part 1: C Programming


Case Study: High Performance Computing

1 / 6

C/C++ in HPC

C/C++ are extensively used in High Performance Computing (HPC) because of their speed

  • Due to many reasons, including the fact that they give the programmer control over the data memory layout
2 / 6
  • The languages C and C++ are extensively used in HPC because they are fast
  • This speed is due to many reasons
  • In contrast to higher-level programming languages such as Python, depicted on the slide
  • First, these languages are close to the hardware: there are few layers to traverse when calling system functions
  • For Python your script needs to run through the interpreter; for Java it is the same thing with the JVM, and so on
  • These additional layers introduce significant overhead
  • Moreover, with C or C++ there is no garbage collection introducing nondeterministic delays at runtime
  • There is also no bounds checking slowing down memory accesses
  • It is very easy to integrate with assembly for optimisations
  • And last but not least, they give the programmer control over the data memory layout

Controlling Memory Layout

  • For performance reasons it is very important to fit as much of the data set as possible in the cache
3 / 6
  • Speaking about memory, here is a quick recap of a subset of the memory hierarchy present on all computers
  • The CPU has local memory in the form of registers; there are very few of these, between 10 and 30, and they are extremely fast to access: data present in registers is immediately available for computations
  • On the other side we have the main memory, which stores most of the program's data because it is large, but bringing data from memory to registers for computations can take hundreds of CPU clock cycles
  • So we have an intermediate layer named the cache: a relatively small amount of fast memory (faster than the RAM) that holds a subset of the program's data
  • Data is loaded into the cache from memory on demand when requested by the CPU, and the unit of transfer between memory and the cache is a cache line, generally 64 contiguous bytes, to exploit the principle of locality
  • Now, to get the best performance possible for a program -- and this is the goal in HPC -- it is important to fit as much as possible of the program's data in the cache (a small access-pattern sketch follows these notes)
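
To make the cache-line point concrete, here is a small experiment of my own (a sketch, not part of the course files) that reads the same large array twice with the same total number of accesses: once sequentially, so that each 64-byte cache line fetched from memory serves 16 consecutive int reads, and once with a 16-int stride, so that every read lands on a different cache line. It uses gettimeofday for timing like the course examples; on most machines the strided pass is several times slower.

#include <stdio.h>
#include <sys/time.h>

#define N (16 * 1024 * 1024)  /* 16M ints = 64 MB, larger than typical last-level caches */
#define STRIDE 16             /* 16 ints * 4 bytes = 64 bytes = one cache line */

static int data[N];

static double seconds(struct timeval *a, struct timeval *b) {
    return (b->tv_sec - a->tv_sec) + (b->tv_usec - a->tv_usec) / 1e6;
}

int main(void) {
    struct timeval start, stop;
    long sum = 0;

    /* Touch every element once so the array is backed by real physical pages */
    for (int i = 0; i < N; i++)
        data[i] = i;

    /* Sequential pass: consecutive reads share cache lines, so each fetched line is fully used */
    gettimeofday(&start, NULL);
    for (int i = 0; i < N; i++)
        sum += data[i];
    gettimeofday(&stop, NULL);
    printf("sequential: %.3f s\n", seconds(&start, &stop));

    /* Strided pass: same number of reads, but each read touches a different cache line,
       so only 4 of every 64 fetched bytes are used before the line is evicted */
    gettimeofday(&start, NULL);
    for (int offset = 0; offset < STRIDE; offset++)
        for (int i = offset; i < N; i += STRIDE)
            sum += data[i];
    gettimeofday(&stop, NULL);
    printf("strided:    %.3f s\n", seconds(&start, &stop));

    printf("checksum: %ld\n", sum); /* prevents the compiler from removing the loops */
    return 0;
}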

Controlling Memory Layout

#include <stdio.h>      /* printf */
#include <stdlib.h>     /* rand */
#include <string.h>     /* memcpy */
#include <sys/time.h>   /* gettimeofday, timersub */

typedef struct {
    char c[60];
    int i;
    double d;
} my_struct;

#define N 100000000
my_struct array[N];

int main(int argc, char **argv) {
    struct timeval start, stop, res;
    my_struct s;

    gettimeofday(&start, NULL);
    /* Randomly access N elements */
    for (int i = 0; i < N; i++)
        memcpy(&s, &array[rand() % N], sizeof(my_struct));
    gettimeofday(&stop, NULL);

    timersub(&stop, &start, &res);
    printf("%ld.%06ld\n", res.tv_sec, res.tv_usec);
    return 0;
}

20-hpc-case-study/original.c

  • Struct size: 60 + 4 (int) + 8 (double) = 72 bytes

    • Larger than, and not a multiple of, the cache line size (64 bytes)
    • Most objects in the array will require fetching 2 cache lines from main memory
4 / 6
  • Let's take an example
  • We have a program that works with a large array of data structures
  • The data structure contains a string member, an int member and a double member
  • What the program does is simple: in a relatively long loop it uses memcpy to read an element of the array at a random index
  • We use gettimeofday to measure the execution time of the loop
  • This program is extremely memory intensive: there is not much computation, and it is really bound by the memory copy operations
  • And it turns out that this program is not very cache friendly
  • If we look at the size of one element of the array, that is, sizeof(my_struct), we can see that it's 60 + 4 + 8 = 72 bytes (the small check after these notes confirms this)
  • This is slightly larger than a cache line, which is 64 bytes
  • So in memory the array is laid out as shown on the slide, and most memcpy calls in the program's inner loop will require fetching 2 cache lines from memory
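
As a quick way to confirm the 72-byte figure and see exactly where each member sits, the following standalone check (my own addition, not one of the course files) prints sizeof(my_struct) and the member offsets using offsetof from <stddef.h>.

#include <stdio.h>
#include <stddef.h> /* offsetof */

typedef struct {
    char c[60];
    int i;
    double d;
} my_struct;

int main(void) {
    /* Expected on a typical 64-bit build: c at offset 0, i at 60, d at 64, total size 72.
       No padding is needed: offset 64 already satisfies the 8-byte alignment of the
       double, and 72 is a multiple of 8. */
    printf("sizeof(my_struct) = %zu bytes\n", sizeof(my_struct));
    printf("offsetof(c) = %zu, offsetof(i) = %zu, offsetof(d) = %zu\n",
           offsetof(my_struct, c), offsetof(my_struct, i), offsetof(my_struct, d));
    return 0;
}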

Controlling Memory Layout

typedef struct {
    char c[52]; /* down from 60: 52 + 4 + 8 == 64 bytes, i.e. exactly one cache line */
    int i;
    double d;
} my_struct;

my_struct array[N] __attribute__ ((aligned(64))); /* force alignment of the array itself */

/* ... */

20-hpc-case-study/optimised.c

  • About 25% faster!
  • How much is it on your computer? Check your CPU's cache line size with cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size (a sketch querying it from C follows the notes below)
5 / 6
  • We can fix that very easily
  • First, we reduce the size of the struct to be equal to that of a cache line
  • I reduce the size of the string a bit so that one struct is exactly 64 bytes
  • Then I use the __attribute__ keyword to make sure that the array is aligned on a cache line boundary
  • In effect this ensures that each element of the array is fully contained in a single cache line
  • On my computer this speeds up the program by about 25%
  • We can do that in C and C++ because we have a large degree of control over the program's memory layout; in other words, we can control the placement and sizes of data structures in memory
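
If you prefer to check the cache line size and the array's alignment from inside the program rather than through /sys, a sketch along the following lines works on Linux with glibc; note that _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, so this is an assumption about the platform rather than portable C.

#include <stdio.h>
#include <stdint.h>
#include <unistd.h> /* sysconf */

typedef struct {
    char c[52];
    int i;
    double d;
} my_struct;

#define N 1000 /* small N, just for the check */
my_struct array[N] __attribute__ ((aligned(64)));

int main(void) {
    /* glibc extension: may return -1 or 0 on systems that do not expose this value */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 data cache line size: %ld bytes\n", line);
    printf("sizeof(my_struct): %zu bytes\n", sizeof(my_struct));
    printf("array aligned on a 64-byte boundary: %s\n",
           ((uintptr_t)array % 64 == 0) ? "yes" : "no");
    return 0;
}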

Summary

  • C/C++ extensively used in HPC because of their speed
    • Run close to the hardware
    • No additional software layers
    • No runtime overhead
    • Integrates well with assembly
    • Control of the memory layout

Feedback form: https://bit.ly/3yz2jzh

6 / 6
  • And that's it for this video, to recap
  • C and C++ are used a lot in HPC because they are fast
  • It is due to several reasons listed on the slide, and in particular to the fact that they give the programmer a high degree of control over the program's memory layout
  • In the next video we will see another interesting use case for C and C++, which are very useful in low-level userspace systems software: the implementation of the C standard library itself
