COMP26020 Programming Languages and Paradigms Part 1: C Programming


Case Study: High Performance Computing

1 / 6

C/C++ in HPC

C/C++ are extensively used in High Performance Computing (HPC) because of their speed

  • Due to many reasons, including the fact that they give the programmer control over the data memory layout
2 / 6
  • The languages C and C++ are extensively used in HPC because they are fast
  • This speed is due to many reasons
  • In contrast to higher-level programming languages such as Python, depicted on the slide
  • First, these languages are close to the hardware: there are few layers to traverse when calling system functions
  • For Python your script needs to run through the interpreter; for Java it is the same thing with the JVM, and so on
  • These additional layers introduce significant overhead
  • Moreover, with C or C++ there is no garbage collection introducing nondeterministic delays at runtime
  • There is also no bounds checking slowing down memory accesses
  • It is very easy to integrate with assembly for optimisations
  • And last but not least, they give the programmer control over the data memory layout

Controlling Memory Layout

  • For performance reasons it is very important to fit as much of the data set as possible in the cache
3 / 6
  • Speaking about memory, here is a quick recap of a subset of the memory hierarchy present on all computers
  • The CPU has local memory in the form of registers; there are very few of these, between 10 and 30, and they are extremely fast to access: data present in registers is immediately available for computations
  • On the other side we have the main memory, which stores most of the program's data because it is large, but bringing data from memory to registers for computations can take hundreds of CPU clock cycles
  • So we have an intermediate layer named the cache: a relatively small amount of fast memory (faster than the RAM) that holds a subset of the program's data
  • Data is loaded into the cache from memory on demand when requested by the CPU, and the unit of transfer between memory and the cache is a cache line, generally 64 contiguous bytes, to exploit the principle of locality
  • Now, to get the best performance possible for a program -- and this is the goal in HPC -- it is important to fit as much as possible of the program's data in the cache (a small access-pattern sketch follows these notes)
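
To make the cache-line point concrete, here is a small experiment of my own (a sketch, not part of the course files) that reads the same large array twice with the same total number of accesses: once sequentially, so that each 64-byte cache line fetched from memory serves 16 consecutive int reads, and once with a 16-int stride, so that every read lands on a different cache line. It uses gettimeofday for timing like the course examples; on most machines the strided pass is several times slower.

#include <stdio.h>
#include <sys/time.h>

#define N (16 * 1024 * 1024)  /* 16M ints = 64 MB, larger than typical last-level caches */
#define STRIDE 16             /* 16 ints * 4 bytes = 64 bytes = one cache line */

static int data[N];

static double seconds(struct timeval *a, struct timeval *b) {
    return (b->tv_sec - a->tv_sec) + (b->tv_usec - a->tv_usec) / 1e6;
}

int main(void) {
    struct timeval start, stop;
    long sum = 0;

    /* Touch every element once so the array is backed by real physical pages */
    for (int i = 0; i < N; i++)
        data[i] = i;

    /* Sequential pass: consecutive reads share cache lines, so each fetched line is fully used */
    gettimeofday(&start, NULL);
    for (int i = 0; i < N; i++)
        sum += data[i];
    gettimeofday(&stop, NULL);
    printf("sequential: %.3f s\n", seconds(&start, &stop));

    /* Strided pass: same number of reads, but each read touches a different cache line,
       so only 4 of every 64 fetched bytes are used before the line is evicted */
    gettimeofday(&start, NULL);
    for (int offset = 0; offset < STRIDE; offset++)
        for (int i = offset; i < N; i += STRIDE)
            sum += data[i];
    gettimeofday(&stop, NULL);
    printf("strided:    %.3f s\n", seconds(&start, &stop));

    printf("checksum: %ld\n", sum); /* prevents the compiler from removing the loops */
    return 0;
}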

Controlling Memory Layout

#include <stdio.h>      /* printf */
#include <stdlib.h>     /* rand */
#include <string.h>     /* memcpy */
#include <sys/time.h>   /* gettimeofday, timersub */

typedef struct {
    char c[60];
    int i;
    double d;
} my_struct;

#define N 100000000
my_struct array[N];

int main(int argc, char **argv) {
    struct timeval start, stop, res;
    my_struct s;

    gettimeofday(&start, NULL);
    /* Randomly access N elements */
    for (int i = 0; i < N; i++)
        memcpy(&s, &array[rand() % N], sizeof(my_struct));
    gettimeofday(&stop, NULL);

    timersub(&stop, &start, &res);
    printf("%ld.%06ld\n", res.tv_sec, res.tv_usec);
    return 0;
}

20-hpc-case-study/original.c

  • Struct size: 60 + 4 (int) + 8 (double) = 72 bytes

    • Larger than, and not a multiple of, the cache line size (64 bytes)
    • Most objects in the array will require fetching 2 cache lines from main memory
4 / 6
  • Let's take an example
  • We have a program that works with a large array of data structures
  • The data structure contains a string member, an int member and a double member
  • What the program does is simple: in a relatively long loop it uses memcpy to read an element of the array at a random index
  • We use gettimeofday to measure the execution time of the loop
  • This program is extremely memory intensive: there is not much computation, and it is really bound by the memory copy operations
  • And it turns out that this program is not very cache friendly
  • If we look at the size of one element of the array, that is, sizeof(my_struct), we can see that it's 60 + 4 + 8 = 72 bytes (the small check after these notes confirms this)
  • This is slightly larger than a cache line, which is 64 bytes
  • So in memory the array is laid out as shown on the slide, and most memcpy calls in the program's inner loop will require fetching 2 cache lines from memory
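
As a quick way to confirm the 72-byte figure and see exactly where each member sits, the following standalone check (my own addition, not one of the course files) prints sizeof(my_struct) and the member offsets using offsetof from <stddef.h>.

#include <stdio.h>
#include <stddef.h> /* offsetof */

typedef struct {
    char c[60];
    int i;
    double d;
} my_struct;

int main(void) {
    /* Expected on a typical 64-bit build: c at offset 0, i at 60, d at 64, total size 72.
       No padding is needed: offset 64 already satisfies the 8-byte alignment of the
       double, and 72 is a multiple of 8. */
    printf("sizeof(my_struct) = %zu bytes\n", sizeof(my_struct));
    printf("offsetof(c) = %zu, offsetof(i) = %zu, offsetof(d) = %zu\n",
           offsetof(my_struct, c), offsetof(my_struct, i), offsetof(my_struct, d));
    return 0;
}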

Controlling Memory Layout

typedef struct {
    char c[52]; /* down from 60: 52 + 4 + 8 == 64 bytes, i.e. exactly one cache line */
    int i;
    double d;
} my_struct;

my_struct array[N] __attribute__ ((aligned(64))); /* force alignment of the array itself */

/* ... */

20-hpc-case-study/optimised.c

  • About 25% faster!
  • How much is it on your computer? Check your CPU's cache line size with cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size (a sketch querying it from C follows the notes below)
5 / 6
  • We can fix that very easily
  • First, we reduce the size of the struct to be equal to that of a cache line
  • I reduce the size of the string a bit so that one struct is exactly 64 bytes
  • Then I use the __attribute__ keyword to make sure that the array is aligned on a cache line boundary
  • In effect this ensures that each element of the array is fully contained in a single cache line
  • On my computer this speeds up the program by about 25%
  • We can do that in C and C++ because we have a large degree of control over the program's memory layout; in other words, we can control the placement and sizes of data structures in memory
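
If you prefer to check the cache line size and the array's alignment from inside the program rather than through /sys, a sketch along the following lines works on Linux with glibc; note that _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, so this is an assumption about the platform rather than portable C.

#include <stdio.h>
#include <stdint.h>
#include <unistd.h> /* sysconf */

typedef struct {
    char c[52];
    int i;
    double d;
} my_struct;

#define N 1000 /* small N, just for the check */
my_struct array[N] __attribute__ ((aligned(64)));

int main(void) {
    /* glibc extension: may return -1 or 0 on systems that do not expose this value */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 data cache line size: %ld bytes\n", line);
    printf("sizeof(my_struct): %zu bytes\n", sizeof(my_struct));
    printf("array aligned on a 64-byte boundary: %s\n",
           ((uintptr_t)array % 64 == 0) ? "yes" : "no");
    return 0;
}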

Summary

  • C/C++ extensively used in HPC because of their speed
    • Run close to the hardware
    • No additional software layers
    • No runtime overhead
    • Integrates well with assembly
    • Control of the memory layout

Feedback form: https://bit.ly/3yz2jzh

6 / 6
  • And that's it for this video, to recap
  • C and C++ are used a lot in HPC because they are fast
  • It is due to several reasons listed on the slide, and in particular to the fact that they give the programmer a high degree of control over the program's memory layout
  • In the next video we will see another interesting use case for C and C++, which are very useful in low-level userspace systems software: the implementation of the C standard library itself
