class: center, middle
background-image: url(include/title-background.svg)

# COMP35112 Chip Multiprocessors
# .white[Introduction 2]
.white[Pierre Olivier]

???
- Hello everyone
- We have seen that due to several factors, in particular the clock frequency hitting a plateau in the mid-2000s, single-core CPU performance could not really increase anymore
- As a result, CPU manufacturers started to integrate multiple cores on a single chip, creating chip multiprocessors, or multicores
- In this video, we'll discuss how to exploit the parallelism offered by these CPUs, and also give an overview of this course unit

---

# How to Use Multiple Cores?
???
- Now, how to leverage multiple cores?
- Of course we can run 2 separate programs, each in its own process, on two different cores in parallel
- This is fine, and these two programs can be old software originally written for a single-core processor: they don't need to be ported
- But the real difficulty is when we want increased performance for a single application
- So we need a collection of execution flows, such as threads, all working together to solve a given problem

---

# .small[Instruction- vs. Thread-Level Parallelism]

- **Instruction-Level Parallelism** (ILP)
  - Compiler/hardware **automatically** parallelises a sequential stream of instructions → **limited**
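A minimal sketch (hypothetical Java statements standing in for machine instructions, not from the lecture) of why dependencies limit ILP:

```java
static int ilpExample() {
    int a = 1, b = 2, d = 3, e = 4;
    int c = a + b;   // independent of the next statement...
    int f = d + e;   // ...so the hardware can execute both in parallel
    return c + f;    // depends on both c and f: must wait for them to complete
}
```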
???
- A way to exploit parallelism is through the use of ILP, instruction-level parallelism
- How does it work?
- Imagine we have a sequential program composed of a series of instructions to be executed one after the other
- A multicore-aware compiler can take this program and determine which instructions can be executed in parallel on multiple cores
- Sometimes instructions can even be executed out of order, as long as there are no dependencies between them
- This is very practical because we just have to recompile the program, not modify it, so there is no effort required from the programmer
- However, the amount of parallelism we can extract with such techniques is unfortunately very limited, due to the dependencies between instructions

---

# .small[Instruction- vs. Thread-Level Parallelism]

- **Thread-Level Parallelism**
  - The programmer divides the program into (long) sequences of instructions run in parallel
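In its simplest form (a hypothetical sketch, in the Java style used in the labs), the programmer creates the execution flows explicitly:

```java
public class TlpSketch {
    public static void main(String[] args) throws InterruptedException {
        // The programmer explicitly creates two execution flows (threads)
        Thread t1 = new Thread(() -> System.out.println("first half of the work"));
        Thread t2 = new Thread(() -> System.out.println("second half of the work"));
        t1.start();
        t2.start();   // both threads now run, ideally on different cores
        t1.join();
        t2.join();    // wait for both to finish before continuing
    }
}
```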
???
- So another way to exploit parallelism is to rewrite your application with parallelism in mind
- We divide the work to be done into several execution flows named threads, which run in parallel on multiple cores
- Now, because of scheduling, we don't have control over the order in which most of the threads' operations are realised, so for the computations to be meaningful the threads need to synchronise somehow

---

# Thread-Level Parallelism

- We'll program with **threads** in the labs
- We will divide programs into concurrent sections executing as threads on different cores
- Main issue: **data sharing between threads**
  - *What happens if a thread reads a variable currently being written by another thread?*
  - Brings the need for **synchronisation** (keyword `synchronized` in Java)

???
- In the lab exercises we will program with threads
- We'll divide the task to be achieved into sections, executed concurrently by threads, hopefully on different cores
- One of the main difficulties of multithreaded programming is how to share data between threads
- If we have a shared variable, because we don't directly control the scheduling of threads, what happens if one thread running on one core writes to the variable at the same time another thread is reading this variable?
- This brings the need for synchronisation, for threads to agree on a particular order, or at least on some basic rules, for accessing shared data without stepping on each other's toes
- A small code sketch of this is shown after the next slide

---

# Thread-Level Parallelism

- A set of threads belonging to the same program can run on a single core: the total program's execution time is the sum of each thread's execution time
- On multicores, threads can run in parallel: ideally the execution time is `sequential execution time / number of parallel threads`
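A minimal, hypothetical Java sketch (not the lab code; thread count and array size are made up) showing both ideas: the work is divided among threads, and the shared total is only updated inside a `synchronized` method so that concurrent updates cannot corrupt it:

```java
public class ParallelSum {
    private long total = 0;

    // synchronized: only one thread at a time can update the shared total
    private synchronized void add(long partial) { total += partial; }

    public static void main(String[] args) throws InterruptedException {
        int nThreads = 4;                        // made-up thread count
        long[] data = new long[1_000_000];
        java.util.Arrays.fill(data, 1);

        ParallelSum sum = new ParallelSum();
        Thread[] workers = new Thread[nThreads];
        int chunk = data.length / nThreads;

        for (int i = 0; i < nThreads; i++) {
            int start = i * chunk;
            int end = (i == nThreads - 1) ? data.length : start + chunk;
            workers[i] = new Thread(() -> {
                long partial = 0;
                for (int j = start; j < end; j++) partial += data[j];
                sum.add(partial);                // synchronised access to shared data
            });
            workers[i].start();                  // threads run in parallel, ideally on different cores
        }
        for (Thread t : workers) t.join();       // wait for all threads to finish
        System.out.println("total = " + sum.total);
    }
}
```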
???
- A set of threads belonging to the same program can run on one or multiple cores
- On one core their execution is interleaved: they are time-sharing the core
- In that case the total execution time is simply the sum of each thread's execution time
- On multiple cores, ideally, the total execution time would be the time of the sequential version of the program divided by the number of threads
- Of course, many things can prevent this ideal from being reached, for example when some threads have to wait for other threads to finish, or when there are not enough cores to run all threads in parallel
- In the example, the total execution time on two cores is not exactly the time on one core divided by 2

--

- ILP is limited but TLP is "general purpose" and can be used to generate a large amount of parallelism
  - At the cost of programmer effort + the program must be suitable

???
- Contrary to ILP, which is limited, TLP allows us to exploit much more of the parallelism brought by multiple cores
- However, that means rewriting the application to use threads, so there is an effort needed from the programmer
- The application itself also needs to be suitable for being divided into several execution flows

---

# Data Parallelism

- Exploit structured parallelism contained in specific programs
- Data parallelism is usually associated with computation on a multi-dimensional array
- **Many array computations perform the same or very similar computation on all elements**
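As a sketch (hypothetical code, matrix size made up), a matrix-matrix addition: every element of the result can be computed independently, here with one thread per row:

```java
public class MatrixAdd {
    public static void main(String[] args) throws InterruptedException {
        int n = 4;                               // made-up matrix size
        double[][] a = new double[n][n], b = new double[n][n], c = new double[n][n];

        // c[i][j] = a[i][j] + b[i][j]: every element is independent,
        // so each row is computed by its own thread
        Thread[] workers = new Thread[n];
        for (int i = 0; i < n; i++) {
            int row = i;
            workers[i] = new Thread(() -> {
                for (int j = 0; j < n; j++) c[row][j] = a[row][j] + b[row][j];
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();       // wait for all rows to be computed
    }
}
```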
???
- The data manipulated by some specific programs, as well as the way it is manipulated, is very well suited to parallelism
- For example, consider applications doing computations on single- and multi-dimensional arrays
- Here we have a matrix-matrix addition, where each element of the result matrix can be computed in parallel
- This is called data parallelism

---

# Data Parallelism Examples

- **General**
  - Matrix multiply (used heavily in CNNs, for example)
  - Fourier transform
- **Graphics**
  - Anti-aliasing
  - Texture mapping
  - Illumination and shading
- **Differential Equations**
  - Weather/climate forecasting
  - Engineering simulation (and “Physics” in Games)
  - Financial modelling

???
- Here are a few examples of application domains that tend to exhibit data parallelism
- We have matrix or array operations, very common in AI applications, and the Fourier transform
- A lot of graphical computations like filters, anti-aliasing, texture mapping, light and shadow computations
- As well as differential-equation applications like weather forecasting, engineering simulations, and financial modelling

---

# Complexity of Parallelism

- Parallel programming is generally considered to be difficult, but this depends a lot on the program structure
???
- Now, overall, parallel programming is considered to be relatively difficult, but it really depends on the structure of the program you develop
- If you have a scenario where all the parallel sections, let's say you are using threads, where all the threads are doing the same thing and they don't share much data, then it can be quite straightforward
- On the other hand, in situations where all the threads are doing something different, or when they share a lot of data in write mode, when they communicate a lot and need to synchronise, then such programs can be quite hard to reason about, to develop and to debug

---

# Chip Multiprocessor Considerations

- **How should we build the hardware?**
  - How are cores connected?
  - How are they connected to memory?
  - Should they reflect particular parallel programming patterns (e.g. data parallelism)?
  - Simple vs. complex cores?
  - General vs. Special Purpose (e.g. graphics processors)?

???
- There are two main categories of issues with chip multiprocessors that we will cover in this course unit
- First, in terms of hardware, we will consider questions like
- When we have multiple cores, how are they connected together?
- How are they connected to memory?
- Are they supposed to be used for particular programming patterns such as data parallelism? Or multithreading?
- If we want to build a multicore, should we use a lot of simple cores or just a few complex cores?
- Should the processor be general purpose, or specialised towards particular workloads?

--

- **How should we program them?**
  - Extended ‘conventional’ languages?
  - Domain-specific languages?
  - Totally new approaches?

???
- And then we have questions regarding software: how to program these processors?
- Can we use a conventional programming language? Maybe an extended version?
- Should we rather use a specific language, or a totally new approach?

---

# Overview of Lectures

- Thread-based programming, thread synchronisation
- Cache coherency in homogeneous shared memory multiprocessors
- Hardware support for thread synchronisation
- Operating system support for threads, concurrency within the kernel
- Alternative programming views
- Speculation and transactional memory
- Heterogeneous processors/cores/programs
- Radical approaches (e.g. dataflow programming)

???
- So after this introduction, we'll see some general notions about the world of parallelism
- Then we will cover multithreaded programming
- Shared memory multiprocessors and cache coherency
- Synchronisation primitives between threads, and the low-level hardware mechanisms enabling these concepts
- We will also cover programming approaches other than threads
- Hardware optimisations like speculation and transactional memory, heterogeneous processors, cores and programs
- GPUs, how they are programmed
- And some more radical approaches like dataflow programming

---

# Summary

- Single-core performance has been plateauing but we can still pack more transistors on a single chip
- Put multiple compute units on a single integrated circuit: **chip multiprocessors**
- Has important implications
  - Hardware: what CPUs to use, how are they connected, do they share memory?
  - Software: how to program these things?
- Interesting read: http://www.gotw.ca/publications/concurrency-ddj.htm

???
- To summarise, in this introduction we have seen that single-core performance has been plateauing for quite some time now, and as a result manufacturers put more cores on processor chips
- This has important implications, in terms of both hardware and software
- I encourage you to read this article from 2005, entitled "The Free Lunch Is Over", which provides further information about many of the topics I covered here
- And that's it; in the next video you will learn more about the world of parallelism