class: center, middle
background-image: url(include/title-background.svg)

# COMP35112 Chip Multiprocessors
# .white[Shared Memory Multiprocessors]

.white[Pierre Olivier]

???
- Hello everyone
- In the previous lecture we introduced how to program with threads that share memory for communication
- In this video, we are going to talk about how the hardware is set up to ensure that threads running on different cores can share memory by seeing a common address space
- In particular, we will introduce the issue of cache coherency on multicore processors

---
# Multiprocessor Structure

.leftcol[
]
.rightcol[
.medium[
- Most general-purpose multiprocessors are shared memory
  - **easier to program**, however the hardware is **more complex**
- Let's study the scalability issues of cache coherence systems
  - Focusing on bus-based ones
]]

???
- The majority of general-purpose multiprocessors are shared memory
- In this model all the cores have a unified view of memory: for example, here they can all read and write the data at address x in a coherent way
- This is in contrast to distributed memory systems, where each core or processor has its own local memory and does not necessarily have direct and coherent access to other processors' memory
- Shared memory multiprocessors are thus dominant because they are easier to program
- However, shared memory hardware is usually more complex, which leads to a particular problem: bus-based cache coherency systems do not really scale beyond a certain number of cores
- Intel, AMD, and others have developed solutions such as Intel QuickPath and UltraPath Interconnect, as well as AMD's coherent HyperTransport
- In this video we will introduce the problem of cache coherency in shared memory multiprocessor systems, focusing first on bus-based systems
- And in the next lecture we'll see how it is managed
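- As a concrete software-side view of this shared address space, here is a minimal sketch (the thread name and values are made up) where a store by one POSIX thread is observed by another through the single shared variable x:

```c
#include <pthread.h>
#include <stdio.h>

int x = 0;  /* one variable, one address, visible from every core */

static void *writer(void *arg) {
    (void)arg;
    x = 42;  /* store performed on whichever core runs this thread */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    pthread_join(&t, NULL);  /* join orders the store before the load below */
    printf("x = %d\n", x);   /* prints 42, whichever cores were involved */
    return 0;
}
```

- Build with `gcc -pthread`; it is the hardware's job, discussed in this lecture, to make this single coherent view of x hold across cores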
---
# Caches

- A high performance uniprocessor:
  - Cache: **fast** and **small** local memory holding recently used data and instructions

???
- Recall that a high performance uniprocessor has the following structure
- Main memory is far too slow to keep up with modern processor speeds: it can take up to hundreds of cycles to access, versus the CPU registers, which are accessed near-instantaneously
- So another type of on-chip memory is introduced: the cache
- It is much faster than main memory, being accessed in a few cycles
- It is also expensive, so its size is relatively small; the cache is thus used to hold a subset of the program's data and instructions
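- A back-of-the-envelope average memory access time (AMAT) calculation shows why this helps; the numbers below are illustrative assumptions, not measurements:
  - AMAT = hit time + miss rate × miss penalty
  - With a 2-cycle cache hit, a 2% miss rate and a 200-cycle miss penalty: AMAT = 2 + 0.02 × 200 = 6 cycles, against roughly 200 cycles if every access went to main memory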
---
# Caches

- A high performance uniprocessor:
  - There may be multiple levels (L1, L2, L3)

???
- The cache can have multiple levels: generally, in multiprocessors, we have level 1 and sometimes level 2 caches that are local to each core, and a shared last-level cache
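- On Linux with glibc you can query these per-level sizes (a sketch relying on glibc-specific `sysconf` extensions, so it will not build everywhere):

```c
#include <stdio.h>
#include <unistd.h>  /* _SC_LEVEL*_CACHE_SIZE are glibc extensions */

int main(void) {
    /* Sizes in bytes; sysconf() returns 0 or -1 if a level is absent. */
    printf("L1d: %ld\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2:  %ld\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3:  %ld\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
```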
---
# Caches

- A high performance uniprocessor:

???
- If an entire program's data set can fit in the cache, the CPU can run at full speed
- However, this is rarely the case for modern applications, and new data/instructions needed by the program have to be fetched from memory (on each cache miss)
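- A rough way to see the cost of misses from software is to stream over a cache-resident buffer and a much larger one while doing the same total work (a sketch; the buffer sizes are assumptions about a typical CPU, and the exact gap varies a lot with the machine and its prefetchers):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sum over a buffer of n longs, reps times, and return elapsed seconds. */
static double sweep(const long *buf, size_t n, size_t reps) {
    volatile long sum = 0;  /* volatile keeps the loop from being optimised away */
    clock_t t0 = clock();
    for (size_t r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i++)
            sum += buf[i];
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    const size_t total = 1UL << 28;                   /* same work for both runs */
    size_t small = (16 * 1024) / sizeof(long);        /* ~16 KiB: cache-resident */
    size_t big = (64UL * 1024 * 1024) / sizeof(long); /* ~64 MiB: mostly misses  */
    long *a = calloc(big, sizeof(long));
    if (!a) return 1;
    printf("small buffer: %.2fs\n", sweep(a, small, total / small));
    printf("large buffer: %.2fs\n", sweep(a, big, total / big));
    free(a);
    return 0;
}
```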
---
# Caches

- A high performance uniprocessor:

???
- Also, newly written data in the cache must eventually be written back to main memory
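- A toy model of that bookkeeping (purely illustrative, not how hardware is built): a cached line carries a dirty bit, and eviction triggers a write-back only if the line was modified:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy single-line "cache" illustrating the dirty bit / write-back idea. */
struct line { int tag; int data; bool valid; bool dirty; };

static int memory[1024];  /* stand-in for main memory */

static void cache_evict(struct line *l) {
    if (l->valid && l->dirty) {  /* write back only if the line was modified */
        memory[l->tag] = l->data;
        printf("write-back: mem[%d] = %d\n", l->tag, l->data);
    }
    l->valid = false;
}

static void cache_write(struct line *l, int addr, int value) {
    if (!l->valid || l->tag != addr) {  /* miss: evict the old line, refill */
        cache_evict(l);
        l->tag = addr; l->data = memory[addr]; l->valid = true;
    }
    l->data = value;
    l->dirty = true;  /* main memory is now stale until the write-back */
}

int main(void) {
    struct line l = {0, 0, false, false};
    cache_write(&l, 3, 42);  /* write hits the cache; memory[3] is still 0 */
    cache_write(&l, 7, 99);  /* conflict forces the write-back of line 3 */
    cache_evict(&l);         /* flush: memory[7] becomes 99 */
    return 0;
}
```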
---
# The Cache Coherency Problem

- What happens with multiprocessors?

???
- With just one CPU there is no problem: data just written to the cache can be read back correctly, whether or not it has been written to memory
- But things get more complicated when we have multiple processors
- Indeed, several CPUs may share data, i.e. one can write a value that another needs to read
- How does that work with the cache?

---
# The Cache Coherency Problem

- What happens with multiprocessors?
???
- So consider the following situation
- We have a data item x in RAM
- CPU A first reads it, then updates it in its own cache to x'
- Later, CPU B wishes to read the same data
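- To make the failure concrete, here is a toy simulation of this exact scenario (made-up values; per-core cached copies are modelled as plain variables): with write-back caches, CPU A's update stays in its cache, so CPU B reads a stale value from RAM:

```c
#include <stdio.h>

/* Toy model: one shared memory word plus one private cached copy per CPU. */
int mem_x = 5;        /* x in RAM */
int cache_a, cache_b; /* each CPU's cached copy of x */

int main(void) {
    cache_a = mem_x;  /* CPU A reads x: its cache now holds 5 */
    cache_a = 7;      /* CPU A writes x' = 7; with write-back, RAM is untouched */
    cache_b = mem_x;  /* CPU B reads x and gets it from RAM... */
    printf("A sees %d, B sees %d\n", cache_a, cache_b); /* A sees 7, B sees 5 */
    return 0;
}
```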
---
# The Cache Coherency Problem

.leftcol[
- Apparently obvious solution: 'write through' policy?
  - Every write is updated in memory
  - Involves a lot of memory accesses, **negating cache benefits**
]
.rightcol[
]

???
- An apparently obvious solution would be to ensure that every write is updated in memory
- That's a write-through cache policy
- However, this would mean that every time we write, we need to write to memory
- And every time we read, we also need to fetch from memory, in case the data was updated
- This is very slow and negates the cache benefits, so it's not a good idea

---
# The Cache Coherency Problem

- **Cache-to-cache communication**?
- How to avoid separate cache copies, i.e. **how to maintain cache coherency?**
- It gets complex, we need to develop a model
  - Topic of the next lecture

???
- So how can we overcome these issues?
- Can we communicate cache-to-cache rather than always go through memory?
- In other words, when a new value is written in one cache, all copies located in other caches would need to be either updated or invalidated
- Another issue is: what if two processors try to write to the same location?
- In other words, how do we avoid having two separate cache copies?
- This is what we refer to as cache coherency
- So things are getting complex and we need to develop a model
- How to efficiently achieve cache coherency in a shared memory multiprocessor is the topic of the next lecture
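- As a small preview of where that model goes, here is a toy extension of the earlier sketch (illustrative only; real protocols track states per cache line) in which a write invalidates the other cache's copy, so the stale read becomes a miss that fetches the up-to-date value:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy invalidation scheme: a write marks the other cached copy invalid,
   and a read that misses is serviced with the most recent value. */
int mem_x = 5;
struct copy { int val; bool valid; } a = {0, false}, b = {0, false};

static void write_a(int v) {
    a.val = v; a.valid = true;
    b.valid = false;  /* invalidate the other cache's copy */
}

static int read_b(void) {
    if (!b.valid) {   /* miss: refetch, cache-to-cache if A holds the line */
        b.val = a.valid ? a.val : mem_x;
        b.valid = true;
    }
    return b.val;
}

int main(void) {
    b.val = mem_x; b.valid = true;  /* CPU B cached x = 5 earlier */
    write_a(7);                     /* CPU A writes x' = 7, invalidating B */
    printf("B now reads %d\n", read_b()); /* 7, not the stale 5 */
    return 0;
}
```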