class: center, middle
background-image: url(include/title-background.svg)

# COMP35112 Chip Multiprocessors
# .white[Cache Coherence in Multiprocessors]

.white[Pierre Olivier]

---
name: single
# Single CPU Cache 101

- On-chip SRAM, access from the core much **faster** than off-chip RAM (<10 to a few tens of cycles)
- Expensive so present in **small quantity** (KBs to tens of MBs): holds the most recently accessed data/instructions
- Leverages the **spatial/temporal locality principles**, as sketched below
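To make the locality point concrete, here is a small illustrative C example (mine, not from the lecture): both loops read the same matrix, but the first walks memory sequentially and reuses every fetched cache line fully, while the second jumps a whole row between accesses.

```c
/* Illustrative sketch: spatial locality in action.
 * Both loops touch every element once, but the row-major loop uses all
 * the words of each fetched cache line before moving on, while the
 * column-major loop jumps N*sizeof(int) bytes between accesses and
 * misses in the cache far more often. */
#include <stdio.h>

#define N 1024
static int m[N][N];

int main(void) {
    long sum = 0;
    for (int i = 0; i < N; i++)      /* row-major: cache friendly */
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    for (int j = 0; j < N; j++)      /* column-major: poor locality */
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    printf("%ld\n", sum);
    return 0;
}
```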
---
template: single

---
template: single
---
template: single

---
template: single

---
template: single
---
name: issue
# Cache Coherence in Multiprocessors

.leftcol[
- Each core has a local cache
- **Cache coherence**: avoid having multiple different copies of the same data in different caches of a shared memory multiprocessor
]

---
template: issue
.rightcol[
]

---
template: issue
.rightcol[
]

--
.center[
> Need **cache-to-cache communication** for performance, to avoid involving the slow memory
]

???
- So I hope you all watched the video from last week
- It introduced the problem of cache coherence in multiprocessors
- The problem is that in a multiprocessor each core has a local cache, which may not be in sync with memory
- We need to avoid situations where two cores have copies of the same data with different values
- If we use a traditional single-core cache system, the following can happen
- At first we have the data x in memory; core A reads it, then updates it to x prime
- A does not write back to memory yet, and later B wants to read the data, so it reads x, which is not the latest version
- This of course breaks the program
- So we need to define a protocol to make sure that all caches have a coherent view of memory
- This involves cache-to-cache communication for performance reasons: we want to avoid involving memory as much as we can

---
# Coping with Multiple Cores

- A bus is attached to every cache
- **Bus snooping**: hardware attached to each core's cache
  - Observes all transactions on the bus
  - Able to modify the cache independently of the core
- This hardware can take action on seeing pertinent transactions on the bus
- Another way to look at it: **a cache can send/receive messages and data to/from other caches**, as sketched below
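As a rough illustration of this message-passing view, here is a minimal C sketch; the types and names are invented for the example, not a real hardware interface. The key point is that placing a message on the bus means every snooper observes it.

```c
/* Hypothetical sketch (illustrative names, not a real hardware API):
 * every cache controller snoops every bus transaction. */
#include <stdint.h>
#include <stdio.h>

#define NCORES 2

enum bus_msg_type { BUS_READ, BUS_INVALIDATE };   /* MSI's two messages */

struct bus_msg {
    enum bus_msg_type type;
    uint64_t line_addr;  /* cache line the message is about */
    int sender;          /* core that placed it on the bus  */
};

/* A snooper observes a transaction and may act on its own cache. */
static void snoop(int core, const struct bus_msg *m) {
    if (core == m->sender) return;  /* a core ignores its own messages */
    printf("core %d snoops %s for line 0x%llx from core %d\n", core,
           m->type == BUS_READ ? "READ" : "INVALIDATE",
           (unsigned long long)m->line_addr, m->sender);
}

/* The bus broadcasts: placing a message means every snooper sees it. */
static void bus_place(const struct bus_msg *m) {
    for (int c = 0; c < NCORES; c++) snoop(c, m);
}

int main(void) {
    struct bus_msg m = { BUS_READ, 0x1000, 0 };
    bus_place(&m);
    return 0;
}
```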
???
- In this lecture we are going to cover a simple protocol known as bus-based coherence, or bus snooping
- It works as follows
- First, all the cores are interconnected with a bus that is also linked to memory
- On each core you have some special cache management hardware
- This hardware can observe all the transactions on the bus, and it is also able to modify the cache content independently of the core
- With this hardware, when a given cache observes pertinent transactions on the bus, it can take appropriate actions
- Another way to look at this is that a cache can send messages to other caches, and receive messages from other caches

---
name: msi
# Cache States, MSI Protocol

The cache has 2 control bits for each line it contains, indicating its state

--

---
template: msi
---
template: msi
.left3col[
- **Modified** state: the cache line is valid and has been written to, but the latest values have not been updated in memory yet
- **A line can be in the modified state in at most 1 core**
]
.right3col[
]

---
template: msi
.left3col[
- **Invalid**: there may be an address match on this line but the data is not valid
- Loads/stores should not be served from this cache: the line must be fetched from memory or obtained from another cache
]
.right3col[
]

---
template: msi
.left3col[
- **Shared**: not invalid and not modified
- A valid cache entry exists and the line has the same values as main memory
- **Several caches can have the same line in the shared state**
]
.right3col[
]

--

.center[
> Modified/Shared/Invalid states and the transitions taken upon cache accesses by the core define the **MSI protocol**
]

---
## Possible States for a Given Cache Line in a Dual-Core CPU
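These rules can be checked mechanically. A minimal C sketch, assuming an illustrative 2-bit encoding of the three states: it enumerates every pair of per-core states for a line and flags as legal only those where a Modified copy is the sole valid copy, which yields exactly the combinations listed below.

```c
/* Sketch (assumed encoding): the 2 control bits per line encode M/S/I.
 * For one line in a dual-core CPU, a pair of states is coherent only
 * if a Modified copy is the sole valid copy. */
#include <stdio.h>

enum state { INVALID, SHARED, MODIFIED };  /* fits in 2 bits */
static const char *sname[] = { "I", "S", "M" };

static int legal(enum state c1, enum state c2) {
    /* M in one cache forces I in the other; M/M and M/S would let the
     * two caches hold different values for the same line. */
    if (c1 == MODIFIED) return c2 == INVALID;
    if (c2 == MODIFIED) return c1 == INVALID;
    return 1;  /* any mix of S and I is fine: valid copies match memory */
}

int main(void) {
    for (enum state a = INVALID; a <= MODIFIED; a++)
        for (enum state b = INVALID; b <= MODIFIED; b++)
            printf("%s/%s %s\n", sname[a], sname[b],
                   legal(a, b) ? "legal" : "illegal");
    return 0;
}
```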
???
- We'll present the protocol on a dual-core processor for the sake of simplicity
- For a given cache line, we have the following possible states
- A) one core has the line in the modified state, valid and not in sync with memory, and the other has the line invalid
- B) we can also have the line invalid in both caches
- C) the line can also be invalid in one cache and shared in the other, shared meaning valid and in sync with memory
- D) finally, the line can be valid and in sync with memory in both caches, in other words in the shared state in both caches

---
## Possible States for a Given Cache Line in a Dual-Core CPU
???
- Note that with the definitions we gave previously, it is not possible to have a line modified in one core and shared in the other
- Modified implies that the line content differs from what is in memory, while shared implies that it is the same as in memory: in that case the two CPUs would cache different values for the line, which breaks coherence
- Modified/modified is not possible either: the switch to the modified state follows a write operation by a core, so we would most likely end up again with 2 different values in the two caches

---
## Possible States for a Given Cache Line in a Dual-Core CPU
???
- If we include symmetry, we also have
- A'), which is the inverse of A: invalid/modified
- and C'), the inverse of C: shared/invalid

---
# State Transitions
- All of these are legal states: **let's study how reads and writes (i.e. loads and stores) on each core should affect these states**

---
# State Transitions

- The state of a cache line may be changed when the core tries to read/write data in that line
- State transitions have 3 aspects:
  - What are the **messages** sent between caches:
    - *Read* messages: a cache requests a cache line from another
    - *Invalidate* messages: a cache asks another cache to invalidate one of its cache lines
  - Is there any **access made to main memory**
  - What are the **state changes**

???
- So we have listed all the possible legal states
- Let's see now, for each state, how read and write operations on each of the two cores affect the state
- This regards 3 aspects:
- What messages are sent between cores: we'll see messages requesting a cache line, messages asking a remote core to invalidate a given line, and messages asking for a line's content as well as its invalidation
- We will also see when memory needs to be involved
- And what the state transitions are

--

- For a line in each valid state we identified, let's study what happens if the cores try to read/write data in that line, as sketched below
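A hypothetical C encoding of these three aspects, with invented names; the worked example is the read on core 2 from state (a), covered on the next slides.

```c
/* Sketch of the three aspects of a transition (illustrative names):
 * bus messages, memory accesses, state changes. */
#include <stdbool.h>
#include <stdio.h>

enum state { INVALID, SHARED, MODIFIED };
enum bus_msg { MSG_NONE, MSG_READ, MSG_INVALIDATE };

struct outcome {
    enum bus_msg sent;     /* message placed on the bus, if any       */
    bool writeback;        /* is a dirty line written back to memory? */
    enum state requester;  /* new state in the requesting cache       */
    enum state snooper;    /* new state in the snooping cache         */
};

int main(void) {
    /* From (a) modified/invalid, core 2 reads: core 2 sends a read
     * request, core 1 writes back, both end up Shared: state (d). */
    struct outcome read_from_a = { MSG_READ, true, SHARED, SHARED };
    printf("writeback=%d requester=%d snooper=%d\n",
           read_from_a.writeback, read_from_a.requester,
           read_from_a.snooper);
    return 0;
}
```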
---
# State Transitions from (a)

---
# State Transitions from (a)
- **Read on core 1**: cache hit, served from cache
- **Write on core 1**: cache hit, served from cache

???
- Let's start with the modified/invalid state
- So core 1 has the line modified: it's valid but not in sync with memory
- And core 2 has the line invalid: it's in the cache, but the content is out of date
- If we have either a read or a write on core 1, it is just served by the cache and nothing changes

---
# State Transitions from (a)
- **Read on core 1**: cache hit, served from cache
- **Write on core 1**: cache hit, served from cache
- **Read on core 2**:
  - 2 places a read request on the bus, snooped by 1
  - 1 writes back to memory, goes to S state, and sends the data to 2, which goes to S state
  - Overall change to state (d): shared/shared

???
- Now if there is a read on core 2, this is what happens
- Because the line is invalid, it can't be served from core 2's cache
- Core 2 places a read request on the bus
- It gets snooped by core 1
- We can have only one cache in the modified state, so with this particular protocol we are aiming at a shared/shared final state
- So core 1 writes back the data to memory to have it in sync, and goes to the shared state
- Core 1 also sends the line content to core 2, which switches to the shared state
- And we end up in the shared/shared state, i.e. both caches have the line valid and in sync with memory

---
# State Transitions from (a)
- **Write on core 2**:
  - 2 first needs to get the line **(write size < line size)**
  - 2 places a read request on the bus
  - 1 snoops the request, sends the line to 2 and, as it is in M state, writes back to memory
  - 2 places an invalidate request, core 1 switches to I
  - 2 writes in its cache and switches to M
  - Overall state changes to (a'): invalid/modified

???
- Now, last thing for this state: what happens if there is a write on core 2
- Core 2 has the line invalid, so it places a read request on the bus
- It is snooped by core 1
- Core 1 has the line in the modified state, so it first writes back to memory, and sends the line to core 2
- Core 2 updates the line, so it sends an invalidate message on the bus
- Core 1, which already wrote the line back when serving the read, switches to the invalid state
- And 2 switches to modified
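A minimal C sketch (my own encoding, following the rules just described) of all four accesses from state (a), where core 1 holds the line in M and core 2 in I:

```c
/* Sketch: the four accesses from state (a) modified/invalid, with the
 * resulting dual-core state (illustrative encoding, core 1 holds M). */
#include <stdio.h>

enum state { I, S, M };
static const char *n[] = { "invalid", "shared", "modified" };

static void from_a(int core, char op, enum state st[2]) {
    st[0] = M; st[1] = I;                 /* reset to state (a)        */
    if (core == 1) return;                /* hits in core 1's cache:   */
                                          /* no bus traffic, no change */
    if (op == 'r') { st[0] = S; st[1] = S; }  /* writeback, share: (d) */
    else           { st[0] = I; st[1] = M; }  /* writeback, invalidate,
                                                 then write: (a')      */
}

int main(void) {
    enum state st[2];
    const struct { int core; char op; } acc[] =
        { {1,'r'}, {1,'w'}, {2,'r'}, {2,'w'} };
    for (int i = 0; i < 4; i++) {
        from_a(acc[i].core, acc[i].op, st);
        printf("%c on core %d -> %s/%s\n", acc[i].op, acc[i].core,
               n[st[0]], n[st[1]]);
    }
    return 0;
}
```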
---
# State Transitions from (b)
- **Read on core 1**:
  - Cache miss; core 1 does not know the state of the line in the other core, places a read request on the bus, nobody answers
  - Fetches from memory, switches to S
  - Goes to (c'): shared/invalid
- **Read on core 2** is similar by symmetry
  - Goes to state (c): invalid/shared

???
- Now let's see transitions from the invalid/invalid state
- In that state both caches have the line, but its content is out of date
- If there is a read on one core, the core in question places a read request on the bus
- Nobody answers, so the core fetches the data from memory and switches the state to shared
- The system ends up in the shared/invalid state

---
# State Transitions from (b)
- **Write on core 1**:
  - Core 1 does not know the state of the line in other cores, places a read request on the bus, nobody answers
  - Fetches from memory and performs the write in its cache, switches to M
  - State goes to (a): modified/invalid
- **Write on core 2** is similar by symmetry
  - Goes to state (a'): invalid/modified

???
- Now if there is a write on a core, the core does not know about the status of the line in other cores, so it places a read request on the bus
- Nobody answers, so the line is fetched from memory and the write is performed in the cache, and core 1 switches the state to modified
- We end up in modified/invalid
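The same kind of sketch for state (b), invalid/invalid: the read request gets no answer, so the line always comes from memory (encoding and names are mine, as in the previous sketch):

```c
/* Sketch: accesses from state (b) invalid/invalid. Nobody answers the
 * read request, so the line always comes from memory. */
#include <stdio.h>

enum state { I, S, M };
static const char *n[] = { "invalid", "shared", "modified" };

static void from_b(int core, char op, enum state st[2]) {
    st[0] = I; st[1] = I;            /* reset to state (b)             */
    /* read request on the bus, nobody answers, fetch from memory...  */
    if (op == 'r')
        st[core - 1] = S;            /* ...then Shared after a read    */
    else
        st[core - 1] = M;            /* ...or Modified once the write
                                        is performed in the cache      */
}

int main(void) {
    enum state st[2];
    const struct { int core; char op; } acc[] =
        { {1,'r'}, {1,'w'}, {2,'r'}, {2,'w'} };
    for (int i = 0; i < 4; i++) {
        from_b(acc[i].core, acc[i].op, st);
        printf("%c on core %d -> %s/%s\n", acc[i].op, acc[i].core,
               n[st[0]], n[st[1]]);
    }
    return 0;
}
```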
---
# State Transitions from (c)
- **Read on core 1**:
  - 1 places a read request on the bus, gets snooped by 2
  - 2 sends the value to 1, which goes to S
  - Overall state goes to (d): shared/shared
- **Read on core 2**:
  - Cache hit, served from the cache, stays in (c): invalid/shared

???
- Let's now have a look at the (c) state
- The line is in the invalid state on one core, so it's present but the content is out of date
- And the other core has the line in the shared state, so it is present, valid, and in sync with memory
- In case of a read on core 1, the core places a read request on the bus; it is snooped by 2, which replies with the cache line
- Core 1 switches to the shared state; the system is now in the shared/shared state
- In case of a read on core 2, the read is served from the cache and the system does not change

---
# State Transitions from (c)
- **Write on core 1**:
  - 1 places a read request on the bus, snooped by 2
  - 2 sends the line to 1; it was in S so no need for writeback
  - 1 places an invalidate request on the bus, 2 goes to I
  - 1 performs the write in its cache and goes to M
  - Overall state goes to (a): modified/invalid

???
- Now in case of a write on core 1, here is what happens
- Because core 1 has the line in the invalid state, it starts by placing a read request on the bus
- Core 2 snoops the request and replies with the data
- It's in the shared state so there is no need for writeback
- Core 1 is going to update the line, so it places an invalidate request on the bus
- Core 2 receives it and switches to invalid
- Finally core 1 performs the write and switches to modified
- We end up in the modified/invalid state

---
# State Transitions from (c)
- **Write on core 2**:
  - 2 does not know the line state in other caches, places an invalidate request on the bus
  - 2 performs the write in its cache and goes to M
  - Overall state goes to (a'): invalid/modified

???
- In the case of a write on core 2, even if nobody needs to invalidate anything, core 2 does not know it, so it places an invalidate message on the bus
- Afterwards it performs the write in the cache and switches to the modified state
- We end up in the invalid/modified state
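And a sketch for state (c), invalid/shared, with core 1 holding the line in I and core 2 in S (again with an illustrative encoding):

```c
/* Sketch: accesses from state (c) invalid/shared (core 1 I, core 2 S). */
#include <stdio.h>

enum state { I, S, M };
static const char *n[] = { "invalid", "shared", "modified" };

static void from_c(int core, char op, enum state st[2]) {
    st[0] = I; st[1] = S;                        /* reset to state (c) */
    if (op == 'r') {
        if (core == 1) st[0] = S;  /* core 2 supplies the line: (d)    */
        /* a read on core 2 is a hit: nothing changes                  */
    } else {
        /* invalidate on the bus, then write in the local cache        */
        st[0] = (core == 1) ? M : I;
        st[1] = (core == 1) ? I : M;
    }
}

int main(void) {
    enum state st[2];
    const struct { int core; char op; } acc[] =
        { {1,'r'}, {2,'r'}, {1,'w'}, {2,'w'} };
    for (int i = 0; i < 4; i++) {
        from_c(acc[i].core, acc[i].op, st);
        printf("%c on core %d -> %s/%s\n", acc[i].op, acc[i].core,
               n[st[0]], n[st[1]]);
    }
    return 0;
}
```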
---
# State Transitions from (d)
- **Read on core 1 or 2**: cache hit, served from the cache
- **Write on core 1**:
  - 1 places an invalidate request on the bus, gets snooped by 2
  - 2 goes to I; it was in S so no need for writeback
  - 1 performs the write in its cache, goes to M
  - Overall state goes to (a): modified/invalid
- **Write on core 2**: by symmetry, state goes to (a'): invalid/modified

???
- Finally, let's see the (d) state, in which both caches have the data in the shared state
- If there is a read on any core, it is simply served from the cache
- If there is a write on a core, the core in question places an invalidate request on the bus
- The other core snoops the request and switches to invalid; it was shared so there is no need for writeback
- The first core can then perform the write and switch to the modified state; we end up in the modified/invalid state
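Putting the whole dual-core walkthrough together, here is a speculative C sketch of a single step function that reproduces the transitions above, including the accesses from state (d); the encoding is mine, not a hardware description:

```c
/* Sketch pulling the dual-core walkthrough together: one step function
 * applying the MSI rules above (illustrative encoding). */
#include <stdio.h>

enum state { I, S, M };
static const char *n[] = { "I", "S", "M" };

/* Apply one access (op 'r' or 'w') by `core` to the pair of states. */
static void step(enum state st[2], int core, char op) {
    int me = core - 1, other = 1 - me;
    if (op == 'r') {
        if (st[me] != I) return;            /* hit: served locally     */
        /* read request on the bus; an M owner writes back first, any
         * valid copy answers, and everyone ends up Shared             */
        if (st[other] != I) st[other] = S;  /* (writeback if it was M) */
        st[me] = S;                         /* from memory if no copy  */
    } else {
        if (st[me] == M) return;            /* hit: already exclusive  */
        /* fetch if needed, invalidate the other copy, then write      */
        st[other] = I;                      /* (an M owner writes back) */
        st[me] = M;
    }
}

int main(void) {
    enum state st[2] = { S, S };            /* state (d) shared/shared  */
    printf("start       %s/%s\n", n[st[0]], n[st[1]]);
    step(st, 1, 'w');                       /* -> (a)  modified/invalid */
    printf("w on core 1 %s/%s\n", n[st[0]], n[st[1]]);
    step(st, 2, 'r');                       /* -> (d)  shared/shared    */
    printf("r on core 2 %s/%s\n", n[st[0]], n[st[1]]);
    step(st, 2, 'w');                       /* -> (a') invalid/modified */
    printf("w on core 2 %s/%s\n", n[st[0]], n[st[1]]);
    return 0;
}
```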
---
# Beyond Two Cores

.leftcol[
- Extension beyond 2 cores:
  - Snoopy bus messages are **broadcast to all cores**
  - Any core with a valid value can respond to a read request
  - Upon receiving an invalidate request:
    - Any core in S invalidates without writeback
    - A core in M writes back then invalidates
]
.rightcol[
]

???
- This protocol easily extends to more than 2 cores
- Because of the way the snoopy bus works, the read and invalidate messages are in effect broadcast to all cores
- Any core with a valid value (shared or modified) can reply to a read request
- For example, in the first case at the top, core 1 has the line in the invalid state and wishes to perform a read, so it broadcasts a read request on the bus and one of the cores having the line in the shared state replies
- When an invalidate request is received, any core in the shared state invalidates without writeback; in the second example this is the case for core 3 and core 4
- Likewise, when an invalidate message is received, a core in the modified state writes back before invalidating

---
# Write-Invalidate vs. Write-Update

2 types of snooping protocols:
- **Write-invalidate**:
  - When a core updates a cache line, other copies of that line in other caches are **invalidated**
  - Future accesses on the other copies will require fetching the updated line from memory/other caches
  - Most widespread approach, used in MSI (this lecture), MESI, MOESI (next video)

--

- **Write-update**:
  - When a core updates a cache line, the modification is broadcast to copies of that line in other caches: they are **updated**
  - **Higher bus traffic**, hence less popular vs. write-invalidate
  - Example protocols: Xerox PARC Dragon, DEC Firefly

---
# Major Implication

- The states of a given line in each core must at all times be **coherent**
- Don't want to end up with e.g. 2 cores with the line in M state

--

- If one core writes and broadcasts an invalidate:
  - No other core must be able to perform a read/write to that location as though it had not seen the invalidate
  - **All cores must see the invalidate at the same time, i.e. within the same bus cycle**

???
- So given our description of the way cache snooping works
- When an invalidate message is sent, it's important that all cores receive the message within a single bus cycle, so that they invalidate at the same time
- If not, one core may have time to perform a write during the process, and this breaks consistency

--

- As we connect more cores this becomes more and more difficult
- **The coherence protocol is a major limitation to the number of cores that can be supported**

???
- This is hard to achieve with high numbers of cores
- As we connect more cores, the bus gets longer and the signal takes more time to propagate
- With more cores the bus capacitance is also higher, and the bus cycle is longer
- This seriously impacts performance
- So overall, the coherence protocol is a major limitation to the number of cores that can be supported by a CPU

---
# Cache Hierarchies

- Most modern CPUs have several levels of cache
- Per-core L1, shared L3, and either per-core or shared L2
- Higher levels: higher capacities and access latencies

---
name: inclusive
# Cache Hierarchies: Inclusive Policy

- With the inclusive policy, **the content of higher level caches is always a superset of the content of lower level caches**, as sketched below

---
template: inclusive
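To illustrate the inclusion invariant, here is a toy C sketch under stated assumptions: the caches are modelled as small sets of line addresses, and the check verifies that every L1 line is also present in L2; evicting a line from L2 without also removing it from L1 breaks inclusion.

```c
/* Toy sketch of the inclusion invariant: tiny fully-associative
 * "caches" as address sets, L2 strictly larger than L1. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define L1_LINES 2
#define L2_LINES 4

static uint64_t l1[L1_LINES], l2[L2_LINES];  /* 0 = empty slot */

static bool in(const uint64_t *c, int sz, uint64_t a) {
    for (int i = 0; i < sz; i++) if (c[i] == a) return true;
    return false;
}

/* Inclusive policy: every line in L1 must also be present in L2. */
static bool inclusion_holds(void) {
    for (int i = 0; i < L1_LINES; i++)
        if (l1[i] && !in(l2, L2_LINES, l1[i])) return false;
    return true;
}

int main(void) {
    l2[0] = 0x1000; l1[0] = 0x1000;  /* fill L2 first, then L1 */
    printf("inclusion holds: %d\n", inclusion_holds());  /* 1 */
    /* evict from L2 without back-invalidating L1: inclusion broken */
    l2[0] = 0x2000;
    printf("inclusion holds: %d\n", inclusion_holds());  /* 0 */
    return 0;
}
```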
---
template: inclusive

---
template: inclusive

---
template: inclusive
---
# Summary

- **Cache coherence** is necessary in shared memory multiprocessors, for cores to have a consistent view of memory
- Simple MSI protocol
  - **Modified**/**Shared**/**Invalid** states
  - Associated transitions upon reads/writes from cores:
    - Invalidate and line read requests on the interconnect
    - Reads/writes from/to main memory
    - State changes
- The bus-based CC protocol limits the number of cores supported
- Next:
  - MSI's optimisations: MESI and MOESI
  - Directory-based coherence protocol

???
- So let's wrap up
- It is necessary to maintain coherence between the content of the caches that are present per-core in a multiprocessor
- We have seen an example of a simple protocol named MSI, relying on 3 states: modified, shared and invalid
- We covered each possible state in a dual-core processor, as well as how the system evolves upon data read and write requests from the cores
- Both in terms of main memory accesses, invalidate and line read messages on the bus, and the corresponding state transitions
- Finally, we explained that a bus-based cache coherence protocol is not a panacea, and that it limits the number of cores we can have in a shared memory multiprocessor
- In the next videos, you'll first see two optimisations to MSI: MESI and MOESI
- You'll also see another way of achieving cache coherence that does not require broadcast messages, and thus does not require directly connected buses as we have seen: directory-based protocols