class: center, middle background-image: url(include/title-background.svg) # COMP35112 Chip Multiprocessors
# .white[Directory-Based Cache Coherence] .white[Pierre Olivier] ??? - Hello everyone - In this video we will cover a cache coherence protocol that is implemented quite differently from the bus-based protocols we have seen so far - It is named the directory-based coherence protocol --- # Directory-Based Coherence - Shared bus coherence does not scale to a large number of cores - Coherence scheme with a less directly connected network? ??? - So shared-bus-based coherence does not scale well to a large number of cores - This is because only one entity can use the bus at a time, which limits performance - Now the question is: can we implement a cache coherence protocol with a less directly connected network? --
??? - For example a grid, or a general packet-switched network -- - **One possible solution: a directory centralising cache line information** ??? - One possible solution is to use a **_directory_ holding information about data (i.e. cache lines) in memory** - We are going to describe a simple version of this scheme, which uses a protocol resembling the MSI protocol seen previously --- # Directory Structure .leftcol[
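A minimal C sketch of the metadata described on the right (names, types and sizes are illustrative assumptions, not from any real implementation):

```c
#include <stdbool.h>
#include <stdint.h>

#define NCORES 4 /* assumed number of cores */

/* One directory entry per memory line */
struct dir_entry {
    uint8_t present; /* bit i set: core i has a copy */
    bool    dirty;   /* a single out-of-sync owner exists */
};

/* Metadata kept with each line in each local cache */
struct cache_line {
    bool valid; /* line holds usable data */
    bool dirty; /* this core is the sole, out-of-sync owner */
};
```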
] .rightcol[ - Each **directory entry** (one per cache line) has: - *Present* bitmap: which cores have a copy - *Directory dirty bit*: only one owner, and that copy is out of sync with memory - Each line in each cache also has: - *Local valid bit*: is the line valid? - *Local dirty bit*: core is sole owner and data is out of sync with memory ] ??? - Overall the architecture of a directory-based cache coherence system looks like this - We have the cores on the left, each with its local cache - They are connected to an interconnect network, as mentioned, something less directly connected than a bus - The memory is also connected to the network, as well as a component named the directory - The directory contains one entry for each possible cache line, so its size depends on the amount of memory - Each entry in the directory has n present bits, n being the number of cores - If one of these bits is set, it means that the corresponding cache has a copy of the line in question - For each entry there is also a dirty bit in the directory - When it is set it means that only one cache has the line, that this cache is the sole owner of the line, and that its copy is out of sync with memory - In every cache each line also has a local valid and a local dirty bit - The local valid bit indicates if the line is valid or not - The local dirty bit indicates if the core is the sole owner of the data, in which case the data is also out of sync with memory - A core wishing to make a memory access may need to query the directory about the state of the line to be accessed --- # Directory Protocol **Read hit in local cache** - No need for directory access: simply read local value
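As a small sketch (reusing the illustrative `struct cache_line` from the Directory Structure slide), a hit only needs a local check:

```c
/* Read hit: the line is valid locally, so the read is served
   from the local cache with no directory traffic at all. */
bool can_read_locally(const struct cache_line *l)
{
    return l->valid;
}
```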
??? - So, as we did for MSI, let's have a look at what happens in various scenarios - We'll start with an easy one: a read hit in the local cache - Core 1 wants to read some data, and it is both present and valid in the cache - There is no need to contact the directory, and core 1 just reads from the cache --- # Directory Protocol .leftcol[ .medium[ **Read miss in local cache**: - ***Directory dirty bit unset?*** - Query directory and get the line from another cache if present, fetch from main memory otherwise - Set directory present bit for the reading core - Set local valid bit in the cache of the reading core ] ] .rightcol[
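A sketch of the bookkeeping for the case described on the left, continuing the illustrative structures defined earlier; interconnect traffic is left as comments:

```c
/* Read miss, directory dirty bit unset: any existing copies are
   in sync with memory. */
void read_miss_clean(struct dir_entry *e, struct cache_line *l,
                     int core)
{
    /* if e->present != 0, fetch the line from any holder;
       otherwise fetch it from main memory */
    e->present |= 1u << core; /* the reader now holds a copy */
    l->valid = true;          /* local line becomes usable */
}
```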
] ??? - Now let's see what happens upon a read miss, either when the data is not present, or present but not valid, in the local cache - Here core 1 wants to read, but the line is invalid - First we check the dirty bit for the line in the directory - If it's unset, any copies present in other caches are up to date with memory, or the line might not be in any cache at all - The present bits in the directory are consulted to see if any cache holds the line - If that's the case the line can be retrieved from there, otherwise it is fetched from memory - Once that is done we set the reading core's present bit in the directory, as well as the valid bit in its local cache --- # Directory Protocol .leftcol[ .medium[ **Read miss in local cache**: - ***Directory dirty bit set?*** - One-and-only owner core updates memory, clears its dirty bit, and sends the line to the reading core - Clear directory dirty bit - Set the reading core's present bit in the directory, and its local valid bit ] ] .rightcol[
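The dirty case in the same illustrative sketch; the owner's actions stay as comments:

```c
/* Read miss, directory dirty bit set: exactly one cache owns a
   newer version of the line. */
void read_miss_dirty(struct dir_entry *e, struct cache_line *l,
                     int core)
{
    /* the owner writes the line back to memory, clears its own
       local dirty bit, and forwards the line to the reader */
    e->dirty = false;         /* no exclusive owner any more */
    e->present |= 1u << core; /* the reader now holds a copy */
    l->valid = true;
}
```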
] ??? - Now consider a read miss with the directory dirty bit set - We know for sure that exactly one other cache has the latest version of the data, and that it is out of sync with memory - That remote core syncs up with memory, clears its dirty bit, and sends the line to the reading core - Then the directory dirty bit can be cleared, as there is no exclusive owner anymore - We also set the reading core's present bit in the directory, and its local valid bit --- # Directory Protocol **Write hit in local cache with local dirty bit set** - Just update local cache
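A sketch under the same illustrative structures: the write touches only local state:

```c
/* Write hit with the local dirty bit set: this core is already
   the sole owner, so the write is purely local. */
void write_hit_owned(struct cache_line *l, int *data, int off, int v)
{
    data[off] = v;   /* just update the local cache */
    l->dirty = true; /* stays set; no directory message */
}
```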
??? - Let's switch to writes - In the case of a write hit in the local cache with the local dirty bit set, things are easy: the core knows it is the exclusive owner, so it just updates the data in the local cache --- # Directory Protocol **Write hit in local cache with local dirty bit unset** .leftcol[.medium[ - ***Directory dirty bit unset?*** - Consult present bits in directory and send invalidate messages to any cores having the data - Set local dirty bit for the writing core - Set directory dirty bit ]] .rightcol[
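A sketch of the invalidation step over the illustrative metadata:

```c
/* Write hit, local dirty bit unset: invalidate all other copies,
   then take exclusive ownership. */
void write_hit_shared(struct dir_entry *e, struct cache_line *l,
                      int core)
{
    for (int c = 0; c < NCORES; c++)
        if (c != core && (e->present & (1u << c))) {
            /* send invalidate; core c clears its local valid bit */
            e->present &= ~(1u << c);
        }
    e->dirty = true; /* the writer is now the sole owner... */
    l->dirty = true; /* ...and its copy is out of sync with memory */
}
```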
] ??? - Now consider a write hit, here for example on core 1, with the local dirty bit unset - In this scenario the directory dirty bit should be unset - It means there is no sole owner of the cache line, but other caches could hold copies of the line that are in sync with memory - So we consult the directory present bits and send an invalidate message to all cores whose present bit is set - Upon receiving it these cores clear their local valid bits, and we also clear the present bit for these cores in the directory - The writing core performs the update - The writing core also sets its local dirty bit as well as the directory dirty bit, as it is now the sole owner of the data, which is out of sync with memory --- # Directory Protocol **Write hit in local cache with local dirty bit unset** .leftcol[.medium[ - ***Directory dirty bit unset?*** - Send invalidate to any core x with p[x] set and then clear these bits - Set p[i] bit for writing core and set directory dirty bit - Set local dirty bit - ***Directory dirty bit set?*** - **Cannot happen:** local dirty bit unset means the line is shared and there cannot be a unique owner ]] .rightcol[
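The impossible case can be stated as an invariant over the illustrative structures:

```c
#include <assert.h>

/* A valid copy whose local dirty bit is unset means the line is
   shared, so the directory cannot record an exclusive owner. */
void check_no_unique_owner(const struct dir_entry *e,
                           const struct cache_line *l)
{
    assert(!(l->valid && !l->dirty && e->dirty));
}
```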
] ??? - Note that a write hit with the local dirty bit unset and the directory dirty bit set cannot happen: if our cache holds a valid copy with its local dirty bit unset, the line is shared, so no core can be the exclusive owner and the directory dirty bit must be unset --- # Directory Protocol .leftcol[.medium[ **Write miss in local cache** - ***Directory dirty bit unset?*** - Consult directory present bits, get cache line from any core that has it or, if none, fetch line from memory - Send invalidate to any core x with present bit set in the directory, and clear those bits - Set directory present bit for the writing core and set directory dirty bit - Set local dirty bit in the writing core ]] .rightcol[
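A sketch of this case, again with message traffic as comments:

```c
/* Write miss, directory dirty bit unset: fetch the line, then
   invalidate every other copy and take exclusive ownership. */
void write_miss_clean(struct dir_entry *e, struct cache_line *l,
                      int core)
{
    /* fetch the line from any core with its present bit set,
       or from main memory if e->present == 0 */
    for (int c = 0; c < NCORES; c++)
        if (c != core && (e->present & (1u << c))) {
            /* send an invalidate message to core c */
            e->present &= ~(1u << c);
        }
    e->present |= 1u << core; /* only the writer remains */
    e->dirty = true;          /* sole out-of-sync owner */
    l->valid = true;
    l->dirty = true;
}
```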
] ??? - So what about write misses? - Once again things depend on the directory dirty bit - If it is not set there is no exclusive owner - After consulting the present bits in the directory, the writing core first gets the cache line from another core, or fetches it from memory if no other core has it - Then it sends an invalidate request to all cores with the present bit set - The writing core then performs the write and updates three things: - Its local dirty bit, its present bit in the directory, and the directory dirty bit, as it is now the sole owner of the cache line, which is out of sync with memory --- # Directory Protocol .leftcol[.medium[ **Write miss in local cache** - ***Directory dirty bit set?*** - Writing core sends message to owner core to update memory and send the cache line - Writing core updates data and sets its local dirty bit - Leave directory dirty bit set - Clear previous owner's directory present bit, set the writing core's directory present bit ]] .rightcol[
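In the same illustrative sketch, ownership simply migrates:

```c
/* Write miss, directory dirty bit set: exclusive ownership moves
   from the previous owner to the writing core. */
void write_miss_dirty(struct dir_entry *e, struct cache_line *l,
                      int core, int owner)
{
    /* the previous owner updates memory, then forwards the line */
    e->present &= ~(1u << owner); /* old owner drops its copy */
    e->present |= 1u << core;     /* the writer is the new owner */
    /* e->dirty stays set: there is still one exclusive owner */
    l->valid = true;
    l->dirty = true;
}
```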
] ??? - Now, when there is a write miss and the directory dirty bit is set - It means that another core has the exclusive, latest version of the data - The writing core sends a message to the owner, which updates memory and sends the cache line to the writing core - Next the writing core performs the write, and sets its local dirty bit - In the directory, the dirty bit stays at 1 because we still have an exclusive owner - However that owner has changed, so we update the directory present bits accordingly --- # Analysis - ~MSI - Optimisations are possible - **Central directory is a serious bottleneck** - Distribute directory and cache it - Used in multi-socket (multiple CPU chips) systems ??? - So we described a directory-based protocol that is roughly equivalent to the bus-based MSI protocol - There are multiple possible optimisations but we won't go into much detail - The important thing to note is that, even if directory-based coherency is designed to scale to more cores than snooping, having a single directory centralising coherency metadata is a serious bottleneck - So the solution is to distribute this metadata, and have multiple directories, each taking care of a subset of the memory address space - This is often coupled with a distributed memory structure where part of the memory is physically local to the processor and part is remote - This is typical of medium and large multiprocessor systems that have multiple CPU chips --- # NUMA Systems
NUMA stands for Non-Uniform Memory Access ??? - Here's an example - We have two sockets, i.e. two processor chips, connected so that they can operate on a single shared address space - However, part of the physical memory is local to socket 1 and part is local to socket 2, and we have two directories - Accessing non-local memory takes more time, which is why these systems are called NUMA, non-uniform memory access --- # Drawbacks - Slower communications - Long/variable delays require handshakes - Machines that used directory-based coherency: - SGI Origin (up to 2048 cores), Xeon Phi (60 cores).
??? - So directory-based coherency is not a panacea, and there are a few drawbacks - Without a common bus, many of the **communications described previously will take a significant number of CPU cycles** - In the presence of long and possibly variable delays, such protocols usually require **replies to messages**, i.e. handshakes, to work correctly - Some machines that used directory-based coherency include the SGI Origin, which scaled up to 2048 cores, and the Intel Xeon Phi, with 60 cores - However many doubt that it can be made to work efficiently for heavily shared memory applications --- # Summary - Bus-based snooping does not scale to large numbers of cores - Directory-based coherency to the rescue, but not a panacea - ***Do we really need cache coherency?*** ??? - In summary - Cache coherence implemented by bus snooping does not scale to large numbers of cores - Directory systems do not need a bus - But the inherent delays and communication overheads are unlikely to lead to a solution for heavily used large-scale shared memory - A major question is **whether cache coherence is really necessary in a shared memory system** - Much of this is concerned with the parallel programming model used