class: center, middle
background-image: url(include/title-background.svg)

# COMP35112 Chip Multiprocessors
# .white[Introduction 1]

.white[Pierre Olivier]

---

# Chip Multiprocessors?
.leftcol[
]

???

- So this is an abstract model of how you have seen a processor until now
- In the core of the processor you have an arithmetic logic unit, the ALU, that performs the computations
- You also have a set of registers that can hold the instructions' operands
- And some additional control logic
- And you also have an on-chip cache holding recently accessed data and instructions
- That's the processor
- And it fetches data and instructions from the main memory, as the toy sketch below illustrates
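To make the model concrete, here is a toy C sketch (my illustration, not from the slides) of the fetch-decode-execute cycle just described, using an invented three-instruction ISA: `reg` models the register file, `mem` the main memory, and the `switch` plays the role of the control logic and ALU.

```c
/* Toy sketch of the abstract single-core model (illustration only,
 * invented 3-instruction ISA): control logic fetches instructions,
 * registers hold operands, the ALU computes, and both instructions
 * and data live in "main memory". */
#include <stdio.h>

enum op { LOAD, ADD, HALT };

struct insn { enum op op; int rd, a, b; };

int main(void) {
    int reg[4] = {0};                  /* register file              */
    int mem[8] = {5, 7};               /* data in main memory        */
    struct insn prog[] = {             /* program, also in memory    */
        { LOAD, 0, 0, 0 },             /* r0 = mem[0]                */
        { LOAD, 1, 0, 1 },             /* r1 = mem[1]                */
        { ADD,  2, 0, 1 },             /* r2 = r0 + r1 (ALU's job)   */
        { HALT, 0, 0, 0 },
    };

    for (int pc = 0; prog[pc].op != HALT; pc++) {   /* fetch loop    */
        struct insn i = prog[pc];                   /* fetch/decode  */
        switch (i.op) {                             /* execute       */
        case LOAD: reg[i.rd] = mem[i.b];            break;
        case ADD:  reg[i.rd] = reg[i.a] + reg[i.b]; break;
        default:                                    break;
        }
    }
    printf("r2 = %d\n", reg[2]);       /* prints r2 = 12             */
    return 0;
}
```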
---

# Chip Multiprocessors?

.leftcol[
]

???

- In a chip multiprocessor, also called a multicore processor, you have several instances of the core of the processor on a single chip
- Here we have a dual-core
- Each core is a duplicate of most of what you will find in a single-core CPU: ALU, registers, caches, etc.
- So while a single-core processor can only execute one instruction at a time, a multicore can execute *n* instructions at the same time

--
.center[Why was this invented?]

---

# Chip Multiprocessors?
???

- Here are some pictures of transistors
- From a very simplistic point of view, you can see these as ON/OFF switches that sometimes let the current flow and sometimes not
- These are the basic blocks with which we construct the logic gates making up the modern integrated circuits used in processors

---

# Chip Multiprocessors?

- Computing power of a CPU is a function of the number of transistors it integrates
???

- Now the computing power of a processor is proportional to the number of transistors it is made of
- If you look at the first commercialised processor, in 1971, it had in the order of thousands of transistors
- A few years later, processors were made of tens of thousands of transistors
- And a few years later it was hundreds of thousands
- And since the 2000s we are talking about millions of transistors
- Since that time, processors have also started to have multiple compute units, or cores
- Before that they had only one
???

- Fast forward closer to today, we now have tens of billions of transistors in a single chip
- As well as several compute units
- We have here a desktop processor with 6 cores
- An embedded processor with 4 cores
- And a recent server processor with 64 cores
- Why did processors start to have more and more cores?

--

**Why this increase in the number of compute units (cores) per chip?**

---

# Core Count Increase

- For decades, until the mid-2000s, we had seen a continual increase in single-core processor **speed**
- Over time, transistor size would get (significantly) smaller, circuits would get (a bit) bigger, and the **clock frequency would increase**
- Program is slow? Wait for the next-generation CPU to get an automatic speed boost → free lunch for the programmer

???

- For over 40 years we have seen a continual increase in the power of processors
- This has been driven primarily by improvements in (silicon) integrated circuit (IC) technology
- Circuits will continue to get bigger (more transistors on a single 'circuit')

--

- Due to power consumption and heat dissipation reasons, the **clock speed stalled** in the mid-2000s
- **Architectural approaches to increase single processor speed have been exhausted**

???

- But the basic circuit speed has now become limited
- Architectural approaches to increase single processor speed have been exhausted

---

# Core Count Increase

- Processors don't get faster, but we can still put more transistors in a single integrated circuit
- Solution: **more cores**! Several closely coupled compute units, i.e. several processors, working together on a single circuit

???

- Now what do we do, if the speed of a single processor cannot increase?
- Well, we put more processors on a single chip
- And we try to have them work together somehow
- Note that the terms core and processor will be used interchangeably in the rest of this course

--

- This has several implications, and **2 x 3GHz != 6GHz**

--

- **Hardware/architectural issues**: what kind of processor(s) to use? How are processors connected? How is memory organised?

???

- OK, now I want to create a dual-core processor
- Will it be as fast as the single-core version?
- It's not that simple, it has several implications, and this is what we'll cover in this course:
- First, in terms of hardware, what processor(s) to put on my multicore? How to connect them? How do they access memory?

--

- **Software issues**: how do we program this thing?

???

- And then we have the software issues, in other words how do we program these things? Can we still use the programming methods and languages that worked well on single cores? A first taste of the problem is sketched below
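To give a first concrete idea of the software issue, here is a minimal sketch (an illustration of mine, not from the original slides) of explicitly parallel C code, assuming POSIX threads; the names `partial_sum`, `NTHREADS` and `struct chunk` are invented for the example. The point: a program only uses several cores if the programmer (or a library/compiler) divides the work across threads.

```c
/* Minimal sketch (illustration only): summing an array with POSIX
 * threads. The programmer must explicitly split the work so that
 * each core gets a slice. */
#include <pthread.h>
#include <stdio.h>

#define N        (1 << 20)
#define NTHREADS 4            /* e.g. one thread per core */

static int data[N];

struct chunk { int start, end; long sum; };

static void *partial_sum(void *arg) {
    struct chunk *c = arg;
    for (int i = c->start; i < c->end; i++)
        c->sum += data[i];    /* each thread sums its own slice */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    struct chunk chunks[NTHREADS];
    long total = 0;

    for (int i = 0; i < N; i++) data[i] = 1;

    /* explicitly divide the iteration space among the threads */
    for (int t = 0; t < NTHREADS; t++) {
        chunks[t] = (struct chunk){ t * (N / NTHREADS),
                                    (t + 1) * (N / NTHREADS), 0 };
        pthread_create(&tid[t], NULL, partial_sum, &chunks[t]);
    }
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += chunks[t].sum;   /* combine the partial results */
    }
    printf("total = %ld\n", total);
    return 0;
}
```

Run as-is, the sequential equivalent of this loop would use exactly one core of a multicore chip: without explicit parallelism, extra cores bring no speedup.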
---

# Moore's Law

- Not really a law, more an observation/prediction
- **Transistor count on a chip doubles roughly every 2 years**
- What drove this increase, transistor size or chip size?

???

- Now we have this thing called Moore's law
- It's more of an observation by the engineer Gordon Moore that the number of transistors on a chip doubles roughly every 2 years

--

- Feature size (basically transistor size)
  - Intel 386 (1985): 1.5 µm
  - Apple M1 Max (2021): 5 nm
  - Decrease factor: 300x

--

- Die (chip) size
  - 386: 100 mm² (275K transistors)
  - M1 Max: 420.2 mm² (57B transistors)
  - Increase factor: 4x

???

- It depends on two factors
- The size of a transistor: as you can see, it went from 1.5 µm per transistor in 1985 to 5 nm in 2021, that's a 300x decrease
- And the other factor is the size of the chip itself, and you can see this one went up by only 4x over the same period of time

--

- Increase in transistors/chip mostly due to **transistor size reduction**

???

- So the increase in transistors per chip is mainly due to transistor size reduction

---

# Smaller Meant Faster, but not Anymore

- Due to electrical properties:
  - Smaller transistors have a faster switch delay
  - **They can be clocked at a higher frequency**
  - Circuit gets faster, programmer's free lunch

--

- **But we hit limits**
  - Power density increases, heating becomes a problem

???

- So why do smaller transistors translate into faster circuits?
- From a high-level point of view, due to their electrical properties, smaller transistors have a faster switch delay
- This means that they can be clocked at a higher frequency, making the overall circuit faster
- This was the good old days of clock frequency increase, the free lunch for the programmer
- Since the early 2000s we cannot do that anymore, because the transistors are so tightly packed together on the chip that clocking them at too high a frequency would basically melt the chip
- There are also reliability issues that make it harder and harder to build smaller transistors

- This cannot continue
- The previous analysis is too simple → things like interconnect capacitance start to prevent further speed increases
- Power density is increasing (more watts per unit area) → cooling becomes a serious problem (see: "Dennard Scaling", 1974)
- Small transistors have less predictable characteristics as transistor sizes start to approach the atomic structure (impurity density) of the semiconductor → cannot build reliable circuits

---

# End of Dennard Scaling

- **Dennard Scaling** (1974)
  - Transistors get smaller, so they consume less power as we pack more on the same chip
  - Power density stays constant → no power consumption/heating problem 👍

???

- Dennard scaling was this law stating that, based on the electrical properties of transistors, as they grew smaller and we packed more on the same chip, they consumed less power, so the power density stayed constant

--

- **Dennard Scaling broke down in the mid-2000s** 🥲
  - Mostly due to the high current leakage with small transistor sizes
  - **Power wall**: in 2004, Intel cancels Tejas & Jayhawk, aimed at higher clock frequencies, to refocus on multicores

???

- Unfortunately, the law turned out to break down in the 2000s, mostly due to the high current leakage you get with smaller transistor sizes; the scaling argument is sketched below
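The slide states only the conclusion; here is a compressed sketch of the classic constant-field scaling argument behind it (my summary of standard textbook material, not from the original slides), with scaling factor κ > 1:

```latex
% Constant-field (Dennard) scaling by a factor \kappa > 1 (sketch only)
\begin{aligned}
&L \to L/\kappa, \qquad V \to V/\kappa, \qquad C \to C/\kappa, \qquad f \to \kappa f \\
&P_{\text{transistor}} = C V^2 f \;\to\;
  \frac{C}{\kappa}\cdot\frac{V^2}{\kappa^2}\cdot\kappa f
  = \frac{P_{\text{transistor}}}{\kappa^2} \\
&A_{\text{transistor}} \to \frac{A_{\text{transistor}}}{\kappa^2}
  \;\Rightarrow\; \frac{P}{A} \text{ (power density) stays constant}
\end{aligned}
```

Once transistors got small enough that leakage current dominated, the voltage could no longer scale down with κ, so per-transistor power stopped shrinking by κ² and power density rose: the power wall.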
---

# End of Dennard Scaling

???

- So as you can see, both the power and the frequency hit a plateau around that time
- And single-thread performance does too, proportionally

---

# Single Core Performance

???

- Here's another view on the issue, if you look at single-threaded integer performance
- This is based on a standard benchmark suite named SPEC
- As you can see, the rate of increase in performance has been more than divided by two
- And if there is still an increase, it does not come from frequency but from things like bigger caches or deeper / more parallel pipelines

---

# How to Go Faster?

- How to use these extra transistors to get faster?

???

- So at that point, how do we go faster?
- We can't increase the clock frequency, but we can still pack more transistors on the same chip

--

- Can we build ***faster single-core processors?***

???

- Can we build faster single-core processors?
- It is difficult

--

- **More parallel pipelines** (superscalar processors) to exploit Instruction-Level Parallelism?
  - ILP has diminishing returns beyond ~4 pipelines
- **Bigger caches**:
  - Payback for bigger caches also diminishes rapidly

???

- One idea is to have several pipelines within a single processor, but studies have shown that it's hard to exploit instruction-level parallelism beyond 4 pipelines, as illustrated below
- Increasing the size of caches also has its limits, and it is hard to get benefits past a certain size
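To see why ILP dries up, here is a small illustration of mine (not from the slides; the function names are invented): a wide superscalar core can issue independent operations together, but a dependence chain forces operations to execute one after another, leaving the extra pipelines idle.

```c
/* Illustration only: why ILP has diminishing returns.
 * A 4-wide superscalar core can issue the four independent
 * updates below in the same cycle... */
void independent(int *a, int *b, int *c, int *d) {
    *a += 1;  /* no dependences between these four    */
    *b += 1;  /* operations: a wide superscalar core  */
    *c += 1;  /* can execute them all in parallel     */
    *d += 1;
}

/* ...but each iteration here depends on the previous one
 * (a loop-carried dependence chain), so the extra pipelines
 * sit idle no matter how many the core has. */
long chain(long x, int n) {
    for (int i = 0; i < n; i++)
        x = x * 3 + 1;   /* next value needs the current one */
    return x;
}
```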
---

# The "Solution": Multiple Cores

- Put multiple CPUs (cores) on a single integrated circuit (chip)
  - **"Multicore chip" or "Chip Multiprocessor"**
- Use these CPUs in parallel to achieve higher performance
- Simpler to design vs. increasingly complex single-core CPUs
- Need more computing power? Add more cores ...
- ... not that simple in practice, 2 x 3GHz != 6GHz
  - I.e. a program written for 1 core can't expect performance improvements from running *as is* on a multicore CPU

???

- So the solution is to put multiple processors, multiple cores, on a single chip
- We get what we call a multicore chip or a chip multiprocessor
- And then we need to use these CPUs in parallel somehow to achieve higher performance
- It is actually simpler to design than complex single-core CPUs
- You could say it is a bit like copy-pasting a single-core CPU several times on a chip
- So here we go, now when you need more computing power, you just have to add more cores, problem solved, right?
- It's not that simple in practice

---

# Multicore "Roadmap"

Year, cores per chip, feature size

.leftcol[
- 2006, ~2 cores, 65nm
- 2008, ~4 cores, 45nm
- 2010, ~8 cores, 33nm
- 2012, ~16 cores, 23nm
- 2014, ~32 cores, 16nm
- 2016, ~64 cores, 12nm
- 2018, ~128 cores, 8nm
]

.rightcol[
- 2020, ~256 cores, 6nm
- 2022, ~512 cores, 4nm
- 2024, ~1024 cores, 3nm
- 2026, ~2048 cores, 2nm
- *scale discontinuity?*
- 2028, ~4096 cores
- 2030, ~8192 cores
- 2032, ~16384 cores
]

**Estimations, where are we today?** Intel Sierra Forest (2024) has 144 cores, Apple Silicon M4 uses 3 nm transistors

???

- So here is an overview of the number of cores per chip, as well as the transistor size, for the past years, along with some projections
- As you can see, we went from the first dual-cores in the 2000s to tens and even hundreds of cores today
- This has been achieved with the help of a significant reduction in feature size
- However, it is unclear whether this trend can continue, as it becomes harder and harder to build smaller transistors; a quick sanity check follows below
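As a quick sanity check of the roadmap (my own back-of-the-envelope arithmetic, using only the numbers on this slide): doubling every two years from ~2 cores in 2006 predicts

```latex
% naive projection: doubling every 2 years from ~2 cores in 2006
\text{cores}(2024) \approx 2 \times 2^{(2024-2006)/2} = 2 \times 2^{9} = 1024
```

Against Sierra Forest's 144 cores, core counts are running roughly three doublings (about six years, near the roadmap's 2018 entry) behind this naive projection, while feature size (3 nm on the M4) is on track.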
---

# Moore's Law Today

- What we've seen is sufficient to motivate "traditional" chip multiprocessors as they started to be introduced in the 2000s
  - We'll study them in the first part of the unit
- But where are we today?

--

- Transistors keep getting smaller, but at a slower pace
  - Moore's law has now broken down
- How can designers keep making increasingly efficient computers?
  - **Specialised** processors (e.g. GPUs)
  - **Heterogeneous** processors (e.g. Arm big.LITTLE)
  - Introducing different **programming models** to ease development, improve performance, etc.
- These will be covered in the second part of the unit

---

# Summary

- Single-core performance hit a plateau in the mid-2000s
- To address that issue, manufacturers started to pack more cores on a single chip
- Next: **how to efficiently exploit the resulting parallelism?**