class: center, middle ### Secure Computer Architecture and Systems *** # CPU and Memory Virtualisation ??? - Hi everyone - We have seen the key objectives of a modern hypervisor, and the requirements for an instruction set architecture to be efficiently virtualisable - Here we discuss virtualisation support in modern 64-bit ISAs, focusing on Intel x86-64 - Unlike its predecessor, that ISA was explicitly designed with virtualisation in mind --- # Introduction - **Intel x86-32 not virtualisable according to P&G** - Need to make concessions on performance (dynamic binary translation) or equivalence (paravirtualisation) ??? - We have seen that x86-32 was not virtualisable according to the Popek and Goldberg theorem - And that attempts at virtualising it had to compromise on performance or equivalence -- - One of the fundamental design goals of x86-64 was **architectural support for virtualisation**, achieved with 3 key technologies: - CPU virtualisation with Intel **Virtualisation Technology (VT-x)** - Memory virtualisation with **Extended Page Tables (EPT)** - I/O virtualisation with Intel **Virtualisation Technology for Directed I/O (VT-d)** ??? - Because of the high demand for virtualisation and the related problems with x86-32, the next-generation ISA x86-64, first proposed in the early 2000s, didn't make the same mistake - x86-64 was designed with hardware-based virtualisation support in mind - This is achieved on Intel processors with 3 key technologies - VT-x for CPU virtualisation, Extended Page Tables for memory virtualisation, and VT-d for I/O virtualisation -- Focus here on Intel, AMD has very similar technologies ??? - We'll focus on Intel here but note that AMD, the other major manufacturer of x86-64 CPUs, has very similar technologies --- class: center, inverse, middle # x86-64 CPU Virtualisation
with VT-x ??? - Let's start with CPU virtualisation --- # x86-64 CPU Virtualisation Software techniques for virtualising x86-32 face the following challenges: - **Protection ring (privilege level) aliasing and compression** - With x86, ring 0 is supervisor mode, ring 3 is user mode - Guest OS runs in a different ring (3) vs. what it was designed for (0) ??? - The existing software techniques to virtualise x86-32 had the following challenges - First, the guest OS runs at a privilege level it was not designed to run at, which is user mode - With x86, privilege levels are called rings - Supervisor mode is ring 0 and user mode is ring 3 - And when virtualised, guest OSes run in ring 3 while they were designed to run in ring 0 -- - **Address space compression** - Need to locate the hypervisor somewhere and protect it ??? - Second, the hypervisor needs to be located somewhere in memory and be inaccessible from the guests -- - **Performance impact of guest-host transitions** - Some sensitive instructions can be very frequent e.g. system calls ??? - The traps representing guest-host transitions are frequent and costly, leading to significant performance slowdowns -- - etc. ??? - Among other challenges --- # VT-x Overview - Key idea: don't address separately all of x86's aspects hindering virtualisation - E.g. don't change the semantics of individual instructions ??? - The key design idea behind VT-x, which is x86-64's hardware support for CPU virtualisation, was to propose a holistic solution - And not to address separately all the issues x86-32 had that made it hard to virtualise - For example, changing the semantics of individual instructions such as POPF would be bad for backward compatibility reasons -- - **VT-x duplicates the entire state of the CPU** into 2 modes of execution: - **Root mode** for hypervisor/host OS - **Non-root mode** for VMs ??? - x86-64's designers instead addressed all issues by introducing a new mode of execution - The entire state of the CPU is duplicated into 2 modes - Root mode for running the hypervisor and host operating system's code - And non-root mode for virtual machines' code --- # VT-x Overview
??? - This is illustrated here, we have one machine with 1 hypervisor and host OS running in root mode in ring 0, and host-level applications running in root mode in ring 3 - Then we have 2 VMs, each running a guest OS in non-root mode ring 0, and guest applications in non-root mode ring 3 --- # VT-x's Root/Non-Root Modes - At any point in time **CPU is either in root or non-root mode** ??? - So at any point in time the CPU is either in root or in non-root mode -- - **Protection rings are orthogonal to root mode** and available in both root/non-root ??? - Privilege levels are orthogonal to root/non-root modes and are available in both -- - **Each mode has its own address space**, switched atomically as part of the transition - Including TLB content ??? - Each mode has its own address space which is switched automatically upon transitions - Including virtual memory translation caches - This allows the hypervisor and other host-level software to be well isolated from the guest software --- # VT-x and the P&G Criteria - **Equivalence** - Absolute architectural compatibility between virtual and actual hardware - Backwards compatible with legacy x86-32 and x86-64 ISAs ??? - Remember the key objectives for a proper hypervisor that we listed in the previous video - In terms of equivalence, the state of the virtualised CPU exposed by VT-x in non-root mode to VMs is an exact duplicate of the physical CPU state - Guests can run x86-64 code but are also backward compatible with x86-32 -- - **Safety** - With architectural support, hypervisor codebase is much simpler - Reduced attack surface vs. DBT/PV solutions that need to maintain complex invariants ??? - In terms of safety, the isolation of the hypervisor from the guest, and the inter-guest isolation, are hardware-enforced and quite strong - The hardware support for virtualisation makes the hypervisor codebase much simpler compared to past software attempts at virtualising x86-32 - Running less code translates into a smaller attack surface for the virtualisation layer -- - **Performance** - Not a primary goal at first - First generation VT-x CPUs were slower than state-of-the-art DBT ??? - Regarding performance, it was not actually a priority goal at first - In fact the first generation of VT-x processors was slower than existing, more mature, attempts at virtualising x86-32 in software --- # VT-x and the P&G Theorem - **With root/non-root modes, the P&G requirement becomes:** .medium[ > When executed in non-root mode, all sensitive instructions must either 1) **cause a trap** or 2) be implemented by the CPU and **operate on the non-root duplicate of the CPU state** ] ??? - With x86-64's root and non-root modes, we can rework Popek and Goldberg's theorem as follows - When executed in non-root mode, all sensitive instructions must either cause a trap or be implemented in hardware by the CPU and operate on the non-root duplicate of the CPU state -- - Having each sensitive instruction trap to the VMM in root mode would satisfy the equivalence and safety criteria... ??? - As we saw before we can have every sensitive instruction trap to root mode -- - ... but **frequent guest ⇔ VMM transitions need to be avoided for performance** - These transitions are costly: 1000s of cycles - Need to implement some sensitive instructions in hardware - Trade-off: hardware complexity vs. performance ??? 
- However these traps are very costly, and we can't have them be too frequent - Ideally we want as few traps as possible to keep performance close to native execution - Obviously managing the virtualisation of more sensitive instructions in hardware means implementing more logic in the CPU, so there is a trade-off between hardware complexity and cost vs. performance here --- # VT-x: Root/Non-Root Transitions .left3col[
] ??? - Let's see briefly how VT-x manages transitions between root and non-root mode - Assume the hypervisor is running in root mode at first --- # VT-x: Root/Non-Root Transitions .left3col[
] .right3col[.medium[ - VMM starts/resumes VM with `VMLAUNCH` and `VMRESUME` ]] ??? - The hypervisor can start and resume a VM with the `VMLAUNCH` and `VMRESUME` instructions - The CPU switches to non-root mode and starts to run the guest --- # VT-x: Root/Non-Root Transitions .left3col[.medium[
]] .right3col[.medium[ - VMM starts/resumes VM with `VMLAUNCH` and `VMRESUME` - Involuntary (traps) or voluntary (`VMCALL`) transitions to the VMM are called vmexits ]] ??? - In the other direction, transitions from the VM to the hypervisor are called vmexits - The VM will transition to the hypervisor following a trap or an explicit request to switch to the hypervisor with the `VMCALL` instruction - In these cases the CPU switches from non-root to root mode and starts running hypervisor code so it can handle the trap --- # VT-x: Root/Non-Root Transitions .left3col[.medium[
]] .right3col[.medium[ - VMM starts/resumes VM with `VMLAUNCH` and `VMRESUME` - Involuntary (traps) or voluntary (`VMCALL`) transitions to the VMM are called vmexits - VM state in **VM Control Structure** - Can be accessed with `VMREAD`/`VMWRITE` ]] ??? - When there is a VMEXIT, the CPU maintains a data structure containing information about the guest - For example the reason for the VMEXIT - It is called the virtual machine control structure, VMCS - And the hypervisor must use specific instructions to access it --- # VT-x: Vmexit Categories .medium[ | Category | Description | | -------- | ----------- | | **Exception** | Guest instruction caused an exception (e.g. division by 0) | | **Interrupt** | Interrupt from I/O device received during guest execution | | **Triple fault** | Guest triple faulted | | **Root-mode sensitive** | x86 privileged/sensitive instructions | | **Hypercall** | Explicit call to hypervisor through `VMCALL` | | **I/O** | x86 I/O instructions e.g. `IN`/`OUT` | | **EPT** | Memory virtualisation violations/misconfigurations | | **Legacy emulation** | Instruction not implemented in non-root mode | | **VT-x new** | ISA extension to control non-root execution (`VMRESUME`, etc.) | ] ??? - You have the list of categories of VMEXIT reasons here - VMEXITs happen when the guest faults or invokes a system call, these are software exceptions, - when an interrupt is received from an I/O device - when the guest triple faults, or when it executes a sensitive instruction - The guest can also voluntarily trigger a VMEXIT, this is called a hypercall - A hypercall is to a hypervisor what a system call is to an operating system - You also have other categories such as I/O instructions, memory virtualisation VMEXITs, other instructions that need to be emulated, and the VT-x instructions themselves --- # KVM: Introduction .left3col[ - **Type 2 hypervisor** built from the ground up assuming hardware support for virtualisation - **Module part of the Linux kernel**, lets a host user space program create and manage VMs - Generally used with **Qemu** to emulate I/O devices - Arguably today the most popular hypervisor in the world ] .right3col[
] ??? - Just a few words about KVM, which is a hypervisor integrated into the Linux kernel and leveraging VT-x on x86-64 - KVM stands for Kernel-based Virtual Machine - It's a type 2 hypervisor designed within Linux from the ground up assuming hardware support for virtualisation, like VT-x for x86-64 and equivalent technologies for the other modern ISAs --- # KVM: Introduction .left3col[ - **Type 2 hypervisor** built from the ground up assuming hardware support for virtualisation - **Module part of the Linux kernel**, lets a host user space program create and manage VMs - Generally used with **Qemu** to emulate I/O devices - Arguably today the most popular hypervisor in the world ] .right3col[
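A minimal sketch of how such a host user space program creates and runs a VM through KVM's `/dev/kvm` ioctl interface (guest memory and register setup, as well as error handling, are omitted for brevity):

```c
#include <fcntl.h>
#include <linux/kvm.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
    /* The KVM module exposes itself to user space as /dev/kvm. */
    int kvm = open("/dev/kvm", O_RDWR);
    int vm = ioctl(kvm, KVM_CREATE_VM, 0);      /* one file descriptor per VM   */
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);   /* one file descriptor per vCPU */

    /* Shared structure describing why the last vmexit happened. */
    int size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    /* Guest memory and register setup omitted. */
    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);                /* KVM enters non-root mode         */
        switch (run->exit_reason) {             /* back in root mode after a vmexit */
        case KVM_EXIT_IO:                       /* I/O instruction from the guest   */
        case KVM_EXIT_MMIO:
            break;                              /* device emulation is Qemu's job   */
        case KVM_EXIT_HLT:
            return 0;                           /* guest executed HLT               */
        }
    }
}
```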
] ??? - KVM is a module that is part of the Linux kernel code, so it lives in kernel space - KVM partially manages virtual machines, by doing things like handling traps, maintaining the virtual machine control structure, etc. - Still, KVM must also rely on a user space program to handle other virtual machine management tasks, in particular resource allocation - That user space program is very often Qemu - Qemu is originally a machine emulator, but CPU and memory emulation can be disabled when running on top of KVM, because they are handled by VT-x and the memory virtualisation technology we will cover very soon: this makes things much faster, close to native performance - The KVM + Qemu combination is arguably the most popular hypervisor today --- class: center, middle, inverse # x86-64 MMU Virtualisation
with EPT ??? - We've covered the CPU, let's now talk about hardware-assisted memory virtualisation for x86-64 --- ## VT-x without Hardware MMU Virtualisation - **Disjoint page tables for root and non-root mode** (`%cr3` register) - Equivalence benefits: no need to locate hypervisor in the guest address space and to protect it with segmentation ??? - The first iterations of x86-64 did not have support for hardware-assisted memory virtualisation, only VT-x for the CPU - They provided disjoint page tables for root and non-root modes, which made it easy to isolate the hypervisor from the guest by making sure the guest cannot map it -- - **Guest page table updates still need to be validated/controlled by the VMM** - Shadow paging accounts for over 90% of vmexits, VT-x slower than software virtualisation - Paravirtualisation: loss of equivalence ??? - Because of that, every update the guest makes to its page tables needs to trap to the hypervisor to be validated, to make sure the guest does not try to map something it should not have access to - This is called shadow paging, which makes things very slow - Another option is paravirtualisation, to modify the guest not to update page tables directly, but rather to request the hypervisor to do so in a controlled fashion - As we saw, paravirtualisation breaks equivalence so that solution is not ideal either -- - **Hardware support for MMU virtualisation: Extended Page Tables** .small[R. Bhargava et al., *Accelerating Two-Dimensional Page Walks for Virtualized Systems*, ASPLOS'08] ??? - So there was a need for hardware support for memory virtualisation, similar to what VT-x does for the CPU - The technology for memory virtualisation is called extended page tables and was presented in this seminal paper in 2008
- **Guest OS maintains page tables normally** - 1 per process, mapping guest virtual to guest pseudo-physical addresses ??? - With extended page tables the guest OS maintains its page tables normally - It can update them freely without traps to the hypervisor - There is one page table per guest process, and it maps guest virtual to guest pseudo-physical addresses --- # Extended Page Tables
- **Second level of page table, EPT maintained by the hypervisor** - 1 per VM, maps guest physical to host physical addresses ??? - The key idea behind extended page tables is to add a second level of address translation, the extended page table - There is one extended page table per VM, and it maps guest pseudo-physical addresses to host physical addresses - The hypervisor is in total control of these extended page tables, hence it can make sure the guest OSes map only the memory they can access --- # Extended Page Tables
- **TLB caches guest virtual to host physical translation** - Hits lead to direct translation i.e. native performance - TLB hit rate in modern CPUs > 95% ??? - Having to walk 2 levels of page tables is concerning from a performance point of view - However the processor's translation caches, the translation lookaside buffers, directly cache the guest virtual to host physical mapping - Given that the TLB hit rate is above 95% in modern CPUs, there is no need to walk two levels of page tables for the majority of VM memory accesses --- # Extended Page Tables
- **TLB miss requires 2D page walk** - Going through guest PT + VM's EPT ??? - But if there is a TLB miss, then these two levels must be walked: the guest's page table, and the extended page table --- name: pagewalk # EPT Walk **Standard (non-virtualised) page table walk upon TLB miss**:
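Below is a minimal C sketch of that walk, assuming 4-level 48-bit x86-64 paging with 4 KiB pages; `read_phys()` is a hypothetical helper standing in for the MMU's loads from physical memory:

```c
#include <stdint.h>

/* Hypothetical helper: load one 64-bit page table entry from physical memory. */
extern uint64_t read_phys(uint64_t paddr);

#define PRESENT   0x1ULL
#define ADDR_MASK 0x000FFFFFFFFFF000ULL   /* bits 51:12: next-level physical address */

/* 4-level walk: %cr3 -> level 4 -> 3 -> 2 -> 1 -> data page (4 KiB pages). */
uint64_t translate(uint64_t cr3, uint64_t vaddr)
{
    uint64_t table = cr3 & ADDR_MASK;
    for (int level = 4; level >= 1; level--) {
        /* Each level is indexed by 9 bits of the virtual address:
           bits 47:39 for level 4, down to bits 20:12 for level 1. */
        uint64_t index = (vaddr >> (12 + 9 * (level - 1))) & 0x1FF;
        uint64_t entry = read_phys(table + index * 8);   /* 1 memory access per level */
        if (!(entry & PRESENT))
            return 0;                                    /* page fault in reality */
        table = entry & ADDR_MASK;
    }
    /* 'table' now holds the physical address of the data page: add the page offset. */
    return table | (vaddr & 0xFFF);
}
```

One load per level, i.e. 4 memory accesses, before the data page itself can be touched.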
??? - Recall how a standard page table walk is done by the MMU - The page table is rooted in the `%cr3` register - Different parts of the target virtual address will index each level of the page table - Until the target data page is found - An offset from the beginning of that page is added based on the least significant bits of the address, to find the target byte to load or store --- # EPT Walk
??? - When running virtualised, address translation must walk both the guest page table and the extended page table - Things work as follows - We have the virtual address targeted by the guest --- # EPT Walk
??? - The guest page table is rooted in `%cr3` - It contains a guest pseudo-physical address, so we first need to translate it into a host physical address - So we need to walk the extended page table to figure out which physical page contains the root of the page table, which is level 4 --- # EPT Walk
??? - Once we have found it, it can be indexed with the most significant bits of the target address, which gives us the address of the next level page (level 3) - However that address is a guest pseudo-physical address, and we need to similarly transform it into a host physical one, so we walk the extended page table again --- # EPT Walk
??? - Rinse and repeat to find the level 2 page --- # EPT Walk
??? - And we do things again to find the level 1 page --- # EPT Walk
??? - And again to find the target data page --- # EPT Walk
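To make the cost explicit, here is a minimal sketch of the 2D walk, extending the standard walk sketched earlier; `read_phys()` and `ept_walk()` (a full 4-level EPT lookup, i.e. 4 memory accesses) are illustrative helpers, not real hardware interfaces:

```c
#include <stdint.h>

extern uint64_t read_phys(uint64_t host_paddr);             /* as in the previous sketch    */
extern uint64_t ept_walk(uint64_t eptp, uint64_t guest_pa); /* 4-level EPT walk: 4 accesses */

#define ADDR_MASK 0x000FFFFFFFFFF000ULL

/* 2D translation of a guest virtual address: every guest pseudo-physical pointer
   (guest %cr3, then each guest page table entry) must itself be translated
   through the EPT before it can be dereferenced. */
uint64_t translate_2d(uint64_t eptp, uint64_t guest_cr3, uint64_t gva)
{
    uint64_t guest_table = guest_cr3 & ADDR_MASK;           /* guest pseudo-physical */
    for (int level = 4; level >= 1; level--) {
        uint64_t index = (gva >> (12 + 9 * (level - 1))) & 0x1FF;
        uint64_t host_table = ept_walk(eptp, guest_table);  /* 4 memory accesses */
        uint64_t entry = read_phys(host_table + index * 8); /* 1 memory access   */
        guest_table = entry & ADDR_MASK;                    /* still guest pseudo-physical */
    }
    /* Translate the final guest pseudo-physical data address through the EPT. */
    return ept_walk(eptp, guest_table | (gva & 0xFFF));     /* 4 memory accesses */
}
/* Total: 4 levels x (4 + 1) + 4 = 24 memory accesses, vs. 4 for a native walk. */
```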
.medium[24 memory accesses vs. 4 for a standard page table walk] ??? - Which can finally be indexed by the least significant bits of the target guest virtual address to find the byte the guest wants to load or store - In the end, to walk the 2D page table, we had to do 24 memory accesses to load or store a single byte - That is compared to 4 memory accesses to walk a standard page table when we run non-virtualised - That's a very high overhead, but remember that 95% of guest memory accesses don't need to go through this because they are hits in the TLB translation cache --- # Memory Virtualisation in KVM
??? - So KVM of course makes use of extended page tables to manage VMs' memory - Things work as follows - The guest manages its own page tables, one per guest process, with minimal intervention from KVM --- # Memory Virtualisation in KVM
??? - Qemu lives in the host user space as a regular process - Like every other process it has its own virtual address space --- # Memory Virtualisation in KVM
??? - Qemu makes a big call to malloc to allocate a large contiguous buffer that will be the guest's pseudo-physical memory --- # Memory Virtualisation in KVM
??? - In the host kernel lives the KVM module --- # Memory Virtualisation in KVM
??? - It sets up and manages the extended page table that maps the guest pseudo-physical addresses to the host physical memory --- # Memory Virtualisation in KVM
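A minimal sketch of how this is set up through the KVM API, assuming `vm` is the file descriptor returned by `KVM_CREATE_VM` (as in the earlier KVM sketch); an anonymous `mmap` stands in for Qemu's `malloc`, and error handling is mostly omitted:

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Back 256 MiB of guest "pseudo-physical" memory with an ordinary anonymous
   mapping in the VMM's (e.g. Qemu's) own virtual address space. */
int setup_guest_memory(int vm)
{
    size_t size = 256 << 20;
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    struct kvm_userspace_memory_region region = {
        .slot            = 0,
        .guest_phys_addr = 0,                /* starts at guest physical address 0 */
        .memory_size     = size,
        .userspace_addr  = (uintptr_t)mem,   /* where it lives in the VMM process  */
    };
    /* KVM records the slot and fills in the EPT lazily, on EPT-violation vmexits. */
    if (ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region) < 0)
        return -1;

    /* The VMM can read and write guest memory directly through 'mem'; its own
       host page table performs the translation, like for any other process. */
    memset(mem, 0, size);
    return 0;
}
```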
??? - Sometimes the Qemu process needs to read and write the VM's memory too - For example when virtualising I/O as we will see next - For that it can just read and write in that large area of virtual memory it allocated for the VM - And the page table of Qemu on the host will be used for the translation like for any other host process --- # Wrapping Up - Hardware support for CPU and memory virtualisation in x86-64 achieved through 2 core technologies: ??? - To sum up, we talked about hardware support for virtualisation in x86-64, with 2 core technologies -- - **VT-x for CPU virtualisation** - Duplicate the OS-visible state of the CPU into **root** and **non-root** modes, make sure every sensitive instruction traps from non-root to root - KVM in Linux: split model with part of the hypervisor in root mode in the kernel handling most traps, and a user-space program (e.g. Qemu) handling I/O ??? - First, VT-x for CPU virtualisation - It duplicates the CPU state between root mode for the hypervisor and host software, and non-root mode for the guest software -- - **EPT for memory virtualisation** - **Second level of page table** handling guest physical to host physical translation: 2D walk on TLB miss ??? - Second, extended page tables for memory virtualisation - They represent a second level of page tables mapping guest pseudo-physical addresses to host physical ones, letting the guest manage its own page tables without intervention from the hypervisor - We also saw how KVM in the Linux kernel manages these technologies in conjunction with a user space virtual machine monitor such as Qemu