class: center, middle
background-image: url(include/title-background.svg)

# .lift[.right[Virtualization 101]]

### .right[Hardware Support for Virtualisation: CPU & Memory]

.right[Pierre Olivier
.right[[pierre.olivier@manchester.ac.uk](mailto:pierre.olivier@manchester.ac.uk)]
]

---

# Introduction

- **Intel x86-32 is not virtualisable according to Popek and Goldberg (P&G)**
    - Need to make concessions on performance (dynamic binary translation) or equivalence (paravirtualisation)

--

- One of the fundamental design goals of x86-64 was **architectural support for virtualisation**, achieved with 3 key technologies:
    - CPU virtualisation with Intel **Virtualisation Technology (VT-x)**
    - Memory virtualisation with **Extended Page Tables (EPT)**
    - I/O virtualisation with Intel **Virtualisation Technology for Directed I/O (VT-d)**

--

Focus here on Intel; AMD has very similar technologies

---
class: center, inverse, middle

# x86-64 CPU Virtualisation
with VT-x

---

# x86-64 CPU Virtualisation

Software techniques for virtualising x86 face the following challenges:

- **Protection ring (privilege level) aliasing and compression**
    - Guest OS runs in a different ring than the one it was written for

--

- **Address space compression**
    - Need to locate the hypervisor somewhere in the guest address space and protect it

--

- **Performance impact of guest-host transitions**
    - Some sensitive instructions can be very frequent, e.g. system calls

--

- etc.

---

# VT-x Overview

- Don't change the semantics of individual instructions
- Don't address separately each aspect of x86 that hinders virtualisation

--

- **VT-x duplicates the entire state of the CPU** into 2 modes of execution:
    - **Root mode** for hypervisor/host OS
    - **Non-root mode** for VMs

---

# VT-x Overview
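A toy model of the duplicated CPU state, as a sketch only: the `CpuState` fields and the `ToyCpu` class are hypothetical names chosen for illustration and do not reflect the real hardware layout. It shows that state changed in non-root mode leaves the root copy untouched.

```python
from dataclasses import dataclass


@dataclass
class CpuState:
    """Hypothetical subset of the per-mode CPU state (illustration only)."""
    cr3: int = 0                 # page table root, i.e. the address space
    interrupt_flag: bool = True  # each mode has its own copy
    ring: int = 0                # protection rings exist in both modes


class ToyCpu:
    """Toy CPU holding one copy of the state per mode."""

    def __init__(self):
        self.state = {"root": CpuState(), "non-root": CpuState()}
        self.mode = "root"

    @property
    def current(self):
        """State copy used by the mode the CPU is currently in."""
        return self.state[self.mode]

    def vmentry(self):
        """Root -> non-root transition (VM start/resume)."""
        if self.mode != "root":
            raise RuntimeError("vmentry is only possible from root mode")
        self.mode = "non-root"

    def vmexit(self):
        """Non-root -> root transition (trap to the hypervisor)."""
        if self.mode != "non-root":
            raise RuntimeError("vmexit is only possible from non-root mode")
        self.mode = "root"


cpu = ToyCpu()
cpu.current.cr3 = 0x1000             # hypervisor address space
cpu.vmentry()
cpu.current.cr3 = 0x8000             # guest installs its own page tables
cpu.current.interrupt_flag = False   # guest disables its own interrupts
cpu.vmexit()
print(cpu.mode, hex(cpu.current.cr3), cpu.current.interrupt_flag)
```

Back in root mode, the hypervisor still sees its own `cr3` and interrupt flag; the guest's changes only affected the non-root copy.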
---

# VT-x Root/Non-Root Modes

- At any point in time the **CPU is either in root or non-root mode**

--

- **Protection rings are orthogonal to root mode** and available in both root and non-root modes

--

- **Each mode has its own address space**, switched atomically as part of the transition
    - Including TLB content

--

- **Each mode has its own interrupt flag**
    - Non-root can manipulate it freely
    - Interrupts are delivered in root mode, following a transition if needed

---

# VT-x and the P&G Criteria

- **Equivalence**
    - Absolute architectural compatibility between virtual and actual hardware
    - Backwards compatible with legacy x86-32 and x86-64 ISAs

--

- **Safety**
    - With architectural support, the hypervisor codebase is much simpler
    - Reduced attack surface vs. DBT/PV solutions that need to maintain complex invariants

--

- **Performance**
    - Not a primary goal at first
    - First generation VT-x CPUs were slower than state-of-the-art DBT

---

# VT-x and the P&G Theorem

- **With root/non-root modes, the P&G requirement becomes:**

.medium[
> When executed in non-root mode, all sensitive instructions must either 1) **cause a trap** or 2) be implemented by the CPU and **operate on the non-root duplicate of the CPU state**
]

--

- Having each sensitive instruction trap to the VMM in root mode would satisfy the equivalence and safety criteria...

--

- ... but **frequent guest ⇔ VMM transitions need to be avoided for performance**
    - These transitions are costly: thousands of cycles
    - Need to implement some sensitive instructions in hardware
    - Tradeoff: hardware complexity vs. performance

---

# VT-x: Root/Non-Root Transitions

.left3col[
]

---

# VT-x: Root/Non-Root Transitions

.left3col[
]

.right3col[.medium[
- VMM starts a VM with `VMLAUNCH` and resumes it with `VMRESUME`
]]

---

# VT-x: Root/Non-Root Transitions

.left3col[.medium[
]]

.right3col[.medium[
- VMM starts a VM with `VMLAUNCH` and resumes it with `VMRESUME`
- Transitions to root mode, whether involuntary (traps) or voluntary (`VMCALL`), are called vmexits
]]

---

# VT-x: Root/Non-Root Transitions

.left3col[.medium[
]]

.right3col[.medium[
- VMM starts a VM with `VMLAUNCH` and resumes it with `VMRESUME`
- Transitions to root mode, whether involuntary (traps) or voluntary (`VMCALL`), are called vmexits
- VM state is held in the **Virtual Machine Control Structure (VMCS)**
    - Can be accessed with `VMREAD`/`VMWRITE`
]]

---

# VT-x: Vmexit Categories

.medium[
| Category | Description |
| -------- | ----------- |
| **Exception** | Guest instruction caused an exception (e.g. division by 0) |
| **Interrupt** | Interrupt from I/O device received during guest execution |
| **Triple fault** | Guest triple faulted |
| **Root-mode sensitive** | x86 privileged/sensitive instructions |
| **Hypercall** | Explicit call to hypervisor through `VMCALL` |
| **I/O** | x86 I/O instructions e.g. `IN`/`OUT` |
| **EPT** | Memory virtualisation violations/misconfigurations |
| **Legacy emulation** | Instruction not implemented in non-root mode |
| **VT-x new** | ISA extension to control non-root execution (`VMRESUME`, etc.) |
]

---
class: center, middle, inverse

# KVM: A Hypervisor for VT-x

---

# KVM: Introduction

.left3col[
- **Type 2 hypervisor** built from the ground up assuming hardware support for virtualisation
- **Module part of the Linux kernel**, lets a host user space program create and manage VMs
- Generally used with **Qemu** to emulate I/O devices
- Arguably today the most popular hypervisor in the world
]

.right3col[
]

---

# KVM: Introduction

.left3col[
- **Type 2 hypervisor** built from the ground up assuming hardware support for virtualisation
- **Module part of the Linux kernel**, lets a host user space program create and manage VMs
- Generally used with **Qemu** to emulate I/O devices
- Arguably today the most popular hypervisor in the world
]

.right3col[
]

---

# KVM: P&G Criteria

.left3col[
- **Equivalence**: KVM can run arbitrary x86-64 and x86-32 guest OSes and applications
- **Safety**: KVM virtualises all the hardware resources: CPU, memory and I/O
    - KVM stays in control of the hardware at all times, even with a malicious guest OS
- **Performance**: KVM handles all performance-critical tasks in the kernel module
    - Leaves the rest to host kernel/userland software
]

.right3col[
]

---

# KVM: Hypervisor Operation Loop

.left3col[
]

---

# KVM: Hypervisor Operation Loop

.left3col[
]

.right3col[.medium[
- Many traps can be handled simply by looking at the VMCS:
    - Emulate
    - Inject fault/interrupt
    - Change environment and retry (EPT)
    - Do nothing
]]

---

# KVM: Hypervisor Operation Loop

.left3col[
]

.right3col[.medium[
- In some cases the VMCS content is not enough
- Need a general purpose x86 instruction decoder and emulator
    1. Fetch instruction from guest memory
    2. Decode it
    3. Verify it
    4. Read memory operands
    5. Emulate it
    6. Write operands
    7. Update guest registers and PC
]]

---

# KVM: Hypervisor Operation Loop

.left3col[
]

.right3col[.medium[
- I/O emulation is handled by the user space part of the hypervisor (e.g. Qemu)
]]

---
class: center, middle, inverse

# x86-64 MMU Virtualisation
with EPT

---

## VT-x without Hardware MMU Virtualisation

- **Disjoint page tables for root and non-root mode** (`%cr3` register)
    - Equivalence benefits: no need to locate the hypervisor in the guest address space and to protect it with segmentation

--

- **Guest page table updates still need to be validated/controlled by the VMM**
    - Shadow paging: accounts for over 90% of vmexits, making VT-x slower than software virtualisation
    - Paravirtualisation: loss of equivalence

--

- **Hardware support for MMU virtualisation: Extended Page Tables**

.small[R. Bhargava et al., *Accelerating Two-Dimensional Page Walks for Virtualized Systems*, ASPLOS'08]

---

# Extended Page Tables
- **Guest OS maintains page tables normally**
    - 1 per process, mapping guest virtual to guest pseudo-physical addresses

---

# Extended Page Tables

- **Second level of page tables, the EPT, maintained by the hypervisor**
    - 1 per VM, mapping guest physical to host physical addresses

---

# Extended Page Tables

- **TLB caches guest virtual to host physical translations**
    - Hits lead to direct translation, i.e. native performance
    - TLB hit rate in modern CPUs > 95%

---

# Extended Page Tables

- **TLB miss requires a 2D page walk**
    - Going through the guest page tables + the VM's EPT

---
name: pagewalk

# EPT Walk

**Standard (non-virtualised) page table walk upon TLB miss**:

---
template: pagewalk
---
template: pagewalk

---
template: pagewalk

---
template: pagewalk

---
template: pagewalk

---
template: pagewalk

---
template: pagewalk
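A minimal sketch of a standard walk, assuming 4-level paging with 4 KiB pages and 48-bit virtual addresses. Page tables are modelled as plain dictionaries, permission bits and large pages are ignored, and the names `split_virtual_address` and `walk` are hypothetical.

```python
PAGE_SHIFT = 12   # 4 KiB pages: low 12 bits are the page offset
INDEX_BITS = 9    # 512 entries per table, so 9 index bits per level
LEVELS = 4        # PML4, PDPT, PD, PT


def split_virtual_address(va):
    """Return ([PML4, PDPT, PD, PT] indices, page offset) for a 48-bit address."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    indices = []
    for level in range(LEVELS - 1, -1, -1):
        shift = PAGE_SHIFT + level * INDEX_BITS
        indices.append((va >> shift) & ((1 << INDEX_BITS) - 1))
    return indices, offset


def walk(memory, root, va):
    """Walk toy page tables and return (physical address, memory accesses).

    `memory` maps the address of a table to a dict {index: next address}.
    """
    indices, offset = split_virtual_address(va)
    table, accesses = root, 0
    for index in indices:
        table = memory[table][index]  # one memory access per level
        accesses += 1
    return table + offset, accesses


# Toy page tables mapping a single page; the root (CR3) is 0x1000.
memory = {
    0x1000: {1: 0x2000},  # PML4
    0x2000: {2: 0x3000},  # PDPT
    0x3000: {3: 0x4000},  # PD
    0x4000: {4: 0x9000},  # PT, pointing at the physical page
}
va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5
pa, accesses = walk(memory, 0x1000, va)
print(hex(pa), accesses)  # 0x9005 4
```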
---

# EPT Walk

---

# EPT Walk

---

# EPT Walk

---

# EPT Walk

---

# EPT Walk

---

# EPT Walk

---

# EPT Walk
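A minimal sketch of where the cost of a two-dimensional walk comes from, assuming 4-level guest page tables, a 4-level EPT, and no caching of intermediate translations. Every guest page table pointer (starting with the guest `%cr3`) is a guest physical address, so it needs a full EPT walk before the guest entry itself can be read. The function name `nested_walk_accesses` is hypothetical.

```python
def nested_walk_accesses(guest_levels=4, ept_levels=4):
    """Count memory accesses for a 2D page walk on a TLB miss."""
    accesses = 0
    for _ in range(guest_levels):
        accesses += ept_levels  # EPT walk to locate this guest page table
        accesses += 1           # read the guest page table entry itself
    accesses += ept_levels      # EPT walk for the final guest physical address
    return accesses


print(nested_walk_accesses())      # 24
print(nested_walk_accesses(4, 0))  # 4: no second level, i.e. a native walk
```

With `n` guest levels and `m` EPT levels this amounts to `(n + 1) * (m + 1) - 1` accesses.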
.medium[24 memory accesses vs. 4 for a standard page table walk]

---

# Memory Virtualisation in KVM

---

# Memory Virtualisation in KVM

---

# Memory Virtualisation in KVM

---

# Memory Virtualisation in KVM

---

# Memory Virtualisation in KVM

---

# Memory Virtualisation in KVM
---
class: inverse, middle, center

# Wrapping Up

---

# Wrapping Up

- Hardware support for CPU and memory virtualisation in x86-64 is achieved through 2 core technologies:

--

- **VT-x for CPU virtualisation**
    - Duplicate the OS-visible state of the CPU into **root** and **non-root** modes; make sure every sensitive instruction either traps from non-root to root or operates on the non-root copy of the state
    - KVM in Linux: split model with part of the hypervisor in root mode in the kernel handling most traps, and a user-space program (e.g. Qemu) handling I/O

--

- **EPT for memory virtualisation**
    - **Second level of page tables** handling guest physical to host physical translation: 2D walk on TLB miss
    - KVM maintains the EPT in root mode