class: center, middle background-image: url(include/title-background.svg) # .right[Virtualisation 101] ## .right[The Theory of Virtualisation] .right[Pierre Olivier .right[[pierre.olivier@manchester.ac.uk](mailto:pierre.olivier@manchester.ac.uk)] ] --- class: inverse, center, middle # Introduction --- # The Popek and Goldberg Theorem - **Seminal paper published in 1974 in Communications of the ACM** .leftcol[ .medium[ > Popek, Gerald J., and Robert P. Goldberg. **"Formal requirements for virtualisable third generation architectures."** Communications of the ACM 17.7 (1974): 412-421. ]] .rightcol[
] --- # The Popek and Goldberg Theorem - **Paper published in 1974 in Communications of the ACM** - Original idea of the paper: **show that some contemporary architectures were not virtualisable** - DEC PDP-10 taken as a case study -- - To that aim: - The paper describes the **criteria for a proper virtual machine monitor: *safety*, *efficiency*, *equivalence*** -- - It also defines accordingly the **requirements for an ISA to be virtualisable** - ISA: Instruction Set Architecture (x86-64, x86-32, aarch64, etc.) - Virtualisable: a VMM can be constructed on that ISA in a way that satisfy the aforementioned criteria --- # The Popek and Goldberg Theorem - Lack of popularity for virtualisation at the time the paper was published (70s) -- - **Later, VMs become popular (end of 90s)** - **x86-32 (most popular ISA at the time) not virtualisable!** - In this paper P&G describe an ISA's requirements to support a VMM that itself support arbitrary guests, relying exclusively on direct execution -- - This is now a **seminal paper on virtualisation** - AMD & Intel explicitly designed AMD-V and Intel VT-X for x86-64 in the 2000s to meet the criteria defined in the paper -- - More importantly, we’ll also learn through this theorem the **fundamental principles behind hypervisor operation on virtualisable ISAs** --- # The Popek and Goldberg Theorem - **We explain the theorem as follows:** - Explain P&G **simplified CPU model** - Simple hardware platform, but still representative of modern CPUs, as a support for the theorem -- - Explain **how a regular OS would run on that simplified CPU model** (without virtualisation) -- - Give the **theorem**: what characteristics an ISA needs to exhibit in order to be able to run a VMM and VMs -- - **Describe a VMM for that simplified CPU model** - Which properties it should satisfy to do its job properly - How it operates -- - Give some examples of **theorem violations** (ISAs not virtualisable) --- class: inverse, center, middle # Simplified CPU Model --- name: segmentation # Simplified CPU Model (1) - One processor core with **2 execution modes: user and supervisor (AKA kernel mode)** - **Physical memory** is contiguous, starts at `0`, size: `SZ` - **Virtual memory implemented through segmentation** (no paging): - Single segment: Base `B`, Limit `L` - Virtual range `[0, L[` mapped to physical range `[B, B + L[` --- template: segmentation
--- template: segmentation
--- template: segmentation
--- # Simplified CPU Model (2) - CPU state: **Processor Status Word** (PSW): (`M`, `B`, `L`, `PC`) - Execution level `M` = {`s`, `u`} (supervisor or user) - Segment register (`B`, `L`) (physical address of base, and length) - The current program counter: `PC`, virtual address - Virtual address of the instruction currently executed
--- name: psw # Simplified CPU Model (3) .leftcol[ .medium[ - Support for **saving PSW content in memory** `MEM[0]` and loading a new value from `MEM[1]` - Action of entering the OS following an interrupt (also called **trap**) - Support for **loading PSW content from a location in memory** - Exiting the OS after a trap processing - Examples of traps: hardware interrupts, exceptions (software faults, system calls, **privileged instructions**) ]] --- template: psw .rightcol[
] --- template: psw .rightcol[
] --- template: psw .rightcol[
] --- template: psw .rightcol[
] --- template: psw .rightcol[
] --- # Simplified CPU Model - **OS operation without a hypervisor:** 1. Kernel runs in `M` = `s`, apps in `M` = `u` -- 2. Kernel sets trap entry point during initialisation - `MEM[1] ← (M:s, B:0, L:SZ, PC:trap_entry_point)` -- 3. Kernel allocates a contiguous range of physical memory (`B`, `L`) for each application -- 4. Kernel launches/resumes apps with address space `[B, B+L[`, currently executing `PC`: - `PSW ← (M:u, B:B, L:L, PC:PC)` -- 5. At the trap entry point, kernel decodes the instruction `MEM[0].PC`, determines the cause of the trap and takes appropriate actions --- # Building the Hypervisor To build a hypervisor for that CPU, this is the research question asked by Popek & Goldberg: -- - **Given a computer defined according to the model, under which conditions can a hypervisor be constructed so that it:** - can execute one or more VMs; -- - is in complete control of the hardware at all times; -- - supports arbitrary, unmodified, and *potentially malicious* guest OSes designed for the same architecture; and -- - is efficient and show at worst a small performance decrease vs. non-virtualised execution? --- # Building the Hypervisor **To satisfy these requirements, the hypervisor needs to comply with these criteria**: -- 1. **Safety** - VMM in complete control of the hardware at all time - No assumption on guests, they can be malicious - VMM must enforce isolation between a VM and the VMM/hardware, and between VMs themselves (no shared state) -- 2. **Equivalence** - VM is a duplicate of the underlying physical machine - Guest OS and apps behave the same natively and in a VM - They run *unmodified* in the VM -- 3. **Performance** - Minimal decrease in a virtualised program execution speed --- class: inverse, center, middle # The Popek and Goldberg Theorem --- name:theorem # The Theorem - We want **any guest CPU instruction that may allow escaping isolation to be caught by the hypervisor** -- - Central idea behind constructing an efficient and secure hypervisor: - **run hypervisor in supervisor mode** and - **run guest apps *and guest OS* in user mode** -- - Not doable with every ISA - Theorem tells us how an ISA should behave for such an hypervisor to work --- # The Theorem - We can classify the instructions of an ISA into 2 categories: 1. **Sensitive instructions**, subdivided into: - **Control sensitive:** instruction that *updates the system state* - E.g. instructions modifying PSW in our example, `LGDT` on x86-32: load interrupt descriptor table -- - **Behaviour sensitive:** instruction whose *semantics depends on the value of the system state* (e.g. the privilege level) - E.g. `POPF` loads status register with data from the stack, works in supervisor mode but fails silently in user mode -- 2. **Innocuous instructions:** instruction that are not sensitive -- - We also define the concept of **privileged instructions** - Can only be executed in supervisor mode and traps when executed in user mode - E.g. `HLT` traps if `%cpl != 0` (x86-32's supervisor mode) --- # The Theorem - **Sensitive instructions**: update the system state or semantics depends on the value of the system state - **Privileged instructions:** work only in supervisor mode, trap in user mode Theorem: > **For a given ISA, a VMM may be constructed if the set of sensitive instructions for that ISA is a subset of the set of privileged instructions**, i.e. if
{control-sensitive} ∪ {behaviour-sensitive} ⊆ {privileged} --- # The Theorem
--- # The Theorem
--- # The Theorem **{control-sensitive} ∪ {behaviour-sensitive} ⊆ {privileged}** - **Converse holds too: if the criteria is not met, a VMM cannot be constructed for that architecture** -- - If a **control-sensitive** instruction does not trap, any guest can modify the system state without supervision/check from the VMM - E.g. a guest OS installing an arbitrary page table: **loss of safety** -- - If a **behaviour-sensitive** instruction does not trap, the behaviour of such instructions will differ from non-virtualised execution when executed by the guest OS - Remember that the guest OS runs in user mode when virtualised - Guest OS instruction executed with user-level semantics: **loss of equivalence** --- # Hypervisor Operation Under these conditions, VMM operates as follows: - For performance reasons we want to run as much guest code as possible directly on the CPU i.e. without involving (i.e. trapping to) the VMM - **VMM runs in supervisor mode, guest OS runs in user mode**
--- # Hypervisor Operation - **VMM allocates contiguous physical memory for himself, never accessible by guests** - **VMM allocates contiguous physical memory for VMs** - Each machine gets a range defined by `addr0` and `memsize`
--- # Hypervisor Operation - VMM keeps in memory the CPU state for each VM, `vPSW` - Consists of `(M, B, L, PC)` - `M`: execution mode the VM **thinks** it’s running on: `s` when guest OS runs and `u` when a guest app runs
--- # Hypervisor Operation .small[ - **VMM starts/resumes VM execution by loading the hardware `PSW ← (M’, B’, L’, PC’)`** - `M’ ← u` - `B’ ← addr0 + vPSW.B` - `L’ ← vPSW.L` - `PC’ ← vPSW.PC` ]
--- # Hypervisor Operation - Note that any attempt by the guest to modify `M`, `B` or `L` will trap - Theorem hypothesis assumes all control-sensitive instruction are also privileged **More generally, what happens upon a trap?** --- # Hypervisor Operation What happens upon a trap to the hypervisor? 1. Hypervisor update `vPSW.PC` ← `PSW.PC` 2. Hypervisor looks at the instruction that trapped (`PSW.PC`) and emulates its semantics according to the ISA - Actions taken here depend if guest trapped from user or kernel space --- # Hypervisor Operation - **If guest OS caused the trap** (`vPSW.M == s`): sensitive instruction trapped, VMM handles it -- - E.g. if guest OS tries to update the segment register, the VMM checks and update `vPSW.B` and `vPSW.L` - Hardware `PSW.L` and `PSW.B` will be set accordingly when we return to VM execution: `PSW.B` ← `addr0 + vPSW.B` and `PSW.L` ← `vPSW.L` - The MMU is **transparently configured differently than what the guest OS requests** -- - VMM ensures the VM will resume at the next instruction: `vPSW.PC++` - VMM loads PSW and VM resumes --- # Hypervisor Operation - **If guest application caused the trap** (`vPSW.B == u`): app is doing a syscall/something illegal: trap should be handled by the guest OS -- - `MEM[addr0]` ← `vPSW`, save guest application state in the host-physical location of guest-physical `MEM[0]` - `vPSW` ← `MEM[addr0 + 1]` load guest OS state (OS entry point) from memory, after a check on the validity of `B` and `L` - VMM loads PSW to resumes VM (in guest OS mode based on the updated vPSW)
--- # Hypervisor Operation .medium[ - **According to the theorem hypothesis, all instructions updating the system state (control-sensitive) are privileged, so they will trap** - Includes instructions updating the virtual to physical mapping - Each of these needs to be **checked** by the VMM to ensure **safety** (isolation) - Each of these needs to be **emulated** to give each VM the illusion of exclusive and full access to physical memory ] --- # Hypervisor Operation .medium[ - **User/supervisor transition instructions (e.g. syscalls) need to be tracked by the VMM** - To keep `M == u` at all times in the VM in to ensure safety: VMM in complete control at all times - To keep track of `vPSW.M` to correctly emulate privileged instruction (behaviour-sensitive) according to the current guest privileged level (guest-user or guest-supervisor) to ensure equivalence - Such transition instructions are sensitive so tracking is possible ] -- .medium[ - **Still according to the hypothesis, behaviour-sensitive instruction will also trap** - E.g. reading `PSW.M` or `PSW.B` - Remember than the actual values are set by the VMM to something different from what the guest OS think they are - Need to be emulated by the VMM otherwise this would lead to programs behaving differently on bare-metal vs virtualised: equivalence requirements ] --- class: inverse, center, middle # Theorem Violations --- # Theorem: Counter-Examples - **Control-sensitive unprivileged instructions** - Update to the system state that does not trap! - E.g. unprivileged switch from supervisor to user mode with `JRST1` “return to user” in DEC PDP-10 issued from supervisor mode -- - **Behaviour-sensitive unprivileged instructions** - In particular instructions reading the system state that do not trap, violates the equivalence criteria - E.g. `POPF` fails silently and does not trap in user mode in x86-32 -- - **Instructions bypassing virtual memory** - If they don’t trap, the VM directly access physical memory, possibly outside of the range allocated by the VMM --- # Virtualising Unvirtualisable ISAs .medium[ Techniques to virtualise ISAs that do not adhere to the P&G criteria: - **Dynamic binary translation** (VMWare workstation): when the guest OS runs, the VMM fully **emulates** the ISA - Emulation is much **slower** than direct execution ] -- .medium[ - **Shadow Page Table** (VMWare): VMM tricks the guest into thinking it can install its own page table, traps on every modification to install/maintain its own - Also makes things **slower** ] -- .medium[ - **Paravirtualisation** (Xen): guest OS slightly modified to call the VMM upon operation necessitating non-virtualisable instructions (e.g. page table update) - Need to **modify** guest OS ] -- Compromises can be made on the **performance** and **equivalence** principles (not on the **security**!) --- class: inverse, center, middle # Wrapping Up --- # Wrapping Up - P&G: can build an hypervisor offering performance/safety/equivalence by running it in user mode - Theorem: **all sensitive instructions need to trap in user mode for an ISA to be virtualisable** - Demonstrated how to operate an hypervisor under a simplified CPU model