Virtualization 101 - The Theory of Virtualization

class: center, middle
background-image: url(include/title-background.svg)
# .right[Virtualisation 101]
## .right[The Theory of  
Virtualisation]

.right[Pierre Olivier  
.right[[pierre.olivier@manchester.ac.uk](mailto:pierre.olivier@manchester.ac.uk)]
]
---
class: inverse, center, middle

# Introduction

---

# The Popek and Goldberg Theorem

- **Seminal paper published in 1974 in Communications of the ACM**

.leftcol[
.medium[
> Popek, Gerald J., and Robert P. Goldberg. **"Formal requirements for virtualisable third generation architectures."** Communications of the ACM 17.7 (1974): 412-421.
]]

.rightcol[
  <div style="text-align:center"><img src="include/popek.png" width=300 style="box-shadow: 8px 8px 5px grey; border: .1rem solid black;" /></div>
]

---

# The Popek and Goldberg Theorem

- **Paper published in 1974 in Communications of the ACM**
- Original idea of the paper: **show that some contemporary architectures were not virtualisable**
  - DEC PDP-10 taken as a case study
--

- To that aim:
  - The paper describes the **criteria for a proper virtual machine monitor: *safety*, *efficiency*, *equivalence***
--

- It also defines accordingly the **requirements for an ISA to be virtualisable**
      - ISA: Instruction Set Architecture (x86-64, x86-32, aarch64, etc.)
      - Virtualisable: a VMM can be constructed on that ISA in a way that satisfy the aforementioned criteria

---
# The Popek and Goldberg Theorem

- Lack of popularity for virtualisation at the time the paper was published
  (70s)
--

- **Later, VMs become popular (end of 90s)**
  - **x86-32 (most popular ISA at the time) not virtualisable!**
  - In this paper P&G describe an ISA's requirements to support a VMM that
    itself support arbitrary guests, relying exclusively on direct execution
--
- This is now a **seminal paper on virtualisation**
  - AMD & Intel explicitly designed AMD-V and Intel VT-X for x86-64 in the 2000s to meet
  the criteria defined in the paper
--

- More importantly, we’ll also learn through this theorem the
  **fundamental principles behind hypervisor operation on virtualisable ISAs**

---
# The Popek and Goldberg Theorem

- **We explain the theorem  as follows:**
  - Explain P&G **simplified CPU model**
      - Simple hardware platform, but still representative of modern CPUs,  as
      a support for the theorem
--
  - Explain **how a regular OS would run on that simplified CPU
    model** (without virtualisation)
--
  - Give the **theorem**: what characteristics an ISA needs to exhibit in order to
    be able to run a VMM and VMs
--
  - **Describe a VMM  for that simplified CPU model**
      - Which properties it should satisfy to do its job properly
      - How it operates
--
  - Give some examples of **theorem violations** (ISAs not virtualisable)

---
class: inverse, center, middle

# Simplified CPU Model

---
name: segmentation

# Simplified CPU Model (1)

- One processor core with **2 execution modes: user and supervisor (AKA kernel mode)**
- **Physical memory** is contiguous, starts at `0`, size: `SZ`
- **Virtual memory implemented through segmentation** (no paging):
  - Single segment: Base `B`, Limit `L`
  - Virtual range `[0, L[` mapped to physical range `[B, B + L[`

---
template: segmentation
<div style="text-align:center"><img src="include/segmentation1.svg" width=800 /></div>

---
template: segmentation
<div style="text-align:center"><img src="include/segmentation2.svg" width=800 /></div>

---
template: segmentation
<div style="text-align:center"><img src="include/segmentation3.svg" width=800 /></div>

---
# Simplified CPU Model (2)

- CPU state: **Processor Status Word** (PSW): (`M`, `B`, `L`, `PC`)
  - Execution level `M` = {`s`, `u`} (supervisor or user)
  - Segment register (`B`, `L`) (physical address of base, and length)
  - The current program counter: `PC`, virtual address
      - Virtual address of the instruction currently executed

---
name: psw
# Simplified CPU Model (3)

.leftcol[
.medium[
- Support for **saving PSW content in memory** `MEM[0]` and loading a
   new value from `MEM[1]`
  - Action of entering the OS following an interrupt (also called **trap**)
- Support for **loading PSW content from a location in memory**
  - Exiting the OS after a trap processing
- Examples of traps: hardware interrupts, exceptions (software faults,
  system calls, **privileged instructions**)
]]

---
template: psw
.rightcol[
<div style="text-align:center"><img src="include/trap1.svg" width=450 /></div>
]

---
template: psw
.rightcol[
<div style="text-align:center"><img src="include/trap2.svg" width=450 /></div>
]

---
template: psw
.rightcol[
<div style="text-align:center"><img src="include/trap3.svg" width=450 /></div>
]

---
template: psw
.rightcol[
<div style="text-align:center"><img src="include/trap4.svg" width=450 /></div>
]

---
template: psw
.rightcol[
<div style="text-align:center"><img src="include/trap5.svg" width=450 /></div>
]

---
# Simplified CPU Model

- **OS operation without a hypervisor:**
  1. Kernel runs in `M` = `s`, apps in `M` = `u`
--

2. Kernel sets trap entry point during initialisation
      - `MEM[1] ← (M:s, B:0, L:SZ, PC:trap_entry_point)`
--
  3. Kernel allocates a contiguous range of physical memory (`B`, `L`) for each application
--

4. Kernel launches/resumes apps with address space `[B, B+L[`, currently executing `PC`:
      - `PSW ← (M:u, B:B, L:L, PC:PC)`
--
  5. At the trap entry point, kernel decodes the instruction `MEM[0].PC`, determines the cause of the trap and takes appropriate actions

---
# Building the Hypervisor

To build a hypervisor for that CPU, this is the research question asked by
  Popek & Goldberg:
--

- **Given a computer defined according to the model, under which conditions can
a hypervisor be constructed so that it:**
  - can execute one or more VMs;

--
  - is in complete control of the hardware at all times;

--
  - supports arbitrary, unmodified, and *potentially malicious* guest OSes designed for the same architecture; and

--
  - is efficient and show at worst a small performance decrease vs. non-virtualised execution?

---
# Building the Hypervisor

**To satisfy these requirements, the hypervisor needs to comply with these criteria**:
--

1. **Safety**
      - VMM in complete control of the hardware at all time
      - No assumption on guests, they can be malicious
      - VMM must enforce isolation between a VM and the VMM/hardware, and between VMs themselves (no shared state)
--
  2. **Equivalence**
      - VM is a duplicate of the underlying physical machine
      - Guest OS and apps behave the same natively and in a VM
          - They run *unmodified* in the VM
--
  3. **Performance**
      - Minimal decrease in a virtualised program execution speed

---
class: inverse, center, middle

# The Popek and Goldberg Theorem

---
name:theorem

# The Theorem

- We want **any guest CPU instruction that may allow escaping isolation to be caught
  by the hypervisor**
--

- Central idea behind constructing an efficient and secure hypervisor:
  - **run hypervisor in supervisor mode** and
  - **run guest apps *and guest OS* in user mode**
--

- Not doable with every ISA
  - Theorem tells us how an ISA should behave for such an hypervisor to work

---
# The Theorem

- We can classify the instructions of an ISA into 2 categories:
  1. **Sensitive instructions**, subdivided into:
      - **Control sensitive:** instruction that *updates the system state*
          - E.g. instructions modifying PSW in our example, `LGDT` on x86-32: load interrupt descriptor table
--
      - **Behaviour sensitive:** instruction whose *semantics depends on the value of the system state*
        (e.g. the privilege level)
          - E.g. `POPF` loads status register with data from the stack, works in
            supervisor mode but fails silently in user mode
--
  2. **Innocuous instructions:** instruction that are not sensitive
--

- We also define the concept of **privileged instructions**
  - Can only be executed in supervisor mode and traps when executed in user mode
      - E.g. `HLT` traps if `%cpl != 0` (x86-32's supervisor mode)

---
# The Theorem

- **Sensitive instructions**: update the system state or semantics depends on
  the value of the system state
- **Privileged instructions:** work only in supervisor mode, trap in user mode

Theorem:
> **For a given ISA, a VMM may be constructed if the set of sensitive instructions
  for that ISA is a subset of the set of privileged instructions**, i.e. if
<br>{control-sensitive} ∪ {behaviour-sensitive} ⊆ {privileged}

---
# The Theorem

---
# The Theorem

---
# The Theorem
**{control-sensitive} ∪ {behaviour-sensitive} ⊆ {privileged}**
- **Converse holds too: if the criteria is not met, a VMM cannot be constructed
  for that architecture**

--
- If a **control-sensitive** instruction does not trap, any guest can modify
      the system state without supervision/check from the VMM
      - E.g. a guest OS installing an arbitrary page table: **loss of safety**
--
- If a **behaviour-sensitive** instruction does not trap, the behaviour of such
  instructions will differ from non-virtualised execution when executed by the
  guest OS
  - Remember that the guest OS runs in user mode when virtualised
      - Guest OS instruction executed with user-level semantics: **loss of equivalence**

---
# Hypervisor Operation

Under these conditions, VMM operates as follows:
- For performance reasons we want to run as much guest code as possible
  directly on the CPU i.e. without involving (i.e. trapping to) the VMM
- **VMM runs in supervisor mode, guest OS runs in user mode**

---
# Hypervisor Operation

- **VMM allocates contiguous physical memory for himself, never accessible by
  guests**
- **VMM allocates contiguous physical memory for VMs**
  - Each machine gets a range defined by `addr0` and `memsize`

---
# Hypervisor Operation

- VMM keeps in memory the CPU state for each VM, `vPSW`
  - Consists of `(M, B, L, PC)`
      - `M`: execution mode the VM **thinks** it’s running on: `s` when guest OS runs and `u` when a guest app runs

---
# Hypervisor Operation

.small[
- **VMM starts/resumes VM execution by loading the hardware `PSW ← (M’, B’, L’, PC’)`**
  - `M’ ← u`
  - `B’ ← addr0 + vPSW.B`
  - `L’ ← vPSW.L`
  - `PC’ ← vPSW.PC`
]

---
# Hypervisor Operation

- Note that any attempt by the guest to modify `M`, `B` or `L` will trap
    - Theorem hypothesis assumes all control-sensitive instruction are also privileged

**More generally, what happens upon a trap?**

---
# Hypervisor Operation

What happens upon a trap to the hypervisor?

1. Hypervisor update `vPSW.PC` ← `PSW.PC`
2. Hypervisor looks at the instruction that trapped (`PSW.PC`) and emulates its
   semantics according to the ISA
  - Actions taken here depend if guest trapped from user or kernel space

---
# Hypervisor Operation

- **If guest OS caused the trap** (`vPSW.M == s`): sensitive instruction
  trapped, VMM handles it

--
  - E.g. if guest OS tries to update the segment register, the VMM checks and update `vPSW.B` and `vPSW.L`
          - Hardware `PSW.L` and `PSW.B` will be set accordingly when we return
              to VM execution: `PSW.B` ← `addr0 + vPSW.B` and   
              `PSW.L` ← `vPSW.L`
          - The MMU is **transparently configured differently than what the guest OS requests**
--
  - VMM ensures the VM will resume at the next instruction: `vPSW.PC++`
  - VMM loads PSW and VM resumes

---
# Hypervisor Operation

- **If guest application caused the trap** (`vPSW.B == u`): app is doing a syscall/something illegal: trap should be handled by the guest OS

--
  - `MEM[addr0]` ← `vPSW`, save guest application state in the host-physical
   location of guest-physical `MEM[0]`
  - `vPSW` ← `MEM[addr0 + 1]` load guest OS state (OS entry point) from memory,
    after a check on the validity of `B` and `L`
  - VMM loads PSW to resumes VM (in guest OS mode based on the updated vPSW)

---
# Hypervisor Operation

.medium[
- **According to the theorem hypothesis, all instructions updating the system
state (control-sensitive) are privileged, so they will trap**
  - Includes instructions updating the virtual to physical mapping
      - Each of these needs to be **checked** by the VMM to ensure **safety**
        (isolation)
      - Each of these needs to be **emulated** to give each VM the illusion of
      exclusive and full access to physical memory
]

---
# Hypervisor Operation
.medium[
- **User/supervisor transition instructions (e.g. syscalls) need to be tracked
  by the VMM**
  - To keep `M == u` at all times in the VM in to ensure safety: VMM in
        complete control at all times
  - To keep track of `vPSW.M` to correctly emulate privileged
        instruction (behaviour-sensitive) according to the current guest
        privileged level (guest-user or guest-supervisor) to ensure equivalence
  - Such transition instructions are sensitive so tracking is possible
]
--
.medium[
- **Still according to the hypothesis, behaviour-sensitive instruction will also trap**
  - E.g. reading `PSW.M` or `PSW.B`
      - Remember than the actual values are set by the VMM to something
        different from what the guest OS think they are
      - Need to be emulated by the VMM otherwise this would lead to programs
        behaving differently on bare-metal vs virtualised: equivalence 
        requirements
]

---
class: inverse, center, middle

# Theorem Violations

---

# Theorem: Counter-Examples

- **Control-sensitive unprivileged instructions**
  - Update to the system state that does not trap!
      - E.g. unprivileged switch from supervisor to user mode with `JRST1`
      “return to user” in DEC PDP-10 issued from supervisor mode
--
- **Behaviour-sensitive unprivileged instructions**
  - In particular instructions reading the system state that do not trap, violates the equivalence criteria
      - E.g. `POPF` fails silently and does not trap in user mode in x86-32
--
- **Instructions bypassing virtual memory**
  - If they don’t trap, the VM directly access physical memory, possibly outside of the range allocated by the VMM

---
# Virtualising Unvirtualisable ISAs

.medium[
Techniques to virtualise ISAs that do not adhere to the P&G criteria:
- **Dynamic binary translation** (VMWare workstation): when the guest OS runs, the VMM fully **emulates** the ISA
  - Emulation is much **slower** than direct execution
]
--
.medium[
- **Shadow Page Table** (VMWare): VMM tricks the guest into thinking it can install its own page table, traps on every modification to install/maintain its own
  - Also makes things **slower**
]
--
.medium[
- **Paravirtualisation** (Xen): guest OS slightly modified to call the VMM upon operation necessitating non-virtualisable instructions (e.g. page table update)
  - Need to **modify** guest OS
]
--
Compromises can be made on the **performance** and **equivalence** principles (not on the **security**!)

---
class: inverse, center, middle

# Wrapping Up

---
# Wrapping Up

- P&G: can build an hypervisor offering performance/safety/equivalence by running
       it in user mode
- Theorem: **all sensitive instructions need to trap in user
mode for an ISA to be virtualisable**
- Demonstrated how to operate an hypervisor under a simplified CPU model