CPU and Memory Virtualisation

The slides for this chapter are available here.

We have seen that x86-32, among other ISAs, was not virtualisable according to the Popek and Goldberg theorem, and that attempts at virtualising it had to compromise on performance or equivalence. Because of the high demand for virtualisation in the early 2000s, and the related problems with x86-32, the next-generation ISA, x86-64, first proposed in the early 2000s, did not make the same mistake: it was designed with hardware-based virtualisation support in mind. On Intel processors this is achieved with 3 key technologies: VT-x for CPU virtualisation, Extended Page Tables (EPT) for memory virtualisation, and VT-d for I/O virtualisation. We’ll focus on Intel here, but note that AMD, the other major manufacturer of x86-64 CPUs, has very similar technologies.

x86-64 CPU Virtualisation with VT-x

Motivation

Let’s start with CPU virtualisation. The existing software techniques to virtualise x86-32 faced the following challenges. First, the guest OS runs at a privilege level it was not designed for, namely user mode. On x86, privilege levels are called rings: supervisor mode is ring 0 and user mode is ring 3, and with virtualisation guest OSes run in ring 3 while they were designed to run in ring 0. Second, the hypervisor needs to be located somewhere in memory and be inaccessible from the guests. Third, the performance impact of the traps needed to emulate every sensitive operation is significant: these guest-host transitions are frequent and costly, leading to important slowdowns.

The key design idea behind VT-x, which is x86-64’s hardware support for CPU virtualisation, was to propose a holistic solution rather than addressing separately each of the issues that made x86-32 hard to virtualise. For example, changing the semantics of individual instructions such as POPF would be bad for backward compatibility. Instead, x86-64’s designers addressed all the issues at once by introducing a new mode of execution. The entire state of the CPU is duplicated across 2 modes: root mode for running the hypervisor and host operating system’s code, and non-root mode for virtual machine code.

VT-x Overview

The two modes are illustrated here: we have a machine with one hypervisor and host OS running in root mode in ring 0, and host-level applications running in root mode in ring 3. We also have 2 VMs, each running a guest OS in non-root mode ring 0, and guest applications in non-root mode ring 3.

At any point in time the CPU is either in root or in non-root mode, and privilege levels (rings) are orthogonal to root/non-root modes and are available in both. Each mode has its own address space, which is switched automatically upon transitions, including the virtual memory translation caches. This allows the hypervisor and other host-level software to be well isolated from the guest software.

VT-x and P&G

Remember the key objectives for a proper hypervisor that we listed in the previous lecture. In terms of equivalence, the state of the virtualised CPU exposed by VT-x in non-root mode to VMs is an exact duplicate of the physical CPU state: guests can run x86-64 code and remain backward compatible with x86-32. Regarding safety, with architectural support the hypervisor codebase is much simpler, which leads to a reduced attack surface compared to approaches based on emulating the execution of the entire guest OS or on paravirtualisation, which need to maintain complex invariants. Finally, concerning performance, it was not a primary goal at first: the first generation of VT-x CPUs was actually slower than state-of-the-art paravirtualised/OS emulation approaches.

With x86-64’s root and non-root modes, we can rework Popek and Goldberg’s theorem as follows:

When executed in non-root mode, all sensitive instructions must either 1) cause a trap, or 2) be implemented by the CPU and operate on the non-root duplicate of the CPU state.

Having each sensitive instruction trap to the VMM in root mode would satisfy the equivalence and safety criteria. However, these traps are very costly, and we can’t have them be too frequent: ideally we want as few traps as possible to keep performance close to native execution. Obviously, managing the virtualisation of more sensitive instructions in hardware means implementing more logic in the CPU, so there is a trade-off between hardware complexity and cost vs. performance here.

Root/Non-Root Transitions

Let’s see briefly how VT-x manages transitions between root and non-root mode. Assume the hypervisor is running in root mode at first. The hypervisor can start or resume a VM with the VMLAUNCH and VMRESUME instructions: the CPU switches to non-root mode and starts running the guest. In the other direction, transitions from the VM to the hypervisor are called VMEXITs. The VM transitions to the hypervisor following a trap, or through an explicit call to the hypervisor with the VMCALL instruction. In these cases the CPU switches from non-root to root mode and starts running hypervisor code to handle the trap.

The CPU maintains a data structure containing information about the guest, for example the reason for a VMEXIT: the virtual machine control structure (VMCS). The hypervisor must use specific instructions to access it: VMREAD and VMWRITE.
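
As a rough illustration, here is a minimal sketch of how a root-mode VMEXIT handler could read the exit reason from the VMCS with VMREAD and dispatch on it. The field encodings follow Intel’s SDM, but the helper and handler names are hypothetical placeholders, not code from any real hypervisor:

```c
/* Minimal sketch of a root-mode VMEXIT handler reading the VMCS with
 * VMREAD and dispatching on the exit reason. Field encodings follow
 * Intel's SDM; the structure of the handler is purely illustrative. */
#include <stdint.h>

#define VMCS_EXIT_REASON        0x4402  /* basic exit reason field   */
#define VMCS_EXIT_QUALIFICATION 0x6400  /* extra info for some exits */

static inline uint64_t vmcs_read(uint64_t field)
{
    uint64_t value;
    __asm__ volatile("vmread %1, %0" : "=r"(value) : "r"(field) : "cc");
    return value;
}

void handle_vmexit(void)
{
    uint64_t reason = vmcs_read(VMCS_EXIT_REASON) & 0xffff; /* basic reason */
    uint64_t qual   = vmcs_read(VMCS_EXIT_QUALIFICATION);

    switch (reason) {
    case 10: /* CPUID: emulate it and advance the guest PC      */ break;
    case 18: /* VMCALL: hypercall from the guest                */ break;
    case 48: /* EPT violation, details are in 'qual'            */ break;
    default: /* other reasons, see the table of categories below */ break;
    }
    (void)qual;
    /* ...then VMRESUME back into the guest (non-root mode). */
}
```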

These operations and the VMCS can be illustrated as follows:

One can see the list of categories of VMEXIT reasons here:

| Category | Description |
| --- | --- |
| Exception | Guest instruction caused an exception (e.g. division by 0) |
| Interrupt | Interrupt from an I/O device received during guest execution |
| Triple fault | Guest triple faulted |
| Root-mode sensitive | x86 privileged/sensitive instructions |
| Hypercall | Explicit call to the hypervisor through VMCALL |
| I/O | x86 I/O instructions, e.g. IN/OUT |
| EPT | Memory virtualisation violations/misconfigurations |
| Legacy emulation | Instruction not implemented in non-root mode |
| VT-x new | ISA extensions to control non-root execution (VMRESUME, etc.) |

VMEXITs happen when the CPU faults or invokes a system call (these are software exceptions), but also when an interrupt is received from an I/O device, when the guest triple faults, or when it invokes a sensitive instruction. The guest can also voluntarily trigger a VMEXIT: this is called a hypercall. A hypercall is to a hypervisor what a system call is to an operating system. There are other categories such as I/O instructions, memory virtualisation VMEXITs, other instructions that need to be emulated, and the VT-x instructions themselves.
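
To make the hypercall category concrete, here is a hedged guest-side sketch: the guest places a hypercall number and an argument in registers and executes VMCALL, which causes a VMEXIT that the hypervisor handles in root mode. The register convention shown (number in RAX, first argument in RBX) mirrors the one KVM uses, but treat the exact convention as an assumption that depends on the hypervisor:

```c
/* Sketch: a guest issuing a hypercall. VMCALL triggers a VMEXIT that the
 * hypervisor handles in root mode. Register convention modelled on KVM's
 * (number in RAX, first argument in RBX); other hypervisors may differ. */
static inline long hypercall1(unsigned long nr, unsigned long arg0)
{
    long ret;
    __asm__ volatile("vmcall"
                     : "=a"(ret)            /* return value in RAX    */
                     : "a"(nr), "b"(arg0)   /* nr in RAX, arg0 in RBX */
                     : "memory");
    return ret;
}
```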

Introduction to KVM

KVM is a hypervisor integrated into the Linux kernel that leverages VT-x on x86-64. KVM stands for Kernel-based Virtual Machine. It’s a type 2 hypervisor designed within Linux from the ground up assuming hardware support for virtualisation, like VT-x for x86-64 and equivalent technologies for the other modern ISAs.

KVM is a module that is part of the Linux kernel code, so it lives in kernel space. KVM partially manages virtual machines, by doing things like handling traps, maintaining the virtual machine control structure, etc. Still, KVM must also rely on a user space program to handle other virtual machine management tasks, in particular resource allocation. That user space program is very often Qemu. Qemu is originally a machine emulator, but CPU and memory emulation can be disabled when running on top of KVM, because they are handled by VT-x and the memory virtualisation technology we will cover very soon: this makes things much faster, close to native performance. The KVM + Qemu combination is arguably the most popular hypervisor today.

From a high level point of view, KVM and Qemu cooperate to run a virtual machine as follows:

Assume a VM executes and the CPU is in non-root mode. When there is a trap, the CPU switches to the host in root mode and KVM code starts to run. KVM examines the reason for the trap in order to handle it. Many traps can be handled simply by looking at the VMCS: some instructions need to be emulated, some require injecting a fault into the VM, some require a retry, and others require nothing at all.

In some cases the VMCS’ content is not enough to handle the trap, and KVM embeds a general purpose x86 instruction decoder and emulator to manage these. The instruction causing the trap is fetched from guest memory, decoded and verified. If there are any, memory operands are read from the guest’s memory. Then the instruction is emulated, result operands are written to guest memory if needed, and the guest registers (including the PC) are updated. Finally, the VM execution can resume.

Traps related to I/O devices are handled by Qemu. As we mentioned, Qemu is a user space program, so for these traps the CPU must transition to the host’s user space.
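
This cooperation is visible in KVM’s user space API: a VMM such as Qemu opens /dev/kvm, creates a VM and a vCPU, then sits in a loop calling KVM_RUN; whenever a VMEXIT cannot (or should not) be handled inside the kernel, KVM_RUN returns to user space with an exit reason for the VMM to handle. The following is a deliberately minimal skeleton of that loop, with error handling and the VM setup (guest memory, registers, guest code) omitted:

```c
/* Minimal skeleton of the KVM_RUN loop a user space VMM (like Qemu) drives.
 * Error handling and VM/vCPU setup (memory, registers, code) are omitted. */
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
    int kvm  = open("/dev/kvm", O_RDWR);
    int vm   = ioctl(kvm, KVM_CREATE_VM, 0);
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);

    /* The kernel shares a 'struct kvm_run' with user space via mmap;
     * it describes why KVM_RUN returned (the exit reason). */
    long size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);          /* enter the guest (non-root mode) */
        switch (run->exit_reason) {       /* back in user space: why?        */
        case KVM_EXIT_IO:                 /* guest did IN/OUT: emulate the device */
            break;
        case KVM_EXIT_HLT:                /* guest halted: stop this vCPU */
            return 0;
        default:
            fprintf(stderr, "unhandled exit %d\n", run->exit_reason);
            return 1;
        }
    }
}
```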

x86-64 MMU Virtualisation with EPT

We have covered the CPU, let’s now talk about hardware-assisted memory virtualisation for x86-64. The first iterations of x86-64 did not have support for hardware-assisted memory virtualisation, only VT-x for the CPU. They assumed disjoint page tables for root and non-root mode, which effectively isolated the hypervisor from the guests by making sure they could not map it. However, every guest page table update still needed to trap to the hypervisor to be validated, to make sure the guest does not try to map something it should not have access to. This is called shadow paging, and it is notoriously slow because page table updates are quite frequent.

Without hardware support for MMU virtualisation, another option is paravirtualisation, i.e., modifying the guest so that it does not update its page tables directly, but rather requests the hypervisor to do so in a controlled fashion. As we saw, paravirtualisation breaks equivalence, so that solution is not ideal either.

Extended Page Tables: Introduction

There was a need for hardware support for memory virtualisation, similar to what VT-x does for the CPU. The technology for memory virtualisation is called Extended Page Tables (EPT), and the approach was presented in this seminal paper in 2008:

R. Bhargava et al., Accelerating Two-Dimensional Page Walks for Virtualised Systems, ASPLOS’08

With EPT the guest OS maintains its page tables normally: it can update them freely without trapping to the hypervisor. There is one page table per guest process, and it maps guest virtual to guest pseudo physical addresses. The key idea behind EPT is to add a second level of address translation, the extended page table. There is one extended page table per VM, and it maps guest pseudo physical addresses to host physical addresses:

The hypervisor is in total control of these extended page tables, hence it can make sure the guest OSes map only the memory they are allowed to access.

With performance in mind, having to walk 2 levels of page tables is concerning. Still, EPT is designed in such a way that the translation caches, the translation lookaside buffers (TLBs), directly cache the guest virtual to host physical mapping. Knowing that the TLB hit rate is about 95% in modern CPUs, there is no need to walk the two levels of page tables for the majority of VM memory accesses. However, if there is a TLB miss, then both levels must be walked: the guest’s page table, and the extended page table.
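
Conceptually, the MMU’s behaviour under EPT can be summarised by the following toy model: on a TLB hit the cached guest-virtual-to-host-physical translation is used directly, and only on a miss does the hardware perform the full two-dimensional walk and refill the TLB. This is a purely illustrative software model of hardware behaviour, not how any MMU is actually implemented; the 2D walk itself is reduced to a stub:

```c
/* Toy model of address translation under EPT: a small direct-mapped "TLB"
 * caching guest-virtual to host-physical translations; on a miss the full
 * 2D walk (guest page table + EPT) would run, represented here by a stub. */
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12

struct tlb_entry { bool valid; uint64_t gvpn, hppn; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Stand-in for the hardware 2D walk (guest page table + extended page table). */
static uint64_t two_dimensional_walk(uint64_t gvpn) { return gvpn; /* toy identity */ }

uint64_t translate(uint64_t guest_virtual)
{
    uint64_t gvpn = guest_virtual >> PAGE_SHIFT;
    struct tlb_entry *e = &tlb[gvpn % TLB_ENTRIES];

    if (e->valid && e->gvpn == gvpn)                 /* TLB hit: ~95% of accesses */
        return (e->hppn << PAGE_SHIFT) | (guest_virtual & 0xfff);

    uint64_t hppn = two_dimensional_walk(gvpn);      /* TLB miss: walk both tables */
    *e = (struct tlb_entry){ true, gvpn, hppn };     /* refill the TLB, gVA -> hPA */
    return (hppn << PAGE_SHIFT) | (guest_virtual & 0xfff);
}
```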

EPT Walk

Before explaining the EPT walk, let’s see how a traditional (non-virtualised) page table is walked by the MMU to perform address translation upon a memory access. The page table is rooted in the %cr3 register. Different parts of the target virtual address index each level of the page table, until the target data page is found. An offset from the beginning of that page, given by the least significant bits of the address, is then added to find the target byte to load or store.

This is illustrated here:

We have on the left the virtual address the CPU wants to access, and on the right the page table. The goal of the page table walk is to find the physical address corresponding to the virtual one in order to perform the memory access.

The %cr3 register contains the physical address of a page constituting the root of the page table. On standard CPUs page tables form a tree with 4 levels, and the root is the 4th level. That root page contains 512 64-bit entries, each of them being a pointer to a page of the next (3rd) level of the page table. The root of the page table is indexed by bits 39 to 47 of the virtual address the CPU wants to access (note that most modern CPUs don’t use the full 64 bits of a virtual address, but rather 48). This selects an entry in the root page, indicating which 3rd level page to use next. Bits 30 to 38 of the address are used to index that page, giving us the 2nd level page, indexed with bits 21 to 29, giving us the 1st level page, indexed with bits 12 to 20. The 1st level page table entry points to the target data page, and that page is finally indexed with bits 0 to 11 to find the target slot.
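
The bit slicing used at each level can be made concrete with a small helper. This is a sketch assuming the standard 4-level x86-64 layout with 48-bit virtual addresses and 4 KiB pages (9 index bits per level, 12 offset bits); the example address is arbitrary:

```c
/* Decompose a 48-bit x86-64 virtual address into the four 9-bit page table
 * indices and the 12-bit page offset used during a 4-level walk. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t va = 0x00007f1234567abcULL;   /* arbitrary example virtual address */

    unsigned l4  = (va >> 39) & 0x1ff;     /* bits 39-47: root (4th level) index */
    unsigned l3  = (va >> 30) & 0x1ff;     /* bits 30-38: 3rd level index */
    unsigned l2  = (va >> 21) & 0x1ff;     /* bits 21-29: 2nd level index */
    unsigned l1  = (va >> 12) & 0x1ff;     /* bits 12-20: 1st level index */
    unsigned off = va & 0xfff;             /* bits 0-11: offset inside the page */

    printf("L4=%u L3=%u L2=%u L1=%u offset=0x%x\n", l4, l3, l2, l1, off);
    return 0;
}
```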

When running virtualised, address translation must walk both the guest page table and the extended page table. Things work as follows:

We have the virtual address targeted by the guest on the left. The guest page table is rooted in %cr3, which contains a guest pseudo physical address, so we first need to translate it into a host physical address: we walk the extended page table to figure out which physical page contains the root (level 4) of the guest page table. Once we have found it, it can be indexed with the most significant bits (bits 39 to 47) of the target address, which gives us the address of the next level (3rd level) page. However, that address is again a guest pseudo physical address, and we need to similarly translate it into a host physical one, so we walk the extended page table again.

Rinse and repeat to find the 2nd level page, the 1st level page, then the target data page, which can finally be indexed by the least significant bits (0 to 11) of the target guest virtual address to find the byte the guest wants to load or store.

In the end, to walk the 2D page table, we had to perform 24 memory accesses to load or store a single byte, compared to 4 memory accesses to walk a standard page table when running non-virtualised. That’s a very high overhead, but remember that 95% of guest memory accesses don’t need to go through this because they hit in the translation cache (the TLB, Translation Lookaside Buffer).
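
The figure of 24 follows from the structure of the walk described above: each of the 4 guest page table levels requires an EPT walk (4 accesses) to locate that level’s page plus 1 access to read the guest entry, and the data page’s pseudo physical address needs one final EPT walk. The short tally below just spells out that arithmetic, assuming 4-level guest page tables and a 4-level EPT:

```c
/* Tally of memory accesses in a 2D (guest + EPT) page walk, assuming
 * 4-level guest page tables and a 4-level extended page table. */
#include <stdio.h>

int main(void)
{
    const int guest_levels = 4;   /* guest page table levels */
    const int ept_levels   = 4;   /* extended page table levels */

    /* Each guest level: EPT walk to locate the page (4) + read the guest entry (1). */
    int per_guest_level = ept_levels + 1;

    /* Plus a final EPT walk to translate the data page's pseudo physical address. */
    int total = guest_levels * per_guest_level + ept_levels;

    printf("%d memory accesses per TLB miss\n", total);   /* prints 24 */
    return 0;
}
```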

Memory Virtualisation in KVM

KVM of course makes use of extended page tables to manage the VMs’ memory. Things work as follows:

The guest manages its own page tables, one per guest process, with minimal intervention from KVM. Qemu lives in the host user space as a regular process. Like every other process it has its own virtual address space. Qemu makes a large call to malloc to allocate a large contiguous buffer that will serve as the guest’s pseudo physical memory.

In the host kernel lives the KVM module. It sets up and manages the extended page table that maps the guest pseudo physical addresses to the host physical memory.

Sometimes the Qemu process needs to read and write the VM’s memory too, for example when virtualising I/O as we will see next. For that it can simply read and write in the large area of virtual memory it allocated for the VM, and Qemu’s page table on the host will be used for the translation, as for any other host process.
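
The glue between Qemu’s big user space buffer and the extended page table managed by KVM is the KVM_SET_USER_MEMORY_REGION ioctl: the VMM tells KVM which range of its own virtual address space backs which range of guest pseudo physical memory, and KVM sets up the corresponding mappings. A hedged sketch follows; the size and addresses are arbitrary examples and error handling is omitted:

```c
/* Sketch: registering a user space buffer as guest pseudo physical memory.
 * 'vm' is the VM file descriptor obtained from KVM_CREATE_VM; the size and
 * addresses are arbitrary examples and error handling is omitted. */
#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

void setup_guest_memory(int vm)
{
    size_t size = 256 << 20;                    /* 256 MiB of guest "RAM" */
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct kvm_userspace_memory_region region = {
        .slot            = 0,
        .guest_phys_addr = 0x0,                 /* guest pseudo physical base */
        .memory_size     = size,
        .userspace_addr  = (unsigned long)mem,  /* VMM-side virtual address   */
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    /* The VMM can now read/write guest memory directly through 'mem',
     * while KVM maps guest pseudo physical to host physical addresses. */
    memset(mem, 0, size);
}
```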