Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

The Theory of Virtualisation

The slides for this chapter are available here.

Here we’ll talk about how virtualisation works from the theoretical point of view. Why study the theory of virtualisation? First, it will help us understand the core requirements of virtualisation, and the characteristics that an instruction set architecture (ISA) needs to have to be virtualisable. And second, it will allow us to understand the working principles of virtualisation on a hypothetical model of processor which is much simpler than the CPUs we use today.

The Popek and Goldberg Theorem

We already mentioned this seminal virtualisation paper published in 1974 in Communications of the ACM:

Popek, Gerald J., and Robert P. Goldberg. “Formal requirements for virtualisable third generation architectures.” Comms. of the ACM 17.7 (1974): 412-421.

You can access the paper’s PDF by clicking on its picture above. It is well worth a read, in particular Sections 1 to 5. That being said, we’ll do a good summary of these sections here.

The original idea of the paper was to show that, at the time, some contemporary architectures were not virtualisable. Recall that by virtualisable here we mean that we want to be able to run efficiently several operating systems on the same machine The authors take in the paper an at-the-time popular ISA as a case study: the DEC PDP-10, which was not virtualisable based on a series of reasons defined in the paper.

To define the criteria an ISA must satisfy to be virtualisable, the paper starts by describing the key properties that a virtual machine monitor/hypervisor must present. These requirements are safety, efficiency, and equivalence. We’ll develop on these later. The paper then defines what is now known as the Popek and Goldberg Theorem, listing the requirements for an ISA to support such a VMM, i.e., to be virtualisable.

At the time in the 1970s the paper did not make a lot of noise because virtualisation was not a popular topic. Later, with computers becoming much more common and widespread, there was a big regain of popularity for virtualisation. By the end of the 1990s, many actors were looking to run efficient virtual machines, however the most popular ISA at the time, which was Intel x86-32, was not virtualisable: indeed, it did not satisfy the requirements pointed out in the paper written by Popek and Goldberg. Following that, in the 2000s, when the 64 bits instruction sets (e.g., x86-64 or ARM64) we still use today were created, their designers took great care to follow the principles defined in the paper And they succeeded, as on these ISAs we can run virtual machines very efficiently.

By studying this paper we will also learn about the working principles if virtualisation, and how a virtual machine monitor works when it runs on a virtualisable ISA. We will explain the theorem as follows. We’ll first describe the simplified model of processor presented in the paper, and will then present how a regular operating system would run without virtualisation on top of that processor model. Next, we will present the Popek and Goldberg theorem, which lists the characteristics that an ISA needs to have to be virtualisable. Then we will then describe how our simplified CPU model can be virtualisable and how a virtual machine monitor would work on that CPU. Finally, we will briefly give examples of ISAs that do not satisfy the theorem, and we’ll see how they cannot be virtualisable concretely.

Simplified CPU Model

Hardware Mechanics

This processor has 2 privilege levels, user and supervisor. The physical memory it can access is contiguous, starts at address 0 and is of size SZ. Virtual memory is supported and an application running on the CPU accesses the virtual address space with loads and stores which are mapped to physical memory by the MMU. Virtual memory is implemented through segmentation (remember this paper is very old): at any time the virtual address space seen by what is running on the CPU ranges from virtual address 0 to virtual address L, and it is mapped to a segment of physical memory from physical address B to physical address B + L. This is illustrated here:

The CPU state is composed of 4 control registers that together form the Processor Status Word (PSW). The control registers are as follows:

  • The current privilege level M (for mode), which can take the value S for supervisor mode, or U for user mode.
  • The segment register (B, L): as mentioned previously these are the physical addresses of the base of the currently mapped segment, and its length.
  • The program counter that points in virtual memory to the current instruction being executed.

A complete view of our processor so far is as follows:

We also need CPU support for entering the OS following an interrupt or exception. This is also called a trap. It works as follows. Assume the CPU runs an application in user mode. There is a trap, for example because the application is making a system call. The PSW at that stage represent the CPU state for the application, and it needs to be saved somewhere in memory. There is a dedicated space for that, it’s the first slot in memory MEM[0]. We also need to switch to the kernel, so we load the kernel CPU state that was previously saved in another dedicated location, MEM[1]. This kernel state loads the kernel memory segment which simply gives the OS access to the entire physical memory, and set the program counter to a predefined kernel code location which is the trap entry point. Of course the privilege level of the kernel PSW is set to supervisor mode. When this is done the kernel starts to run and processes the trap:

When the kernel is done processing the trap, to return to user space the PSW is loaded with the previously-saved application CPU state from MEM[0]. The application can then resume:

Non-Virtualised OS Operation

When running software without virtualisation, the OS kernel runs in supervisor mode (M = s) and application run in user mode (M = u). When it boots, the kernel loads MEM[1] with the kernel state the CPU should take upon a trap: M:s, B:0, L:SZ, PC:trap_entry_point. That is: supervisor privilege level, access to all physical memory with B = 0 and L = SZ, and the program counter set to the trap entry point in the kernel code.

For each application the kernel allocates a contiguous range of physical memory defined with (B, L). For security reasons the kernel makes sure the segments given to applications do not overlap. The kernel launches and resumes an application which address space is defined by segment [B, B+L] and wants to run instruction pointed by PC by loading its PSW with the following values: (M:u, B:B, L:L, PC:PC).

Finally, when the kernel runs after a trap, it decodes the application instruction that caused the trap and take action.

The Popek and Goldberg Theorem

Hypervisor Objectives and Requirements

Back to the paper, the authors ask the following research question: what are the requirements for building an hypervisor for that CPU in such a way that the hypervisor:

  1. can execute one or more virtual machines;
  2. can support any operating system designed to run non-virtualised for that CPU;
  3. supports arbitrary, unmodified, and potentially malicious guest OSes designed for the same architecture;
  4. is in complete control of the hardware at all times; and
  5. is efficient and show at worst a small performance decrease vs. non-virtualised execution?

To reach these objectives, the hypervisor needs to comply with the following 3 fundamental requirements:

  1. Safety: the VMM must be in complete control of the hardware at all time. It should not assume that the guests code (guest applications/OS) will behave correctly, in fact it should assume the guest can be malicious. For that reason the VMM must enforce isolation between a VM and the VMM/hardware, and between VMs themselves.

  2. Equivalence: a VM should be a duplicate of the physical hardware, and the guest OS and applications should not have to be modified to run in a VM. Their behaviour in a VM should be exactly the same as if they were running natively.

  3. Performance: when running virtualised, the guest OS and application should see a minimal performance slowdown compared to native, non-virtualised, execution.

To satisfy the performance criteria we want to run as many guest instructions as possible directly on the CPU. However, to satisfy the safety criteria we want to make sure that any guest instructions that may allow it to escape the virtualisation isolation will trap to the hypervisor. This way the hypervisor can emulate that instruction safely without breaking the isolation between VMs and itself. Hence, the central idea behind constructing an efficient and secure hypervisor is to run the hypervisor in supervisor mode and run the guest applications and the guest OS in user mode. The hope is that every instruction that would allow a guest to escape the isolation enforced by the hypervisor would be forbidden in user mode hence if a guest attempt to execute such instructions they would trap to the hypervisor to be emulated.

This is not doable with every architecture, the that we’ll present next will tell us the properties an ISA should have so that we can build such a virtual machine monitor on that ISA.

Classifying Instructions

The last thing we need before presenting the theorem is to classify instructions.

The first category is called sensitive instructions. It is subdivided into 2 subcategories:

  1. Control-sensitive instructions: these are the instructions that update the system state, for example the instructions modifying the PSW in our example, or LGDT on Intel x86-32 which allows installing new interrupt handlers.

  2. Behaviour-sensitive instructions: these are the instruction whose semantics depends on the value of the system state such as the privilege level. An example here is POPF on x86-32 that loads a status register with data from the stack: it works fine in supervisor mode (ring 0) but fails silently in user mode (ring 3).

The instructions that are not sensitive are named innocuous instructions: they do not update the system state and their behaviour does not depend on it either.

A second category of instructions is privileged instructions. We already cover these in a previous lecture: they can only be executed in supervisor mode and traps when executed in user mode. An example of such instruction is HLT, that shuts down the CPU in supervisor mode but rather traps if executed in user mode on x86-32. Instructions can be privileged or not independently of their sensitive/innocuous nature.

The Theorem

We can now gives the Popek and Goldberg theorem that specifies the requirements for an instruction set to be virtualisable:

For a given ISA, a VMM may be constructed if the set of sensitive instructions for that ISA is a subset of the set of privileged instructions, i.e.
if {control-sensitive} ∪ {behaviour-sensitive} ⊆ {privileged}

The theorem simply states that for a given ISA to be virtualisable, the set of all sensitive instructions need to be a subset of the privileged instructions. In other words, every sensitive instruction must trap when executed in user mode.

In the diagram above, we can see a virtualisable ISA on the left, satisfying the theorem requirement. On the right we have a non-virtualisable ISA: a subset of sensitive instructions are not privileged and will not trap when executed in user mode.

So why can’t an ISA be virtualised if some of its sensitive instructions do not trap when executed in user mode? Recall that with the hypervisor we want to build, both the guest applications and guest OS run in user mode. If you consider a control-sensitive instruction that would not trap in user mode, any guest could update the state of the system without supervision from the hypervisor. Imagine a guest being able to install an arbitrary segment register or an arbitrary page table and mapping physical memory it is not supposed to access. That would break the safety criteria the hypervisor needs to maintain, and things wouldn’t work. If you now consider a behaviour-sensitive instruction that does not trap in user mode, it means that the guest OS, when executing this instruction and expecting the supervisor mode behaviour, will actually see the user mode behaviour: this breaks the equivalence criteria the hypervisor needs to maintain and once again things do not work.

Hypervisor Operation

Basic Principles

Let’s now discuss how an hypervisor that satisfies the Popek and Goldberg requirements would work, with the goal of reaching the safety, equivalence, and performance objectives. The VMM operates as follows: for performance reasons we want to run as much guest code as possible directly on the CPU without trapping. So as we mentioned, the hypervisor will run in supervisor mode, and the guest including its operating system will run in user mode:

The hypervisor will reserve some contiguous memory for itself. That memory should never be accessed by the guest for obvious security reasons. The hypervisor will also allocate contiguous ranges of physical memory, one for each VM We can define each VM range with a base physical address addr0, and a length, memsize:

For each VM the hypervisors keeps in memory a data structure that represents a software model of what the VM thinks is the current PSW. We call it the virtual PSW, vPSW. It has the same registers as the real PSW: a privilege level M, which is user when the VM runs an application and supervisor when the guest OS runs; a segment register (B, L) representing the address space of what is currently running in the VM; and a program counter PC. This can be illustrated as follows:

Starting/Resuming a VM

When starting a VM or when resuming the execution of a VM following a trap, the hypervisor loads the real PSW as follows:

  • As explained previously the real privilege level of a VM is always user mode: M’ ← u.
  • The real segment base address is the base address of the physical memory allocated to the VM, addr0, plus the base address of the vPSW, vPSW.B, so we have B’ ← addr0 + vPSW.B.
  • The real segment length is the vPSW length: L’ ← vPSW.L.
  • The real program counter is that indicated in the vPSW: PC’ ← vPSW.PC.

The hypervisor resuming a VM in the state presented on the last diagram can be illustrated as follows:

For security reasons, addresses from the vPSW are checked by the hypervisor before being loaded in the real PSW so that we don’t load in the real PSW something that would go beyond the limits of what is allocated to the VM. Further, any attempt by the guest to modify M, B or L will trap: the theorem hypothesis assumes all control-sensitive instruction are also privileged. Because the guest always run in user mode (independently of the value of vPSW.M), they will trap when the guest OS uses them. Hence, the hypervisor can maintain correct values for the vPSW, as it needs to always keep track of what the VM things the PSW is.

Handling Traps

When the VM traps, the hypervisor keeps the vPSW up to date by writing the trap’s PC in it: vPSW.PCPSW.PC. Based on the instruction that caused the trap, the VMM will emulate a non-virtualised machine’s behaviour. What to do depends on what the guest was running when it trapped, was it application code, or kernel code.

If the guest kernel caused the trap (vPSW.M is s), it means the guest OS probably executed a sensitive instruction. The VMM handles that depending on the instruction in question. For example if the guest is trying to update the segment register, the vPSW is updated with what the guest wants to write in there. When we return to the guest from the trap, the real PSW will be updated with the method we just described. This way the MMU is configured differently from what the guest requests, but this is also completely transparent from the guest point of view. Before returning to the guest the VMM will set the vPSW’s PC to the next instruction to denote the fact that the emulated instruction ran successfully.

Upon a trap happening while the guest was running application code (vPSW.M is u), the application is either doing a system call or a fault like a division by zero. To manage that fault the guest OS needs to run: the hypervisor needs to emulate a transition to the guest OS. The hypervisor starts by saving the guest application’s state (which is the vPSW’s value) inside the VM’s memory in the dedicated location which is the VM’s equivalent of MEM[0]. Then it loads into the vPSW the guest OS state from the guest’s memory equivalent of MEM[1]. Finally, it loads the real PSW based on the method we presented previously to resume the VM and start to execute its kernel. This can be illustrated as follows:

On this diagram the vPSW is a data structure present somewhere in the hypervisor’s memory in red. It is copied in the VM’s MEM[0] (step 1) to emulate saving the state of the application that was running in the VM. The guest OS’ state is loaded in the vPSW from the VM’s MEM[1] (step 2). The vPSW will then be restored into the hardware PSW as previously described.

Theorem Violations

Going back to the theorem, because all guest control sensitive instructions that update the state of the system trap to the hypervisor. The hypervisor can check them, for example to make sure a VM does not map memory outside of what it can access. The hypervisor can also emulate them, to give each VM the illusion that it’s in total control of the hardware like it would when running natively.

The transition instructions between the guest application and kernel need to trap to the hypervisor, so that it can update the mode register of the vPSW. This way the hypervisor knows when a trap originates from the guest kernel or application, and it can emulate it accordingly. Transition instructions are sensitive, so they will trap. Behaviour sensitive instructions, for example reading the values of the PSW, will also trap. Once gain the real PSW is loaded with values that are different from the guest OS thinks is in the PSW. So if the guest OS it tries to read these registers, the hypervisor will return the emulated values, so we can maintain equivalence.

Many ISAs proposed between the 70s-2000s violated the theorem and were not virtualisable properly. A prime example here is x86-32, which had POPF, a behaviour-sensitive instruction that did not trap but rather failed silently when executed in user mode. When executed in supervisor mode, this instruction was used by the OS to query important information about interrupts So, assuming an x86-32 virtualised guest OS running in user mode, these queries would fail silently, and the OS would misbehave by acting on garbage interrupt information. This was a problem because the demand for virtualisation became quite high at that time x86-32 was the most popular ISA.

Another example of theorem violation was the DEC PDP-10, which had a JRST1 instruction performing a world switch returning to user mode from supervisor mode. That instruction would not trap when run in user mode, hence with a virtualised guest OS the hypervisor would be unable to catch it to keep track of the mode the VM thinks it is running in (user/supervisor).

Because of a growing need for virtualisation by the end of the 1990s/early 2000s, techniques were developed to try to virtualise ISAs violating the theorem. Each of them had to compromise on some of the key objectives of virtualisation. For example, introducing more emulation by running the entire guest OS or at least every access to the guest page table as emulated would allow virtualising x86-32, but it was very slow, breaking performance. Another approach was paravirtualisation, where the guest OS was modified to be virtualisation-aware (e.g. not to issue any of the sensitive instructions that did not trap in user mode), breaking equivalence. Overall, it was OK to compromise on performance or equivalence, but of course never of safety.