class: center, middle

### Secure Computer Architecture and Systems
***
# I/O Virtualisation

???
- Hi everyone
- After we covered how the CPU and memory are virtualised on modern ISAs
- Let's now talk about I/O virtualisation

---
# I/O Interposition

- As with CPU/memory, software methods for I/O virtualisation predate hardware support
???
- Similar to CPU and memory virtualisation, the first attempts at virtualising I/Os were achieved in software without hardware support
- This is called I/O interposition
- The hypervisor creates a software model of a virtual I/O device, which the guest OS accesses through a driver, just as if the device were a real, physical one
- The hypervisor is also in charge of connecting the virtual device to the real devices on the host to actually perform I/Os such as accessing the filesystem or the network

---
# Benefits of I/O Interposition

- **I/O device consolidation**: map multiple virtual devices to a smaller set of physical devices
  - Let several VMs share one/a small number of device(s)
  - Increases utilisation and efficiency, reduces costs

???
- Virtualising devices this way has many benefits
- One is device consolidation: we can create many virtual devices on top of a smaller number of physical devices
- For example you can have a host with a single hard disk running several virtual machines, each with its own virtual disk
- This reduces costs and increases device utilisation

--

- **Aggregate several physical devices into a single virtual one**, for example for performance reasons

???
- Conversely, several physical devices can also be aggregated into a single virtual one, with the hope of getting higher throughput

--

- **Hypervisor can capture the entire state of the device**
  - Useful for VM suspend/resume
  - And for migration, including between hosts with different physical devices

???
- Because the virtual device is implemented in software, it is easy for the hypervisor to capture its state at a given point in time
- This is quite useful to enable features such as virtual machine suspend/resume, or migration, including between hosts equipped with different models of physical devices

--

- **Add features not supported by physical devices**: e.g. disk snapshots, compression, etc.

???
- Device virtualisation can also enable features that are not normally supported by physical devices, for example taking disk snapshots, compressing or encrypting I/Os, etc.

---
class: center, middle, inverse

# Physical I/O

???
- Before diving into virtual I/Os, let's briefly talk about how I/O works on a non-virtualised machine

---
# Physical I/O

.leftcol[
]
.rightcol[.medium[
- 3 main mechanisms for devices to interact with the system:
]]

???
- Overall there are 3 ways for the system and devices to interact

---
# Physical I/O

.leftcol[
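As a rough illustration, memory-mapped I/O boils down to plain loads and stores through addresses that are routed to the device's registers rather than to RAM; the sketch below uses a made-up base address and register offsets, not those of a real device:

```c
#include <stdint.h>

/* Hypothetical device: the base address and register offsets are
 * invented for illustration and do not match any real hardware. */
#define DEV_MMIO_BASE 0xfebc0000UL
#define DEV_REG_CTRL  0x00           /* control register */
#define DEV_REG_STAT  0x08           /* status register  */

/* volatile: every access must really reach the device, not a cached copy */
static inline void mmio_write32(uint64_t off, uint32_t val)
{
    *(volatile uint32_t *)(uintptr_t)(DEV_MMIO_BASE + off) = val;
}

static inline uint32_t mmio_read32(uint64_t off)
{
    return *(volatile uint32_t *)(uintptr_t)(DEV_MMIO_BASE + off);
}
```

Port-based I/O is similar in spirit but goes through dedicated `in`/`out` instructions instead of regular loads and stores.
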
]
.rightcol[.medium[
- 3 main mechanisms for devices to interact with the system:
  - **Port-based or Memory-mapped I/O**
    - Basic form, CPU reads/writes to addresses mapped to device registers
  - **Interrupts**
    - Sent by devices to the CPU for notification
  - **Direct Memory Access**
    - CPU starts transfer, later notified when it completes
]]

???
- First, port-based or memory-mapped I/O
- With this method, device registers are mapped somewhere in the address space, and when these addresses are read or written, the CPU actually communicates with the device and reads or writes its registers
- This method of communication is initiated by the CPU and can only transmit very small, register-sized messages
- For example when the CPU configures the network card to enable networking, this is done through memory-mapped I/O
- Second, interrupts: these are unidirectional signals sent from the device to the CPU
- Interrupts are a form of notification, they don't carry data
- For example when the network card wants to notify the CPU that a packet has been received and should be fetched, it uses an interrupt
- Third, direct memory access: DMA is bidirectional, used to transfer large quantities of data between the CPU and the device
- For example when the CPU wants to send or receive data over the network through the network card, it uses DMA

---
# Data Transfer through Ring Buffers

.left3col[
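As a sketch of what "described in the device's registers" means in practice: the register names and layout below are invented for illustration, but real NICs expose a similar base/length/head/tail set per ring, which the CPU programs over MMIO/PIO:

```c
#include <stdint.h>

/* Hypothetical per-ring registers as seen by the CPU; names and layout
 * are invented for illustration, not taken from a real device. */
struct ring_regs {
    uint64_t base;  /* physical base address of the ring in memory */
    uint32_t len;   /* size of the ring (number of entries)        */
    uint32_t head;  /* next entry to be consumed                   */
    uint32_t tail;  /* next entry to be produced                   */
};
```

The base and length are typically written once when the ring is set up; the head and tail registers are then used to keep the CPU's and the device's positions in sync.
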
]
.right3col[.medium[
**Ring buffer: producer-consumer circular buffer**

- Area in memory for CPU/device streaming communication
- Described in the device's registers, set up by the CPU using MMIO/PIO
]]

???
- Large data transfers between the CPU and I/O devices are realised with the use of ring buffers in memory
- A ring buffer is a producer-consumer system, generally enabling unidirectional communication
- To establish such communication, the CPU configures the device with memory-mapped I/O and writes information about the ring buffer into some of the device's control registers:
- What is its base address and length, and what are the head and tail pointers

---
# Data Transfer through Ring Buffers

.left3col[
]
.right3col[.medium[
- In the case of a CPU to device transfer:
  - Device consumes from `head` and updates the pointer
]]

???
- When we have CPU to device transfers, the device consumes data from head and updates the head

---
# Data Transfer through Ring Buffers

.left3col[
]
.right3col[.medium[
- In the case of a CPU to device transfer:
  - Device consumes from `head` and updates the pointer
  - CPU produces at `tail` and updates the pointer
]]

???
- And the CPU produces data at the tail pointer, updating it too

---
# Data Transfer through Ring Buffers

.left3col[
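A minimal sketch of the CPU-side produce path; the sizes, register offset and base address are invented for the example, and the device would symmetrically consume from `head` and signal completion back with an interrupt:

```c
#include <stdint.h>
#include <string.h>

/* All constants below are invented for illustration. */
#define RING_ENTRIES   256
#define ENTRY_SIZE     64
#define DEV_MMIO_BASE  0xfebc0000UL  /* made-up device base address  */
#define DEV_REG_TAIL   0x18          /* made-up tail register offset */

struct ring {
    uint8_t  entries[RING_ENTRIES][ENTRY_SIZE]; /* shared with the device */
    uint32_t tail;                              /* CPU-side tail copy     */
};

static inline void mmio_write32(uint64_t off, uint32_t val)
{
    *(volatile uint32_t *)(uintptr_t)(DEV_MMIO_BASE + off) = val;
}

/* CPU produces one entry at tail (len must be <= ENTRY_SIZE), then
 * synchronises with the device through a single MMIO write; completion
 * will be signalled back by an interrupt. */
static void ring_produce(struct ring *r, const void *data, size_t len)
{
    memcpy(r->entries[r->tail], data, len);   /* fill the entry          */
    r->tail = (r->tail + 1) % RING_ENTRIES;   /* advance the tail        */
    mmio_write32(DEV_REG_TAIL, r->tail);      /* "kick": MMIO sync point */
}
```
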
]
.right3col[.medium[
- **Synchronisation** happens through:
  - MMIO for CPU -> device (e.g. to init. device or start a DMA transfer)
  - Interrupts for device -> CPU (e.g. to signal the end of a DMA transfer or device-level events)
]]

???
- Because the memory is shared between the device and the CPU, they need to synchronise
- Memory-mapped I/O is used for CPU to device synchronisation, for example to signal the start of a DMA transfer
- And interrupts are used for device to CPU synchronisation, for example to notify the end of a DMA transfer

---
class: center, inverse, middle

# I/O Virtualisation without Hardware Support

???
- Now, how can this be virtualised, first entirely in software without hardware support?

---
# I/O Emulation

- The OS ⇔ device interface is in essence simple:
  - OS discovers and controls devices with MMIO/PIO
  - Devices respond with interrupts and DMA

???
- A first technique is I/O emulation
- We have seen that the interface between the OS and devices is quite simple
- The OS discovers and controls devices with memory-mapped I/O, and devices respond with interrupts and DMA

--

- **The hypervisor can completely emulate devices by:**
  - Making sure that every MMIO/PIO operation traps
    - PIOs are made through sensitive instructions, so they will trap
    - Map I/O memory as RO/not mapped

???
- The hypervisor can create a virtual device that entirely emulates the behaviour of a device behind that same interface exposed to the guest OS
- Of course we need every I/O-related action done by the guest OS to trap
- Port I/Os are done with sensitive instructions, so they will indeed trap
- The hypervisor also needs to map the device's I/O memory as inaccessible so that any MMIO access will trap

--

  - Injecting interrupts into the guest
    - By calling the handlers registered by the guest in a virtual interrupt controller

???
- Concerning device to CPU notifications, the hypervisor can also emulate them by injecting interrupts into the guest
- This is done by calling the handlers registered by the guest in the virtual interrupt controller, which is also handled by the hypervisor

--

  - Reading/writing from/to guest physical memory to emulate DMA

???
- Finally, to emulate DMA, the hypervisor can simply read and write to the relevant guest memory areas

---
# Qemu/KVM & I/O Virtualisation

- Qemu/KVM will spawn:
  - **1 thread per vCPU** (virtual core)
  - **1 thread per virtual device**
- E.g., VM with 2 vCPUs and 2 devices:
???
- With KVM and Qemu, the hypervisor uses 1 thread for each virtual core of the VM; we call these virtual CPUs, or vCPUs
- It also creates 1 thread for each virtual device
- A VM with 2 virtual cores and 2 virtual devices is illustrated on the slide

---
# Qemu/KVM & I/O Virtualisation
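As a rough sketch of how a guest MMIO access reaches the device model: the loop below is roughly what each vCPU thread runs; the KVM ioctl interface is real, but `emulate_mmio()` is a placeholder for the device emulation code (for instance the e1000 model shown later), and all VM/vCPU setup, including mapping `struct kvm_run`, is omitted:

```c
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>

/* Placeholder for the device model, e.g. the e1000 emulation shown later. */
void emulate_mmio(uint64_t addr, uint8_t *data, uint32_t len, int is_write);

/* Each vCPU thread runs the guest until it vmexits; MMIO accesses to
 * unmapped "device" memory come back to userspace as KVM_EXIT_MMIO. */
void vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        ioctl(vcpu_fd, KVM_RUN, 0);             /* enter the guest */

        switch (run->exit_reason) {
        case KVM_EXIT_MMIO:                     /* guest touched device memory */
            emulate_mmio(run->mmio.phys_addr, run->mmio.data,
                         run->mmio.len, run->mmio.is_write);
            break;
        case KVM_EXIT_IO:                       /* port-based I/O, handled similarly */
            break;
        default:
            return;                             /* shutdown, error, ... */
        }
    }
}
```

A trap on a device address shows up as `KVM_EXIT_MMIO`, at which point the I/O can be handed over to the corresponding device thread.
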
- Device threads handle most I/O operations asynchronously

???
- Assume the VM is running and one of the cores wants to perform I/O
- It initiates memory-mapped I/O communication with the virtual device
- This will trap to the hypervisor, which will defer the handling of that I/O to the thread managing the virtual device
- Assuming this is a long operation like a DMA transfer, the guest will resume after the transfer starts, and the hypervisor will inject an interrupt later when the transfer is done
- This mimics exactly what happens with a real device

---
# Qemu/KVM & I/O Virtualisation

- `lspci` output in a VM with 2 Network Interface Controllers (NICs):

```
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
*00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
*00:04.0 Ethernet controller: Red Hat, Inc Virtio network device
00:05.0 Communication controller: Red Hat, Inc Virtio console
```

Illustrates the 2 main I/O virtualisation methods without hardware support:
- **Full emulation** (Intel NIC)
- **Paravirtualised I/O** (virtio NIC)

???
- This is an example of what listing PCI devices in a standard Linux Qemu/KVM machine outputs
- You can see in this list 2 network cards
- The first, the Intel one, is a fully emulated device, working as we just described
- The second, the virtio one, is called a paravirtualised device
- We'll expand a bit on this type of device virtualisation soon

---
# Full Device Emulation

- Detailed view for the Intel virtual NIC:

```
00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
        Subsystem: Red Hat, Inc QEMU Virtual Machine
        Physical Slot: 3
        Flags: bus master, fast devsel, latency 0, IRQ 11
*       Memory at febc0000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at c000 [size=64]
        Expansion ROM at feb40000 [disabled] [size=256K]
*       Kernel driver in use: e1000
        Kernel modules: e1000
```

- 82540EM: very old but widespread NIC, any modern OS will have the corresponding driver (e1000)
- Registers mapped in physical memory at address `0xfebc0000`

???
- If we zoom in on the Intel fully emulated network card, we can see the address in memory where the memory-mapped I/O registers are located: `0xfebc0000`
- We can also see the name of the driver used for that virtual device: e1000
- This model of network card is quite old but also very widespread, so we know most OSes will have a driver for it, which is good for compatibility

---
## Full Device Emulation: e1000 registers

.small[

| Category | Name | Abbreviates | Offset | Description |
| -------- | ---- | ----------- | ------ | ----------- |
| Receive | `RDBAH` | Receive descriptor base address | `0x02800` | Base address of Rx ring |
| Receive | `RDLEN` | Receive descriptor length | `0x02808` | Rx ring size |
| Receive | `RDH` | Receive descriptor head | `0x02810` | Pointer to head of Rx ring |
| Receive | `RDT` | Receive descriptor tail | `0x02818` | Pointer to tail of Rx ring |
| Transmit | `TDBAH` | Transmit descriptor base address | `0x03800` | Base address of Tx ring |
| ... | ... | ... | ... | ... |
| Other | `STATUS` | Status | `0x00008` | Current device status |
| Other | `ICR` | Interrupt cause read | `0x000C0` | Cause of the last interrupt |
| ... | ... | ... | ... | ... |

For example to read the cause of the last interrupt (`ICR` register), the CPU reads at physical address base + offset: `0xfebc0000` + `0xc0` == `0xfebc00c0`

]

???
- If you look in the datasheet of the physical version of the Intel network card, you will find the list of memory-mapped I/O registers
- Each of them has a particular offset from the base address where these registers are mapped in memory
- Each also has a particular purpose: setting up ring buffers for DMA, indicating the status of the device, the cause of the last interrupt, and so on
- The emulated model for this device as implemented in Qemu mimics exactly the behaviour of the real network card when each of these registers is read or written by the virtual machine

---
# Full Device Emulation

- Trap to KVM and then Qemu for each interaction with the device's registers
- Qemu completely emulates in software the behaviour of the NIC
- Implemented in `hw/net/e1000.c` (https://github.com/qemu/qemu/blob/master/hw/net/e1000.c)
- For example for `ICR` (see the NIC's behaviour description in the data sheet here: https://bit.ly/3GCubZu):

```c
static uint32_t mac_icr_read(E1000State *s, int index)
{
    uint32_t ret = s->mac_reg[ICR];
    set_interrupt_cause(s, 0, 0);
    return ret;
}
```

???
- Each interaction with the device's memory-mapped registers traps to KVM first, which redirects I/O management to Qemu
- If you check out the link on the slide you can see the code for emulating the Intel network card
- It's not very large, less than 2 thousand lines of code
- You have a small excerpt here, which is the code executed when the VM reads the memory-mapped register corresponding to the cause of the last interrupt
- As you can see, that information is held in a data structure and is returned to the VM by the emulation code

---
# I/O Paravirtualisation (PV)

- **Most hardware devices have not been designed with virtualisation in mind**
  - E.g. sending/receiving a single Ethernet frame with e1000 involves multiple register accesses, i.e. several costly vmexits

???
- Full emulation is great for compatibility because we are emulating real devices that we know existing guest OSes will have the drivers for
- However, these real devices have never been designed with virtualisation in mind
- For that reason, communication between the VM and the emulated device involves a lot of vmexits, which are quite costly and hurt performance

--

- **I/O paravirtualisation: virtual devices implemented with the goal of minimising overhead**
  - At the cost of installing PV drivers in the guest, i.e. modifying the guest OS

???
- I/O paravirtualisation is an alternative approach, in which the relevant virtual devices are entirely designed with virtualisation in mind
- They don't correspond to any existing physical device, and are built with the goal of minimising the overhead
- Of course the downside is that new drivers for these paravirtualised devices must be integrated within the guest operating systems

--

- Standard framework for PV I/O in Qemu/KVM: **virtio**
  - PCIe (virtual) devices normally discovered at startup
  - Optimised ring buffer implementation minimising vmexits

???
- Virtio is the most popular paravirtualised device framework for Qemu/KVM
- It offers virtual PCIe devices optimised for high performance

---
# I/O Paravirtualisation (PV)
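To give an idea of why such devices cause fewer vmexits, here is the descriptor of a virtio split virtqueue written out as a C struct, following the layout given in the virtio specification; the guest fills descriptors and rings directly in shared memory and only notifies ("kicks") the device when needed, instead of trapping on every register access:

```c
#include <stdint.h>

/* One descriptor of a virtio "split" virtqueue, as laid out in the
 * virtio specification (struct virtq_desc); all fields are
 * little-endian in guest memory. */
struct virtq_desc {
    uint64_t addr;   /* guest-physical address of the buffer         */
    uint32_t len;    /* length of the buffer in bytes                */
    uint16_t flags;  /* e.g. NEXT (chained), WRITE (device-writable) */
    uint16_t next;   /* index of the next descriptor in the chain    */
};

#define VIRTQ_DESC_F_NEXT   1  /* buffer continues in the 'next' descriptor */
#define VIRTQ_DESC_F_WRITE  2  /* device writes to this buffer (vs. reads)  */
```

Descriptors are paired with an available ring and a used ring in the same shared memory, so in the common case a whole batch of buffers costs a single notification.
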
- Qemu offers PV I/O for network, disk (blk, scsi), console
- But also other functionalities that do not correspond to real devices:
  - Memory hotplug, filesystem sharing, etc.
- See https://github.com/qemu/qemu/tree/master/hw/virtio

???
- You have a few examples of virtio devices on the slide, for network, disk, or console
- But there are also virtio virtual devices that do not necessarily correspond to real physical hardware, enabling things like memory hotplug, sharing part of the host filesystem with the VM, etc.
- Because virtio is so popular, drivers for all the devices it proposes are already integrated into Linux

---
class: center, inverse, middle

# Hardware Support for I/O Virtualisation

???
- Because full device emulation is slow, and device paravirtualisation requires compromising equivalence
- Hardware technologies were also developed to support I/O virtualisation

---
# Direct Device Assignment
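On Linux/KVM, direct assignment (detailed in the bullets below) is typically set up through the kernel's VFIO interface; as a minimal sketch of the part that matters for the upcoming security discussion, this is roughly how a VMM declares which guest memory the assigned device may DMA to, assuming `container_fd`, `guest_ram` and `guest_ram_size` come from an already-configured VFIO container and the VM's RAM allocation:

```c
#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <stdint.h>

/* Map the VM's RAM into the device's IOMMU domain: the device can then
 * DMA only within [iova, iova + size), i.e. the guest's own memory.
 * container_fd, guest_ram and guest_ram_size are assumed to be set up
 * elsewhere. */
int map_guest_ram_for_dma(int container_fd, void *guest_ram,
                          uint64_t guest_ram_size)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uint64_t)(uintptr_t)guest_ram, /* host virtual address   */
        .iova  = 0,                              /* guest-physical address */
        .size  = guest_ram_size,
    };

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```

Whether this restriction is actually enforced is the job of the IOMMU, discussed on the next slides.
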
- Direct assignment/PCI passthrough: **give a VM full and exclusive access to a device**

???
- A first and rather simple solution was to give a VM direct access to a device, bypassing the hypervisor

--

- Bypass the hypervisor to get native performance
- 2 fundamental issues:
  - **Security**: in our example VM3 controls the device, which can DMA anywhere in the host physical memory!
  - **Scalability**: can't have a dedicated device for each VM on a host with many VMs

???
- This is great from the performance and equivalence point of view, but it creates 2 obvious issues
- First, security: because the hypervisor is not involved anymore, the VM can freely control the device and in particular it can configure it to DMA everywhere in physical memory
- That's a clear breach of the safety criterion
- Second, in terms of scalability, each device can only be used by a single VM, so it's not very practical

---
# IOMMU

- Traditionally, DMA bypasses the MMU and operates directly on physical memory
- So a VM with a directly assigned device would be able to:
  - **Read/write anywhere in physical memory on the host**
  - **Trigger any interrupt vector**

???
- The security problem is due to the fact that DMA bypasses the MMU and operates directly on physical memory
- The VM controlling the device can then read or write anywhere in physical memory, including in the areas allocated to other VMs or to the hypervisor
- The VM can also force the device to trigger arbitrary interrupt vectors and can possibly inject interrupts into the host or other VMs

--

- **IOMMU** (VT-d in Intel CPUs) addresses these security issues with mainly two features:

???
- The solution to these problems is the IOMMU

--

  - DMAR, the **DMA remapping engine**, enforces Page Table and Extended Page Table permissions to prevent a malicious device from DMAing outside of its allocated memory regions

???
- It contains mainly 2 technologies
- First, the ability to enforce the permissions set by page tables and extended page tables on DMA requests
- This way we can make sure that a VM with direct device assignment can only access the memory it is allocated

--

  - IR, the **interrupt remapping engine**, routes interrupts to the target VMs and prevents a malicious device from injecting interrupts into the host/a wrong VM

???
- Second, the interrupt remapping engine routes all interrupts from a given device to the VM which has direct access to this device
- This prevents interrupt injection attacks

---
# SR-IOV

- Addresses the scalability issue of direct device assignment

???
- A second technology is SR-IOV, which stands for Single Root I/O Virtualisation
- It tackles the scalability issue

--

- **An SR-IOV-enabled device can present several instances of itself**
  - Each assigned to a different VM
  - The hardware multiplexes itself

???
- A device supporting SR-IOV can present several instances of itself, and each instance can be directly assigned to a VM
- Doing so, the hardware virtualises and multiplexes itself

--

- A device has at least one **Physical Function** controlled by the hypervisor
  - The instances of itself visible to VMs are called **Virtual Functions**

???
- A device has one physical function which is controlled by the hypervisor, and that allows it to create virtual functions, each representing a virtualised instance of the device directly assigned to a VM

--

.leftcol[
]

???
- An example of an SR-IOV enabled network card is illustrated here, with a physical function controlled by the hypervisor, and 2 virtual functions, each assigned directly to a different virtual machine

--

.rightcol[.medium[
- Theoretically SR-IOV can have up to 64K VFs
- Recent NICs can create up to 2K VFs (e.g. Mellanox/Nvidia ConnectX-7)
]]

???
- Today, a modern SR-IOV device can create thousands of virtual functions

---
# Wrapping Up

- I/O virtualisation without hardware support:
  - **Full device emulation** uses a software model of an existing device
    - Good for compatibility, bad for performance
  - **I/O paravirtualisation** uses a software model designed with virtualisation in mind
    - Better performance, requires guest to install PV drivers
- Hardware support:
  - **IOMMU** allows safe direct device assignment
  - **SR-IOV** allows scalable device assignment

???
- To sum up, here we first covered I/O virtualisation in software without hardware support
- Full device emulation is good for compatibility with existing operating systems, but it is slow
- Device paravirtualisation makes things faster by lowering the number of switches between the guest and the hypervisor, at the cost of breaking compatibility by requiring guest operating systems to use dedicated drivers
- We also covered hardware support
- It involves directly assigning devices to VMs, in combination with 2 key technologies
- The IOMMU to allow safe direct device assignment
- And SR-IOV to make things scalable