class: center, middle
background-image: url(include/title-background.svg)

# .right[Virtualisation 101]
### .right[I/O Virtualisation]

.right[Pierre Olivier

.right[[pierre.olivier@manchester.ac.uk](mailto:pierre.olivier@manchester.ac.uk)]
]

---

# I/O Interposition

- As with CPU and memory virtualisation, software methods for I/O virtualisation predate hardware support
---

# Benefits of I/O Interposition

- **I/O device consolidation**: map multiple virtual devices to a smaller set of physical devices
  - Lets several VMs share one or a small number of devices
  - Increases utilisation and efficiency, reduces costs

--

- **The hypervisor can capture the entire state of a device**
  - Useful for VM suspend/resume
  - And for migration, including between hosts with different physical devices

--

- **Add features not supported by physical devices**: e.g. disk snapshots, compression, etc.

--

- **Aggregate several physical devices into a single virtual one**, e.g. for performance reasons

---

class: center, middle, inverse

# Physical I/O

---

# Physical I/O

.leftcol[
]
.rightcol[.medium[
- 3 main mechanisms for devices to interact with the system:
]]

---

# Physical I/O

.leftcol[
]
.rightcol[.medium[
- 3 main mechanisms for devices to interact with the system:
  - **Port-based or Memory-mapped I/O**
    - Basic form, the CPU reads/writes addresses mapped to device registers
  - **Direct Memory Access**
    - The CPU initiates a transfer and is later notified when it completes
  - **Interrupts**
    - Sent by devices to the CPU for notification
]]

---

# Data Transfer through Ring Buffers

.left3col[
]
.right3col[.medium[
**Ring buffer: producer-consumer circular buffer**

- Area in memory for CPU/device streaming communication
- Described in the device's registers, set up by the CPU using MMIO/PIO
]]

---

# Data Transfer through Ring Buffers

.left3col[
]
.right3col[.medium[
- In the case of a CPU to device transfer:
  - The device consumes from `head` and updates the pointer
]]

---

# Data Transfer through Ring Buffers

.left3col[
]
.right3col[.medium[
- In the case of a CPU to device transfer:
  - The device consumes from `head` and updates the pointer
  - The CPU produces (through the memory controller) at `tail` and updates the pointer
]]

--

A device can have 2 separate buffers for transmit/receive (e.g. network adapter) or a single one (e.g. disk)

---

# Data Transfer through Ring Buffers

.left3col[
]
.right3col[.medium[
- **Synchronisation** happens through:
  - MMIO (CPU to device, e.g. to initialise the device, start a DMA transfer, etc.)
  - Interrupts (device to CPU, e.g. to signal the end of a DMA transfer, device-level events, etc.)
]]

---

class: center, inverse, middle

# I/O Virtualisation without Hardware Support

---

# I/O Emulation

- The OS ⇔ device interface is in essence simple:
  - The OS discovers and controls devices with MMIO/PIO
  - Devices respond with interrupts and DMA

--

- **The hypervisor can completely emulate devices by:**
  - Making sure that every MMIO/PIO operation traps
    - PIO is performed through sensitive instructions, which will trap
    - I/O memory is mapped read-only or left unmapped

--

  - Injecting interrupts into the guest
    - By calling the handlers registered by the guest in a virtual interrupt controller

--

  - Reading/writing from/to guest physical memory to emulate DMA

---

# Qemu/KVM & I/O Virtualisation

- Qemu/KVM will spawn:
  - **1 thread per vCPU** (virtual core)
  - **1 thread per virtual device**
- E.g., a VM with 2 vCPUs and 2 devices:
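
Putting trap-and-emulate and the per-vCPU threads together, a minimal sketch of what each vCPU thread's main loop could look like on top of the KVM API (heavily simplified compared to real Qemu code; `emulate_mmio()` and `emulate_pio()` are hypothetical stand-ins for the device models):

```c
/* Sketch only, not real Qemu code: one vCPU thread enters the guest with
 * KVM_RUN and forwards trapped I/O accesses to the device emulation code */
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* hypothetical hooks into the device models */
void emulate_mmio(__u64 gpa, __u8 *data, __u32 len, int is_write);
void emulate_pio(__u16 port, void *data, __u8 size, __u32 count, int is_out);

void vcpu_thread_loop(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        ioctl(vcpu_fd, KVM_RUN, 0);                 /* run guest code */

        switch (run->exit_reason) {
        case KVM_EXIT_MMIO:                         /* trapped MMIO access */
            emulate_mmio(run->mmio.phys_addr, run->mmio.data,
                         run->mmio.len, run->mmio.is_write);
            break;
        case KVM_EXIT_IO:                           /* trapped port I/O */
            emulate_pio(run->io.port, (char *)run + run->io.data_offset,
                        run->io.size, run->io.count,
                        run->io.direction == KVM_EXIT_IO_OUT);
            break;
        default:                                    /* halt, shutdown, ... */
            break;
        }
    }
}
```

The trapped access is resolved by the device model and the loop re-enters the guest; heavier work can be deferred to the device threads mentioned above.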
---

# Qemu/KVM & I/O Virtualisation
- Device threads handle most I/O operations asynchronously

---

# Qemu/KVM & I/O Virtualisation

- `lspci` output in a VM with 2 Network Interface Controllers (NICs):

```
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
*00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
*00:04.0 Ethernet controller: Red Hat, Inc Virtio network device
00:05.0 Communication controller: Red Hat, Inc Virtio console
```

Illustrates the 2 main I/O virtualisation methods without hardware support:
- **Full emulation** (Intel NIC)
- **Paravirtualised I/O** (virtio NIC)

---

# Full Device Emulation

- Detailed view for the Intel virtual NIC:

```
00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
        Subsystem: Red Hat, Inc QEMU Virtual Machine
        Physical Slot: 3
        Flags: bus master, fast devsel, latency 0, IRQ 11
*       Memory at febc0000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at c000 [size=64]
        Expansion ROM at feb40000 [disabled] [size=256K]
*       Kernel driver in use: e1000
        Kernel modules: e1000
```

- 82540EM: very old but widespread NIC, any modern OS will have the corresponding driver (e1000)
- Registers mapped in physical memory at address `0xfebc0000`

---

## Full Device Emulation: e1000 registers

.small[

| Category | Name | Stands for | Offset | Description |
| -------- | ---- | ---------- | ------ | ----------- |
| Receive  | `RDBAH`  | Receive descriptor base address  | `0x02800` | Base address of Rx ring |
| Receive  | `RDLEN`  | Receive descriptor length        | `0x02808` | Rx ring size |
| Receive  | `RDH`    | Receive descriptor head          | `0x02810` | Pointer to head of Rx ring |
| Receive  | `RDT`    | Receive descriptor tail          | `0x02818` | Pointer to tail of Rx ring |
| Transmit | `TDBAH`  | Transmit descriptor base address | `0x03800` | Base address of Tx ring |
| ...      | ...      | ...                              | ...       | ... |
| Other    | `STATUS` | Status                           | `0x00008` | Current device status |
| Other    | `ICR`    | Interrupt cause read             | `0x000C0` | Cause of the last interrupt |
| ...      | ...      | ...                              | ...       | ... |

For example, to read the cause of the last interrupt (`ICR` register), the CPU reads at physical address base + offset: `0xfebc0000` + `0xc0` == `0xfebc00c0`
]

---

# Full Device Emulation

- Each interaction with the device's registers traps to KVM and then to Qemu
- Qemu completely emulates the behaviour of the NIC in software
  - Implemented in `hw/net/e1000.c` (https://github.com/qemu/qemu/blob/master/hw/net/e1000.c)
- For example, for `ICR` (see the NIC's behaviour description in the data sheet: https://bit.ly/3GCubZu):

```c
static uint32_t mac_icr_read(E1000State *s, int index)
{
    uint32_t ret = s->mac_reg[ICR];

    set_interrupt_cause(s, 0, 0);
    return ret;
}
```

---

# I/O Paravirtualisation (PV)

- **Most hardware devices have not been designed with virtualisation in mind**
  - E.g. sending/receiving a single Ethernet frame with e1000 involves multiple register accesses, i.e. several costly vmexits

--

- **I/O paravirtualisation: virtual devices implemented with the goal of minimising overhead**
  - At the cost of installing PV drivers in the guest, i.e. modifying the guest OS
--

- Standard framework for PV I/O in Qemu/KVM: **virtio**
  - (Virtual) PCIe devices, discovered normally at startup
  - Optimised ring buffer (*virtqueue*) implementation minimising vmexits (sketched on the next slide)

---

# I/O Paravirtualisation (PV)
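
To make the virtqueue idea concrete, a minimal sketch of the *split virtqueue* layout shared between driver and device; types are simplified from the virtio specification, this is not the actual Linux/Qemu source:

```c
/* Simplified sketch of a virtio split virtqueue: all three parts live in
 * guest memory that the (virtual) device can also access */
#include <stdint.h>

struct vring_desc {           /* descriptor table: one entry per buffer */
    uint64_t addr;            /* guest-physical address of the buffer */
    uint32_t len;             /* buffer length in bytes */
    uint16_t flags;           /* e.g. NEXT (chaining), WRITE (device writes) */
    uint16_t next;            /* index of the next descriptor in a chain */
};

struct vring_avail {          /* driver -> device: buffers ready to process */
    uint16_t flags;
    uint16_t idx;             /* where the driver will place the next entry */
    uint16_t ring[];          /* indices into the descriptor table */
};

struct vring_used_elem {
    uint32_t id;              /* head of the descriptor chain that completed */
    uint32_t len;             /* number of bytes written by the device */
};

struct vring_used {           /* device -> driver: completed buffers */
    uint16_t flags;
    uint16_t idx;
    struct vring_used_elem ring[];
};
```

The guest driver fills descriptors and the available ring purely through shared memory and only *notifies* the device (a single register write, hence a single vmexit) when needed; the device posts completions in the used ring and raises one interrupt, rather than one trap per register access as with full emulation.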
- Qemu offers PV I/O for network, disk (blk, scsi), and console devices
- But also other functionalities that do not correspond to real devices:
  - Memory hotplug, filesystem sharing, etc.
- See https://github.com/qemu/qemu/tree/master/hw/virtio

---

# PV I/O and Type-I Hypervisors

.leftcol[
]
.rightcol[
- Can't implement a driver for every device in Xen
- The existing OS (Linux) in Dom0 already has a huge driver database
- Use PV I/O and a split driver model: a **backend** driver in Dom0 connects to the real driver
- Only a single disk/network PV driver (**frontend**) needs to be implemented in the guest
]

---

class: center, inverse, middle

# Hardware Support for I/O Virtualisation

---

# Direct Device Assignment
- Direct assignment/PCI passthrough: **gives a VM full and exclusive access to a device**

--

- Bypasses the hypervisor to get native performance
- 2 fundamental issues:
  - **Security**: in our example VM3 controls the device, which can DMA anywhere in the host physical memory!
  - **Scalability**: can't have a dedicated device for each VM on a host running many VMs

---

# IOMMU

- DMA bypasses the MMU and operates directly on physical memory
- So a VM with a directly assigned device would be able to:
  - **Read/write anywhere in physical memory on the host**
  - **Trigger any interrupt vector**

--

- The **IOMMU** (VT-d in Intel CPUs) addresses these security issues with mainly two features:

--

  - DMAR, the **DMA remapping engine**, enforces Page Table and Extended Page Table permissions to prevent a malicious device from DMAing outside of its allocated memory regions

--

  - IR, the **interrupt remapping engine**, routes interrupts to the target VMs and prevents a malicious device from injecting interrupts into the host or the wrong VM

---

# SR-IOV

- Addresses the scalability issue of direct device assignment

--

- **An SR-IOV-enabled device can present several instances of itself**
  - Each assigned to a different VM
  - The hardware multiplexes itself

--

- A device has at least one **Physical Function** (PF) controlled by the hypervisor
- The instances of itself visible to VMs are called **Virtual Functions** (VFs)

--

.leftcol[
]

--

.rightcol[ .medium[
- Theoretically, SR-IOV allows up to 64K VFs per device
- Recent NICs can create up to 2K VFs (e.g. Mellanox/Nvidia ConnectX-7)
]]

---

class: inverse, middle, center

# Wrapping Up

---

# Wrapping Up

- I/O virtualisation without hardware support:

--

  - **Full device emulation** uses a software model of an existing device
    - Good for compatibility, bad for performance

--

  - **I/O paravirtualisation** uses a software model designed with virtualisation in mind
    - Better performance, but requires the guest to install PV drivers

--

- Hardware support:
  - **IOMMU** allows safe direct device assignment
  - **SR-IOV** allows scalable device assignment
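
As a practical endnote on SR-IOV: on Linux, VFs are typically instantiated through the PF driver's standard sysfs attributes before being assigned to VMs; a hedged example, where the PCI address and VF count are made up:

```
# How many VFs this PF supports
cat /sys/bus/pci/devices/0000:03:00.0/sriov_totalvfs
# Create 4 VFs; each then appears as its own PCI device that can be
# assigned to a VM (e.g. through VFIO)
echo 4 > /sys/bus/pci/devices/0000:03:00.0/sriov_numvfs
```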