class: center, middle
background-image: url(include/title-background.svg)

# .right[Virtualisation 101]
### .right[I/O Virtualisation]

.right[Pierre Olivier

.right[[pierre.olivier@manchester.ac.uk](mailto:pierre.olivier@manchester.ac.uk)]
]

---

# I/O Interposition

- As with CPU and memory virtualisation, software methods for I/O virtualisation predate hardware support
---

# Benefits of I/O Interposition

- **I/O device consolidation**: map multiple virtual devices to a smaller set of physical devices
  - Lets several VMs share one or a small number of devices
  - Increases utilisation and efficiency, reduces costs

--

- **The hypervisor can capture the entire state of a device**
  - Useful for VM suspend/resume
  - And for migration, including between hosts with different physical devices

--

- **Add features not supported by physical devices**: e.g. disk snapshots, compression, etc.

--

- **Aggregate several physical devices into a single virtual one**, e.g. for performance reasons

---

class: center, middle, inverse

# Physical I/O

---

# Physical I/O

.leftcol[
]
.rightcol[.medium[
- 3 main mechanisms for devices to interact with the system:
]]

---

# Physical I/O

.leftcol[
]
.rightcol[.medium[
- 3 main mechanisms for devices to interact with the system:
  - **Port-based or Memory-mapped I/O**
    - Basic form, the CPU reads/writes addresses mapped to device registers
  - **Direct Memory Access**
    - The CPU initiates a transfer and is later notified when it completes
  - **Interrupts**
    - Sent by devices to the CPU for notification
]]

---

# Data Transfer through Ring Buffers

.left3col[
]
.right3col[.medium[
**Ring buffer: producer-consumer circular buffer**

- Area in memory for CPU/device streaming communication
- Described in the device's registers, set up by the CPU using MMIO/PIO
]]

---

# Data Transfer through Ring Buffers

.left3col[
]
.right3col[.medium[
- In the case of a CPU to device transfer:
  - The device consumes from `head` and updates the pointer
]]

---

# Data Transfer through Ring Buffers

.left3col[
]
.right3col[.medium[
- In the case of a CPU to device transfer:
  - The device consumes from `head` and updates the pointer
  - The CPU produces (through the memory controller) at `tail` and updates the pointer
]]

--

A device can have 2 separate buffers for transmit/receive (e.g. network adapter) or a single one (e.g. disk)

---

# Data Transfer through Ring Buffers

.left3col[
]
.right3col[.medium[
- **Synchronisation** happens through:
  - MMIO (CPU to device, e.g. to initialise the device, start a DMA transfer, etc.)
  - Interrupts (device to CPU, e.g. to signal the end of a DMA transfer, device-level events, etc.)
]]

---

class: center, inverse, middle

# I/O Virtualisation without Hardware Support

---

# I/O Emulation

- The OS ⇔ device interface is in essence simple:
  - The OS discovers and controls devices with MMIO/PIO
  - Devices respond with interrupts and DMA

--

- **The hypervisor can completely emulate devices by:**
  - Making sure that every MMIO/PIO operation traps
    - PIO is performed through sensitive instructions, which will trap
    - I/O memory is mapped read-only or left unmapped

--

  - Injecting interrupts into the guest
    - By calling the handlers registered by the guest in a virtual interrupt controller

--

  - Reading/writing from/to guest physical memory to emulate DMA

---

# Qemu/KVM & I/O Virtualisation

- Qemu/KVM will spawn:
  - **1 thread per vCPU** (virtual core)
  - **1 thread per virtual device**
- E.g., a VM with 2 vCPUs and 2 devices:
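
Putting trap-and-emulate and the per-vCPU threads together, a minimal sketch of what each vCPU thread's main loop could look like on top of the KVM API (heavily simplified compared to real Qemu code; `emulate_mmio()` and `emulate_pio()` are hypothetical stand-ins for the device models):

```c
/* Sketch only, not real Qemu code: one vCPU thread enters the guest with
 * KVM_RUN and forwards trapped I/O accesses to the device emulation code */
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* hypothetical hooks into the device models */
void emulate_mmio(__u64 gpa, __u8 *data, __u32 len, int is_write);
void emulate_pio(__u16 port, void *data, __u8 size, __u32 count, int is_out);

void vcpu_thread_loop(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        ioctl(vcpu_fd, KVM_RUN, 0);                 /* run guest code */

        switch (run->exit_reason) {
        case KVM_EXIT_MMIO:                         /* trapped MMIO access */
            emulate_mmio(run->mmio.phys_addr, run->mmio.data,
                         run->mmio.len, run->mmio.is_write);
            break;
        case KVM_EXIT_IO:                           /* trapped port I/O */
            emulate_pio(run->io.port, (char *)run + run->io.data_offset,
                        run->io.size, run->io.count,
                        run->io.direction == KVM_EXIT_IO_OUT);
            break;
        default:                                    /* halt, shutdown, ... */
            break;
        }
    }
}
```

The trapped access is resolved by the device model and the loop re-enters the guest; heavier work can be deferred to the device threads mentioned above.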
---

# Qemu/KVM & I/O Virtualisation
- Device threads handle most I/O operations asynchronously

---

# Qemu/KVM & I/O Virtualisation

- `lspci` output in a VM with 2 Network Interface Controllers (NICs):

```
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
*00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
*00:04.0 Ethernet controller: Red Hat, Inc Virtio network device
00:05.0 Communication controller: Red Hat, Inc Virtio console
```

Illustrates the 2 main I/O virtualisation methods without hardware support:
- **Full emulation** (Intel NIC)
- **Paravirtualised I/O** (virtio NIC)

---

# Full Device Emulation

- Detailed view for the Intel virtual NIC:

```
00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
        Subsystem: Red Hat, Inc QEMU Virtual Machine
        Physical Slot: 3
        Flags: bus master, fast devsel, latency 0, IRQ 11
*       Memory at febc0000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at c000 [size=64]
        Expansion ROM at feb40000 [disabled] [size=256K]
*       Kernel driver in use: e1000
        Kernel modules: e1000
```

- 82540EM: very old but widespread NIC, any modern OS will have the corresponding driver (e1000)
- Registers mapped in physical memory at address `0xfebc0000`

---

## Full Device Emulation: e1000 registers

.small[

| Category | Name | Stands for | Offset | Description |
| -------- | ---- | ---------- | ------ | ----------- |
| Receive  | `RDBAH`  | Receive descriptor base address  | `0x02800` | Base address of Rx ring |
| Receive  | `RDLEN`  | Receive descriptor length        | `0x02808` | Rx ring size |
| Receive  | `RDH`    | Receive descriptor head          | `0x02810` | Pointer to head of Rx ring |
| Receive  | `RDT`    | Receive descriptor tail          | `0x02818` | Pointer to tail of Rx ring |
| Transmit | `TDBAH`  | Transmit descriptor base address | `0x03800` | Base address of Tx ring |
| ...      | ...      | ...                              | ...       | ... |
| Other    | `STATUS` | Status                           | `0x00008` | Current device status |
| Other    | `ICR`    | Interrupt cause read             | `0x000C0` | Cause of the last interrupt |
| ...      | ...      | ...                              | ...       | ... |

For example, to read the cause of the last interrupt (`ICR` register), the CPU reads at physical address base + offset: `0xfebc0000` + `0xc0` == `0xfebc00c0`
]

---

# Full Device Emulation

- Each interaction with the device's registers traps to KVM and then to Qemu
- Qemu completely emulates the behaviour of the NIC in software
  - Implemented in `hw/net/e1000.c` (https://github.com/qemu/qemu/blob/master/hw/net/e1000.c)
- For example, for `ICR` (see the NIC's behaviour description in the data sheet: https://bit.ly/3GCubZu):

```c
static uint32_t mac_icr_read(E1000State *s, int index)
{
    uint32_t ret = s->mac_reg[ICR];

    set_interrupt_cause(s, 0, 0);
    return ret;
}
```

---

# I/O Paravirtualisation (PV)

- **Most hardware devices have not been designed with virtualisation in mind**
  - E.g. sending/receiving a single Ethernet frame with e1000 involves multiple register accesses, i.e. several costly vmexits

--

- **I/O paravirtualisation: virtual devices implemented with the goal of minimising overhead**
  - At the cost of installing PV drivers in the guest, i.e. modifying the guest OS
--

- Standard framework for PV I/O in Qemu/KVM: **virtio**
  - (Virtual) PCIe devices, discovered normally at startup
  - Optimised ring buffer (*virtqueue*) implementation minimising vmexits (sketched on the next slide)

---

# I/O Paravirtualisation (PV)
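
To make the virtqueue idea concrete, a minimal sketch of the *split virtqueue* layout shared between driver and device; types are simplified from the virtio specification, this is not the actual Linux/Qemu source:

```c
/* Simplified sketch of a virtio split virtqueue: all three parts live in
 * guest memory that the (virtual) device can also access */
#include <stdint.h>

struct vring_desc {           /* descriptor table: one entry per buffer */
    uint64_t addr;            /* guest-physical address of the buffer */
    uint32_t len;             /* buffer length in bytes */
    uint16_t flags;           /* e.g. NEXT (chaining), WRITE (device writes) */
    uint16_t next;            /* index of the next descriptor in a chain */
};

struct vring_avail {          /* driver -> device: buffers ready to process */
    uint16_t flags;
    uint16_t idx;             /* where the driver will place the next entry */
    uint16_t ring[];          /* indices into the descriptor table */
};

struct vring_used_elem {
    uint32_t id;              /* head of the descriptor chain that completed */
    uint32_t len;             /* number of bytes written by the device */
};

struct vring_used {           /* device -> driver: completed buffers */
    uint16_t flags;
    uint16_t idx;
    struct vring_used_elem ring[];
};
```

The guest driver fills descriptors and the available ring purely through shared memory and only *notifies* the device (a single register write, hence a single vmexit) when needed; the device posts completions in the used ring and raises one interrupt, rather than one trap per register access as with full emulation.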
- Qemu offers PV I/O for network, disk (blk, scsi), and console devices
- But also other functionalities that do not correspond to real devices:
  - Memory hotplug, filesystem sharing, etc.
- See https://github.com/qemu/qemu/tree/master/hw/virtio

---

# PV I/O and Type-I Hypervisors

.leftcol[
]
.rightcol[
- Can't implement a driver for every device in Xen
- The existing OS (Linux) in Dom0 already has a huge driver database
- Use PV I/O and a split driver model: a **backend** driver in Dom0 connects to the real driver
- Only a single disk/network PV driver (**frontend**) needs to be implemented in the guest
]

---

class: center, inverse, middle

# Hardware Support for I/O Virtualisation

---

# Direct Device Assignment
- Direct assignment/PCI passthrough: **gives a VM full and exclusive access to a device**

--

- Bypasses the hypervisor to get native performance
- 2 fundamental issues:
  - **Security**: in our example VM3 controls the device, which can DMA anywhere in the host physical memory!
  - **Scalability**: can't have a dedicated device for each VM on a host running many VMs

---

# IOMMU

- DMA bypasses the MMU and operates directly on physical memory
- So a VM with a directly assigned device would be able to:
  - **Read/write anywhere in physical memory on the host**
  - **Trigger any interrupt vector**

--

- The **IOMMU** (VT-d in Intel CPUs) addresses these security issues with mainly two features:

--

  - DMAR, the **DMA remapping engine**, enforces Page Table and Extended Page Table permissions to prevent a malicious device from DMAing outside of its allocated memory regions

--

  - IR, the **interrupt remapping engine**, routes interrupts to the target VMs and prevents a malicious device from injecting interrupts into the host or the wrong VM

---

# SR-IOV

- Addresses the scalability issue of direct device assignment

--

- **An SR-IOV-enabled device can present several instances of itself**
  - Each assigned to a different VM
  - The hardware multiplexes itself

--

- A device has at least one **Physical Function** (PF) controlled by the hypervisor
- The instances of itself visible to VMs are called **Virtual Functions** (VFs)

--

.leftcol[
]

--

.rightcol[ .medium[
- Theoretically, SR-IOV allows up to 64K VFs per device
- Recent NICs can create up to 2K VFs (e.g. Mellanox/Nvidia ConnectX-7)
]]

---

class: inverse, middle, center

# Wrapping Up

---

# Wrapping Up

- I/O virtualisation without hardware support:

--

  - **Full device emulation** uses a software model of an existing device
    - Good for compatibility, bad for performance

--

  - **I/O paravirtualisation** uses a software model designed with virtualisation in mind
    - Better performance, but requires the guest to install PV drivers

--

- Hardware support:
  - **IOMMU** allows safe direct device assignment
  - **SR-IOV** allows scalable device assignment
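
As a practical endnote on SR-IOV: on Linux, VFs are typically instantiated through the PF driver's standard sysfs attributes before being assigned to VMs; a hedged example, where the PCI address and VF count are made up:

```
# How many VFs this PF supports
cat /sys/bus/pci/devices/0000:03:00.0/sriov_totalvfs
# Create 4 VFs; each then appears as its own PCI device that can be
# assigned to a VM (e.g. through VFIO)
echo 4 > /sys/bus/pci/devices/0000:03:00.0/sriov_numvfs
```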