class: center, middle ### Secure Computer Architecture and Systems *** # Memory Management ??? - Hi everyone, here we are going to cover another key responsibility of operating systems - Memory management --- # Memory Management - **Memory management** is the set of OS features managing the allocation of and accesses to memory - Memory allocation/deallocation for applications and for the kernel - Setting up and maintaining address spaces for processes and for the kernel - Enforcing memory protection (isolation) between processes and the kernel, and between processes themselves - Swapping - Etc. ??? - It corresponds to the set of features implemented by the operating system to manage memory allocation, protection, and accesses - Things like allocation and deallocation of memory for applications and for the kernel - Address space management - Isolating the memory belonging to different processes, and isolating the kernel's memory from access by processes - Swapping, which is the use of secondary storage to extend main memory - And more --- # Virtual Memory .leftcol[ - CPU accesses memory with load and store instructions ] .rightcol[
] ??? - You are probably already aware of the concept of virtual memory, so just a quick refresher here - The CPU accesses memory with load and store instructions --- # Virtual Memory .leftcol[ - CPU accesses memory with load and store instructions - At boot time the OS enables virtual memory: every load/store now hits a virtual address - MMU translates **transparently** to physical addresses ] .rightcol[
] ??? - In the vast majority of modern systems the OS enables virtual memory very early during the boot process - From that moment all the addresses targeted by loads and stores will be virtual; the CPU won't address physical memory directly anymore - The translation between the virtual addresses the CPU requests to access and the actual physical memory they correspond to is done transparently by the MMU --- # Virtual Memory .leftcol[ - CPU accesses memory with load and store instructions - At boot time the OS enables virtual memory: every load/store now hits a virtual address - MMU translates **transparently** to physical addresses - Old implementation: **segmentation** ] .rightcol[
] ??? - An old implementation of virtual memory is segmentation - An app gets access to a relatively small virtual address space, a segment, whose size is smaller than the total RAM size - That address space is then mapped to physical memory contiguously: in effect the address translation just consists of adding an offset to a virtual address to obtain the corresponding physical one --- # Virtual Memory .leftcol[ - CPU accesses memory with load and store instructions - At boot time the OS enables virtual memory: every load/store now hits a virtual address - MMU translates **transparently** to physical addresses - Old implementation: **segmentation** ] .rightcol[
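To make the "translation is just a bounds check plus an offset" idea concrete, here is a minimal user-space sketch; the segment base and limit values are made up for illustration:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* A segment is described by its start in physical memory and its size */
struct segment {
    uint64_t base;   /* physical address where the segment starts */
    uint64_t limit;  /* segment size in bytes */
};

/* Segmentation-style translation: fault if out of bounds, else add the base */
static uint64_t translate(const struct segment *seg, uint64_t vaddr)
{
    if (vaddr >= seg->limit) {
        fprintf(stderr, "fault: 0x%" PRIx64 " outside the segment\n", vaddr);
        exit(1);
    }
    return seg->base + vaddr;
}

int main(void)
{
    struct segment s = { .base = 0x200000, .limit = 0x10000 };
    printf("virtual 0x1234 -> physical 0x%" PRIx64 "\n", translate(&s, 0x1234));
    return 0;
}
```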
] ??? - Each process would get its own segment, mapped at different locations in physical memory to ensure isolation between processes - Overall segmentation was not very flexible and brought issues such as fragmentation --- # Virtual Memory .leftcol[ - CPU accesses memory with load and store instructions - At boot time the OS enables virtual memory: every load/store now hits a virtual address - MMU translates **transparently** to physical addresses - Old implementation: **segmentation** - Today: **paging** ] .rightcol[
] ??? - Rather than segmentation, the vast majority of modern systems use paging to implement virtual memory - With paging, almost the entire range addressable given the width of virtual addresses makes up each process's address space - For example, on most 64-bit Intel CPUs virtual addresses are 48 bits wide, which gives a virtual address space of 256 TB for each process - Of course most of that address space is not mapped to physical memory - With paging the mapping is achieved at the granularity of 4KB pages - A data structure called the page table defines what virtual pages are mapped to physical memory, and where --- # Virtual Memory .leftcol[ - CPU accesses memory with load and store instructions - At boot time the OS enables virtual memory: every load/store now hits a virtual address - MMU translates **transparently** to physical addresses - Old implementation: **segmentation** - Today: **paging** ] .rightcol[
] ??? - Each process has a different page table and, unless shared memory has been explicitly established, processes do not share physical pages - This way they are fully isolated from each other --- # Paging - Virtual memory mapped to physical memory at the granularity of a **page** (4KB) - Address translation for 1 process defined by its **page table** - Indicates what virtual pages are mapped to what physical pages - 1 page table == 1 address space, so there is 1 per process ??? - As I was saying, with paging virtual memory is mapped to physical memory at the granularity of 4KB pages - The page table indicates what virtual pages are mapped to what physical pages - There is one page table, defining one address space, per process in the system -- - Page tables are 1. **Set up/controlled by the OS** 2. **Walked transparently by the MMU** when the CPU performs loads and stores ??? - Concretely, the way a page table handling the address space of a process works is as follows 1. The OS sets up the page table when the process is created. The OS also maintains the page table when new mappings need to be added/removed, for example when the process loads a shared library or allocates memory. 2. Once installed, a page table is walked transparently by the MMU to achieve the translation when the CPU runs the process in question and accesses memory. --- # Page Tables - Using a linear array with one entry per virtual page would be highly inefficient - A 64-bit virtual address space is very sparse, most pages are not mapped ??? - Intuitively you may think that the page table is a large linear array with a slot per virtual page and, in each slot, translation information regarding that page - That would be a huge waste of memory because modern 64-bit address spaces are very large: there are many pages, but most of them are not mapped -- - Instead, use a **tree** of pages holding page table data - 4 levels on most modern 64-bit CPUs ??? - Instead, the page table is a tree - The tree is made of pointers linking together special pages in physical memory used for address translation - Using a tree means that the system can avoid storing a lot of translation information corresponding to the large areas of the address space that are not mapped to physical memory - On modern CPUs the page table generally has 4 levels of pointers, although we are starting to see some CPUs with 5 -- - Root address of the tree held in a register - To change address space during context switch: simply switch that register to the root of another tree ??? - The root address of the tree corresponding to the page table for the process currently executing is held in a control register - So changing address space during a context switch is easy: simply write in that register the address of the root of the target page table --- # Page Tables (2)
??? - This illustrates a typical 4-level page table - The address of the root page is held in a control register on the CPU; on Intel x86-64 it's %cr3 - The root node represents the 4th level of the page table, and the entries it contains reference pages of the 3rd level - Entries at the 3rd level reference pages of the 2nd level, and their entries reference pages of the 1st level - Finally, entries at the 1st level reference the physical pages holding the data accessed by the CPU
??? - As I mentioned, each translation page contains pointers to the next level - The size of a page is 4KB, so there is enough space for 512 pointers - Each pointer may be either present, meaning it corresponds to a range of virtual address space that is mapped, and its value refers to a page at the lower level - Or absent, meaning it corresponds to a range of the virtual address space that is not mapped - Note that all pointers in translation pages refer to physical addresses --- # Page Tables (2)
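As a rough mental model of this structure (an illustrative sketch, not actual kernel code; the names `pt_page` and `PT_PRESENT` are made up), each translation page can be seen as an array of 512 64-bit entries whose present entries point one level down:

```c
#include <stdint.h>

#define ENTRIES_PER_PAGE 512          /* 4096-byte page / 8 bytes per entry */
#define PT_PRESENT       (1ULL << 0)  /* this entry references a mapped range */

/* One 4 KB translation page: 512 entries, each holding the physical page
 * index of the next-level page (or of the data page at level 1) plus flags */
struct pt_page {
    uint64_t entry[ENTRIES_PER_PAGE];
};

/* Four levels of such pages form the tree; on x86-64 the physical address of
 * the level-4 page is what gets loaded into %cr3, so switching address space
 * on a context switch only requires rewriting that single register. */
```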
??? - When a page table is installed and the CPU issues loads and stores, the page table is walked transparently by the MMU to perform the translation - For example if the CPU issues a load at address x, the MMU follows the path of pointers indexed by x, and the data read by this load operation will be the bytes hit in the data page at the end of the walk
??? - Let's see how the entries in the page table are indexed during a page table walk - When the CPU issues a load or a store, it targets a particular virtual address - We have an example 64-bit virtual address depicted in binary on the slide
??? - As mentioned, with x86-64 the %cr3 control register holds the address of the root of the page table, which is the 4th level
??? - The bits 39 to 47 of the target virtual address are used to index an entry in the root page, which gives us the pointer to the 3rd level translation page --- # Page Tables Walk
??? - Then the bits 30 to 38 of the target virtual address index an entry in the 3rd level translation page, which gives us the pointer to the 2nd level page --- # Page Tables Walk
??? - The bits 21 to 29 of the virtual address index an entry in the 2nd level translation page, giving the pointer to the 1st level page --- # Page Tables Walk
??? - The bits 12 to 20 of the virtual address index an entry in the 1st level translation page, giving the pointer to the physical page the CPU wants to access --- # Page Tables Walk
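To make the indexing concrete, here is a small user-space sketch that slices a virtual address into the same five fields the MMU uses during the walk (the example address is made up):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t vaddr = 0x00007f1234567abcULL;  /* made-up example address */

    unsigned l4  = (vaddr >> 39) & 0x1ff;    /* bits 47-39: 4th level index */
    unsigned l3  = (vaddr >> 30) & 0x1ff;    /* bits 38-30: 3rd level index */
    unsigned l2  = (vaddr >> 21) & 0x1ff;    /* bits 29-21: 2nd level index */
    unsigned l1  = (vaddr >> 12) & 0x1ff;    /* bits 20-12: 1st level index */
    unsigned off = vaddr & 0xfff;            /* bits 11-0: byte within page */

    printf("L4=%u L3=%u L2=%u L1=%u offset=0x%x\n", l4, l3, l2, l1, off);
    return 0;
}
```

Each index is 9 bits wide (0-511), matching the 512 entries held by each translation page.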
??? - And finally the last bits of the virtual address, bits 0 to 11, index a byte within that physical page --- # Page Table Entries (x86-64) - Each page of the page table holds 512 64-bit entries - Lower levels referenced by their page index within the supported 48-bit physical address space
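The sketch below illustrates this entry layout; real x86-64 entries carry more flag bits than shown, and the helper names are made up for this example:

```c
#include <stdint.h>

#define PTE_PRESENT (1ULL << 0)             /* the range is mapped */
#define PTE_WRITE   (1ULL << 1)             /* writable (otherwise read-only) */
#define PTE_USER    (1ULL << 2)             /* user-accessible (else supervisor only) */
#define PTE_FRAME   0x0000fffffffff000ULL   /* physical page index, bits 47-12 */

/* Physical address of the page referenced by an entry */
static inline uint64_t pte_to_phys(uint64_t pte)
{
    return pte & PTE_FRAME;
}

/* An access faults if the entry is absent, if it is a write through a
 * read-only entry, or a user-mode access to a supervisor-only entry */
static inline int pte_allows(uint64_t pte, int is_write, int is_user)
{
    if (!(pte & PTE_PRESENT)) return 0;
    if (is_write && !(pte & PTE_WRITE)) return 0;
    if (is_user && !(pte & PTE_USER)) return 0;
    return 1;
}
```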
??? - As I mentioned, each page contains 512 entries, each 64 bits in size - We don't need the full 64 bits of an entry to index the lower levels though - First, addresses on most 64-bit processors are actually only 48 bits wide - Second, the pointers in the translation pages do not hold full physical addresses, but rather physical page indexes - Obviously there are fewer physical pages than there are physical addresses, so we need fewer bits to index pages - Overall we only need 36 bits of each entry for the index, meaning that we can use the additional bits to hold metadata about the range of the virtual address space referenced by each entry in translation pages -- - Also contain **metadata** regarding the memory referenced: - Present, read/write, user/supervisor - Control whether accesses to the referenced memory succeed or fault (exception) - Used to implement **memory protection**, but also CoW address space transfer, swap, etc. ??? - This metadata is used to indicate if the range of address space concerned is actually mapped, if it is accessible in read and/or write mode, and if it is only accessible in supervisor mode or in user mode - This makes it possible to control memory accesses: if an address in the virtual address space is not present or if the access in question is denied, for example because it's read only and accessed in write mode, the CPU will trigger a page fault exception - This is crucial to the security of the system, and it is used to implement memory protection, but also things like swap and the on-demand duplication of the address space upon fork --- # The OS and the Address Space - User/supervisor memory protection allows the kernel to be mapped in the address space of each process - **No need to switch page table upon system calls**
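As a small illustration (assuming the usual Linux x86-64 split with 48-bit canonical addresses), the kernel half of every address space starts at the same fixed virtual address, so a one-line check tells which half an address belongs to:

```c
#include <stdbool.h>
#include <stdint.h>

/* With 48-bit canonical addresses, user mappings live in the lower half and
 * the kernel occupies the upper half, starting at 0xffff800000000000 */
static inline bool is_kernel_address(uint64_t vaddr)
{
    return vaddr >= 0xffff800000000000ULL;
}
```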
??? - With the page table entry bit that allows part of the address space to be made accessible in supervisor mode only, we can actually have the kernel live in the same address space as processes - So with Linux, the kernel is mapped in the top part of the address space of each process as illustrated on the slide - Every page table is configured so that this area is accessible in supervisor mode only, i.e. by the kernel only - This has an important advantage: there is no need to switch page tables upon system calls - Switching page tables is very costly because it involves a flush of the translation cache, the translation lookaside buffer -- - Mechanisms enforcing the main memory protection security invariants: - Applications isolated from each other by having **different page tables** - Kernel isolated from apps with **supervisor protection in each address space** ??? - To sum up, the mechanisms enforcing the main memory security invariants in the system are twofold - Processes are isolated from each other by having different page tables defining different, non-overlapping address spaces - And the kernel is isolated from processes through the supervisor-only access bit in each page table --- # Kernel Address Space
- Some important areas: - **dirmap**: Direct linear mapping of all physical memory - **vmalloc area**: serves certain kernel dynamic memory allocation (`vmalloc`) requests - **Kernel code and static memory** (`.data`, `.bss`, etc.): similar to a traditional program - **Modules**: pieces of kernel code that can be loaded/unloaded dynamically without rebooting ??? - If we zoom in on the part of the address space reserved for the kernel, it is made of many different areas - I don't have time to go over each area here, but here are the main ones - You have the dirmap, which is a direct mapping of all physical memory - It is useful when the kernel wants to access physical memory directly, for example when setting up page tables, or when allocating memory that needs to be contiguous in physical memory - The vmalloc area is basically the kernel heap, containing memory allocated dynamically - Like a standard program, the kernel also has a static memory part with its code and static data; these were mapped from the kernel's binary at boot time - Finally, Linux supports the dynamic loading and unloading of kernel code at runtime, in the form of kernel modules - These are loaded in a specific area of the kernel part of the address space --- # Memory Allocation in the Kernel .leftlargecol[ When the kernel needs some memory the following needs to happen: ] .rightsmallcol[
] ??? - When the kernel needs to allocate memory for itself or for an application, the following needs to happen --- # Memory Allocation in the Kernel .leftlargecol[ When the kernel needs some memory the following needs to happen: - Reserve some free physical memory ] .rightsmallcol[
] ??? - The kernel first reserves some physical memory, enough space to satisfy the allocation request; let's assume here it's 2 pages, and they don't need to be contiguous in physical memory --- # Memory Allocation in the Kernel .leftlargecol[ When the kernel needs some memory the following needs to happen: - Reserve some free physical memory - If it's not already mapped: - Find a free range of virtual memory ] .rightsmallcol[
] ??? - The kernel also needs to find a free range of virtual memory, either within its own area of the address space if the allocation request originates from the kernel, or within the process-accessible part of the address space if it's the process requesting memory --- # Memory Allocation in the Kernel .leftlargecol[ When the kernel needs some memory the following needs to happen: - Reserve some free physical memory - If it's not already mapped: - Find a free range of virtual memory - Create page table entries corresponding to that range of virtual memory ] .rightsmallcol[
] ??? - The kernel then creates the page table entries corresponding to the newly created virtual pages --- # Memory Allocation in the Kernel .leftlargecol[ When the kernel needs some memory the following needs to happen: - Reserve some free physical memory - If it's not already mapped: - Find a free range of virtual memory - Create page table entries corresponding to that range of virtual memory - Map the virtual memory to the physical pages (possibly later on demand) ] .rightsmallcol[
] ??? - And the page table entries are set up to point to the physical pages that were previously reserved - Note that this mapping will in most cases not be done at allocation time, but rather later when the CPU accesses the virtual pages for the first time - This is achieved by unsetting the present bit in the page table entries: the first access will trigger a page fault and at that time the kernel can perform the mapping and restart the memory access --- # Memory Allocation in the Kernel .leftlargecol[ When the kernel needs some memory the following needs to happen: - Reserve some free physical memory - If it's not already mapped: - Find a free range of virtual memory - Create page table entries corresponding to that range of virtual memory - Map the virtual memory to the physical pages (possibly later on demand) - Return a pointer to the virtual area ] .rightsmallcol[
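An illustrative kernel-side sketch of these steps using `alloc_page()` and `vmap()`; it is a simplified view under the assumption that the pages are mapped eagerly rather than on demand:

```c
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

static void *alloc_two_pages_mapped(struct page *pages[2])
{
    void *va;

    /* 1. Reserve free physical memory: two pages, not necessarily contiguous */
    pages[0] = alloc_page(GFP_KERNEL);
    pages[1] = alloc_page(GFP_KERNEL);
    if (!pages[0] || !pages[1])
        goto err;

    /* 2-3. Find a free range of kernel virtual memory, create the page table
     * entries, and map them to the two physical pages: vmap() does all this */
    va = vmap(pages, 2, VM_MAP, PAGE_KERNEL);
    if (!va)
        goto err;

    /* 4. Return a pointer to the virtual area */
    return va;

err:
    if (pages[0])
        __free_page(pages[0]);
    if (pages[1])
        __free_page(pages[1]);
    return NULL;
}
```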
] ??? - The memory allocation request finally returns a pointer to the newly allocated virtual area --- # Memory Allocation in the Kernel .rightcol[
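From user space, the equivalent request is an `mmap()` call; the short sketch below shows roughly what a `malloc` implementation does under the hood for large requests (sizes and usage are made up for illustration):

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 2 * 4096;  /* two pages */

    /* Anonymous private mapping: the kernel reserves virtual memory now and
     * maps physical pages lazily, on the first access to each page */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    buf[0] = 'x';  /* first touch triggers the page fault that maps the page */
    printf("mapped %zu bytes at %p\n", len, (void *)buf);

    munmap(buf, len);
    return 0;
}
```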
] ??? - All memory allocations are served by the kernel - If they come from user space, they go through the mmap() system call - Note that malloc is implemented in user space by the libc - It calls mmap under the hood to get a large area of virtual memory, then splits it into smaller buffers to serve allocation requests --- # `kmalloc` and `vmalloc` .leftcol[ **2 main interfaces:** - `kmalloc`: small size, fast and customised allocations - Usable in contexts where code cannot sleep (interrupt) - Memory returned already mapped (dirmap) - `vmalloc`: large allocations (page granularity) - Page table modified, memory mapped on demand ] .rightcol[
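A hedged sketch of how the two interfaces are typically used from kernel code (error handling kept minimal):

```c
#include <linux/errno.h>
#include <linux/slab.h>     /* kmalloc, kfree */
#include <linux/vmalloc.h>  /* vmalloc, vfree */

static int example_allocs(void)
{
    /* Small, fast allocation; GFP_KERNEL may sleep, so code running in
     * interrupt context would use GFP_ATOMIC instead */
    char *small = kmalloc(128, GFP_KERNEL);

    /* Large allocation at page granularity, only virtually contiguous */
    char *large = vmalloc(4 * 1024 * 1024);

    if (!small || !large) {
        kfree(small);   /* both helpers accept NULL */
        vfree(large);
        return -ENOMEM;
    }

    /* ... use the buffers ... */

    kfree(small);
    vfree(large);
    return 0;
}
```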
] ??? - The kernel itself has two main dynamic memory allocation interfaces - The first is kmalloc, which is used for fast, small-sized allocations - It is usable in contexts where kernel code cannot sleep, for example when handling an interrupt - It also returns memory that is already mapped and always physically contiguous - The other interface is vmalloc, which is used for larger allocations at page granularity - It is slower as it requires updating the page table --- # SLAB Allocator .leftcol[ **SLAB layer:** - System of caches trying to reuse same-size allocations as much as possible - Good for performance and to reduce fragmentation - Useful when a high number of data structures of the same type are allocated frequently ] .rightcol[
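Here is a sketch of kernel code creating its own dedicated SLAB cache for a frequently allocated structure; the structure and names are made up for this example:

```c
#include <linux/errno.h>
#include <linux/slab.h>

struct my_object {
    int  id;
    char payload[120];
};

static struct kmem_cache *my_cache;

static int my_cache_init(void)
{
    /* A cache dedicated to objects of this exact size */
    my_cache = kmem_cache_create("my_object_cache", sizeof(struct my_object),
                                 0, SLAB_HWCACHE_ALIGN, NULL);
    return my_cache ? 0 : -ENOMEM;
}

static void my_cache_use(void)
{
    struct my_object *obj = kmem_cache_alloc(my_cache, GFP_KERNEL);

    if (obj) {
        obj->id = 42;
        kmem_cache_free(my_cache, obj);  /* returned to the cache for reuse */
    }
}

static void my_cache_exit(void)
{
    kmem_cache_destroy(my_cache);
}
```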
] ??? - kmalloc relies on the SLAB layer, which is a system of caches trying to reuse same-size allocations as much as possible - This is good for speed, but it also reduces fragmentation - It is useful when many data structures of the same type are allocated frequently - Kernel code can also directly create its own SLAB caches without going through kmalloc --- # Physical Page Allocator .leftcol[ **Page allocator: buddy system** - Allocates physical memory at the granularity of pages - Maintains lists of blocks of same-size power-of-two contiguous free pages to limit fragmentation - Blocks can be split (into buddies) and merged as needed ] .rightcol[
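A minimal sketch of asking the buddy allocator directly for a power-of-two block of physically contiguous pages (the order value is arbitrary here):

```c
#include <linux/gfp.h>

static void buddy_example(void)
{
    /* Order 2 means 2^2 = 4 contiguous physical pages, i.e. 16 KB */
    struct page *block = alloc_pages(GFP_KERNEL, 2);

    if (block)
        __free_pages(block, 2);  /* freed block can merge back with its buddy */
}
```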
] ??? - To reserve physical memory, all allocation methods rely on the buddy system, also called the frame allocator - Here a frame means a physical page - This is the granularity at which the buddy system allocates physical memory - The buddy system maintains lists of blocks of same-size sets of contiguous free physical pages, with the goal of limiting fragmentation - Large blocks can be split and merged as needed --- # Physical Page Allocator .leftcol[ **Page allocator: buddy system** - Allocates physical memory at the granularity of pages - Maintains lists of blocks of same-size power-of-two contiguous free pages to limit fragmentation - Blocks can be split (into buddies) and merged as needed ] .rightcol[
] ??? - You can see the way the buddy system works here, with lists linking blocks of 1, 2, 4, 8, etc. contiguous physical pages --- # Summary - Memory management is a key feature provided by OSes - Tied to key OS security invariants: - A process's memory must not be accessible from other processes - Achieved by having **disjoint virtual address spaces** - The kernel memory must not be accessible from processes - Achieved with **user/supervisor memory protection** - Kernel also handles **memory allocation** for itself and applications ??? - And that's it for memory management - It's a key feature of every operating system - The relevant security invariants here are first that a process's memory should not be accessible from other processes - This is achieved by establishing disjoint address spaces for each process, with 1 page table per process - Second, the kernel memory that lives in the processes' address spaces should not be directly accessible by processes - This is achieved with supervisor memory protection, using the supervisor bit in the page table entries - We also saw how the kernel handles memory allocation both for itself and for processes