class: center, middle

### Secure Computer Architecture and Systems

***

# Scheduling

???

- Hi everyone, in this video we talk about scheduling
- And we zoom in on the current scheduler used in Linux: the Completely Fair Scheduler

---

# What is OS Scheduling?

- **Determines which tasks run on CPU cores, when, and for how long**
- With Linux, **threads** are the schedulable entity
- Modern OSes support running multiple tasks: **multitasking**
  - \#tasks > #cores: multiplex tasks in time on cores
  - Gives the illusion that threads are executing in parallel

???

- The OS scheduler in the kernel is the entity that decides which task should run on CPU cores, when, and for how long
- Recall that with Linux the schedulable entities are threads; I'll use that term interchangeably with tasks here
- Most modern OSes support running multiple tasks at the same time: this is called multitasking
- When the number of tasks exceeds the number of cores, which is almost always the case, multiple tasks are multiplexed in time on the same CPU core
- This happens quickly enough to give the illusion that tasks are executing in parallel

--

- Desirable properties:
  - **Throughput**: run as many tasks as possible, minimise context switch overhead
  - **Responsiveness**: low overhead, respond quickly or in bounded time to events
  - **Fairness**: ensure tasks with the same priority get equal shares of CPU time
  - **Scalability**: scheduler's ability to support many tasks/cores

???
- Scheduling has several key objectives
- A first is maximising throughput: we want to run as many tasks as possible, and to minimise the overhead of running scheduling code such as context switches
- A second objective is responsiveness: when a task becomes ready to run, for example following an event such as the user pressing a button, we want it to be scheduled on the CPU as early as possible
- A third objective is fairness: if we have several tasks with the same priority in the system, each should get an equal share of CPU time
- Finally, a last objective is scalability: we want our scheduler to support a large number of tasks and cores

---

# Main Scheduler Classes

2 main classes of schedulers:

- **Cooperative scheduling**
  - A task does not stop running until it decides to do so (yields the CPU) or finishes
  - The OS cannot enforce fairness, e.g. with a malicious task that never yields
  - Denial of service: availability is compromised

???

- As you may know, there are two main classes of schedulers
- The first one is cooperative scheduling
- With a cooperative scheduler, a task won't stop running until it decides to yield the CPU with a system call, or terminates
- In that situation the OS simply cannot enforce fairness
- You could for example have a malicious task never yielding and monopolising all CPU cycles: that's a denial of service, and availability is compromised

--

- **Preemptive multitasking**
  - The OS can interrupt the execution of a task: **preemption**
  - E.g. after the task expires its timeslice, or if a task with higher priority is ready
  - Necessary when assuming adversarial workloads

???
- The second class of schedulers is preemptive multitasking
- With it, the kernel can interrupt the execution of a task: this is called preemption
- Preemption can happen in various situations, for example when a task exhausts the amount of CPU time it was allocated by the scheduler, or when a task with a higher priority becomes ready to run
- This is obviously more suitable from the security point of view with adversarial workloads

---

# Traditional Scheduling Algorithms

???

- You may have already learnt in the past about some of the traditional scheduling algorithms

--

- **First-Come, First-Served (FCFS)**
  - FIFO queue: run tasks in the order they become ready to run until they yield/finish

???

- First we have FCFS, first come, first served, in which tasks simply run in the order they become ready until they yield or finish

--

- **Round Robin**
  - Tasks run for a fixed amount of time (quantum) in a round-robin fashion

???

- We also have round robin, in which tasks run for a fixed amount of time, named a quantum, in a round-robin manner

--

- **Priority Scheduling**
  - Consider priorities when selecting which task to run

???

- With priority scheduling, the scheduler considers priorities to rank tasks and select the one that needs to run the most

--

- **Multilevel Feedback Queues**
  - Separate ready tasks into multiple queues; favour tasks with short CPU bursts and high I/O bursts
  - Etc.

???

- Another classic algorithm is multilevel feedback queues, in which tasks are separated into different priority queues and certain types of tasks are favoured

--

- Not really scalable to modern systems

???

- All these algorithms are great for learning about scheduling, but none of these approaches is really efficient with modern workloads on modern machines

---

## Modern Systems Need a Better Scheduler

???

- We need a better scheduler for modern systems for 2 main reasons

--

- Increasing number of **cores**

???
- First, the last two decades have seen a growing number of cores integrated in CPUs
- There is no doubt you are watching this video on a multicore computer

--

- **Mixed workloads**: interactive + batch (+ in some scenarios real-time)
  - **Batch/background tasks**, e.g. a video encoder:
    - CPU/memory-bound, long running
    - Throughput-oriented: want to run as much as possible to keep caches warm, but OK to be preempted
  - **Interactive tasks**, e.g. a text editor:
    - I/O bound (keyboard)
    - Does not necessarily need a lot of CPU but requires responsiveness
  - In some scenarios, **real-time tasks**: need deterministic scheduling latency

???

- Second, the workloads running on modern machines have become increasingly heterogeneous
- We have batch or background tasks that are CPU and memory bound and run for a long time
- Examples of such tasks are encoding videos or training ML models
- These jobs want to run as much as possible to keep caches warm, but they are also OK with being preempted
- Conversely, we also have interactive tasks, such as text editors or video games
- These are latency sensitive: they do not need a lot of CPU cycles but need to respond quickly to certain events such as mouse clicks or key presses
- Finally, in some scenarios we also have real-time tasks
- These need to be schedulable in bounded time so we can get guarantees such as "if the car motion sensor detects an object approaching, the brakes will be activated in less than 1 second"

---

# Batch vs. Interactive Tasks
???

- Here we have an illustration of two tasks running on a system
- One is a video encoder: it's a background, throughput-oriented job
- The other is an interactive, latency-sensitive application: a text editor
- If we had a scheduler ensuring complete fairness, we would get something like you see on top, with the same amount of CPU time given to each task
- But in reality, these tasks have different needs
- The text editor only needs to run for a few CPU cycles when the user presses a key on the keyboard
- The video encoder needs to run as much as possible but is OK to be preempted by the text editor
- So a scheduler realising what you see at the bottom of the illustration would be much more efficient

--

Goal: maintain good interactive performance **and** good background performance

???

- So overall, the goal of modern schedulers is to maintain good performance for both types of jobs, including when they run alongside each other

---

# Linux's Scheduler: CFS

- Linux's original scheduler did not scale well to many tasks and cores

???

- Linux's original scheduler dates from the 90s; it scaled poorly to high numbers of tasks and cores

--

- Replaced by the O(1) scheduler offering **constant time scheduling decisions**
  - Good scalability
  - Issues with interactive tasks

???

- It was replaced in 2003 by the O(1) scheduler, which had the ability to take constant time scheduling decisions, independently of the number of tasks and cores in the system
- It scaled well, but had some issues with latency-sensitive interactive tasks

--

- Current scheduler: **CFS, the Completely Fair Scheduler**
  - Introduced in Linux **2.6.23**

???
- The current scheduler in Linux is CFS, the Completely Fair Scheduler
- It was introduced in 2007

---

# CFS: The Core Idea

- First, assume we have only 1 core
- CFS defines a fixed-time interval during which each thread in the system must run at least once
  - Interval is divided into **timeslices**, 1 per thread
  - Thread's timeslice computed proportionally to its weight (~priority, AKA **nice** value)

???

- To understand how CFS works, first assume a CPU with only a single core for the sake of simplicity
- CFS defines a fixed time interval during which each thread in the system must run at least once
- This interval is split into timeslices, 1 per thread
- The length of the timeslice for each thread is proportional to the thread's weight, which is basically its priority, also named nice value in Linux

--

- As threads run, track their **vruntime**
  - Time the thread spent running divided by its weight

???

- The scheduler keeps track of how much time each thread spends on the CPU
- Divided by the thread's weight, it gives what is called the vruntime for that thread

--

- A running thread is preempted when:
  - It exceeds its timeslice and there are other threads ready to run
  - Another thread with a smaller vruntime wakes up

???

- CFS decides to preempt a thread running on the CPU when that thread exceeds its timeslice and there are other threads ready to run
- A running thread is also preempted when another thread with a smaller vruntime wakes up

--

- Threads ready to run are organised into a **runqueue**: a red-black tree

???

- To achieve that, CFS organises all threads that are ready to run into a runqueue, which is a red-black tree

---

# CFS Runqueue
- Threads sorted in the RBtree by increasing order of vruntime
- Insert/delete/rebalancing/recoloring: O(log n)

???

- It looks like this
- Each node in the tree is a thread ready to run
- They are sorted in the tree by increasing order of vruntime
- This way it is easy for CFS to pick the next task to run: it always corresponds to the leftmost node
- Using a red-black tree also allows CFS to have low performance overheads: operations like inserting and removing nodes, rebalancing and recoloring the tree are realised with O(log n) complexity

---

# CFS on Multiple Cores

- **1 runqueue (RBtree) per core**
  - Using per-core runqueues allows context switches to be done without the need for costly synchronisation between cores
  - E.g. locks to protect a hypothetical shared runqueue

???

- How does this work with multiple cores?
- CFS has 1 runqueue, 1 red-black tree, per core in the system
- This way the vast majority of scheduling decisions can be made locally, in a per-core manner
- They do not necessitate inter-core communication, which would require synchronisation with mechanisms such as locks that would slow things down too much

--

- Still, runqueues must be kept **balanced**
  - We don't want to end up in situations with e.g. 1 core having many high priority threads and another core with a single low priority one

???

- Still, the core runqueues must be kept balanced: we don't want to end up with 1 core having many high priority threads and the other cores with just a few low priority threads in their runqueues

--

- Relatively complex **load balancing** algorithm moving threads between runqueues based on:
  - Thread priorities
  - Number of threads in each runqueue
  - Topology (cache hierarchy, SMT, NUMA) to consider the cost of migrating a thread

???
- To address that problem, CFS implements a relatively complex load balancing algorithm that moves threads between the runqueues of different cores
- This algorithm considers thread priorities and the number of threads in each runqueue, but also the system's topology, including the cache hierarchy, hyperthreading, and NUMA nodes
- Indeed, there is a cost in migrating a thread away; for example, it will have to rebuild its local cache state on the target core

---

# Vruntime Explained

- `vruntime` increases as a task runs
  - Weighted by the task's **nice value**

???

- The vruntime value for each thread, which is used to rank them in the runqueue, is weighted by their nice value
- A thread's nice value is basically its priority, but inverted: the higher the nice value, the nicer the thread is, i.e. the more it is OK to let other threads run

--

- Formula ensures:
  - CPU-heavy threads increase their `vruntime` faster
  - I/O-bound/interactive threads get scheduled sooner

???

- The way the vruntime is computed for each thread ensures that long-running, CPU intensive threads will see their vruntime increase faster, giving more chances to run to I/O bound/interactive threads, which do not run much, when these become ready

---

# Preemption and Context Switches

- Per-CPU flag set to indicate if **preemption** is required:
  - When the currently running thread exceeds its timeslice
  - When a higher priority thread wakes up

???

- With Linux, preemption works as follows
- When the scheduler decides that the currently-running thread should be preempted, it sets a per-CPU flag to indicate that
- Remember this happens when the running thread exceeds its timeslice, or when a thread with a higher priority -- with CFS, a lower vruntime -- wakes up and becomes ready to run

--

- Flag is checked upon returning from interrupts (syscalls, exceptions, hardware interrupts)
  - If it is set, preemption happens

???
- The thread in question is not preempted yet
- Instead, the flag is checked by the kernel before returning to user space from interrupts such as system calls, exceptions, or hardware interrupts
- If the flag is set, preemption happens

--

- If the scheduler decides another thread should run, context switch:
  - Switches the address space to that of the target thread (page table switch)
  - Switches the CPU state (registers)

???

- When it does, a context switch is performed
- The kernel switches the CPU context to that of the target thread
- That includes a control register that indicates which page table should be used; in effect, this switches the address space to that of the target thread

---

# Security Aspects

- An attacker able to manipulate scheduling parameters can easily compromise the availability of the system
  - E.g. by setting a higher priority for malicious threads, or selecting a prioritised scheduling policy (e.g. real-time)

???

- There are some security aspects to scheduling
- An attacker taking control of a user space application can issue scheduling-related system calls and manipulate scheduling parameters
- They could set a higher priority or a prioritised scheduling policy for malicious threads, to compromise the availability of other applications

--

- Solutions:
  - Prevent untrusted users from accessing scheduling parameters
  - Use additional CPU isolation mechanisms: control groups
    - Allow setting CPU quotas/weights for threads
    - Scheduler-independent
    - More on control groups later in the unit

???
- The solution to that issue is to prevent untrusted users from accessing scheduling parameters
- We'll see later approaches to filtering the system calls that can be issued by an application that is not trusted
- There are also additional CPU isolation mechanisms, in particular a subsystem called control groups in Linux, which is used by containers
- Control groups allow setting CPU quotas for threads, which are independent of the scheduling policy used
- We'll talk about control groups in more detail when we cover virtualisation

---

# Summary

- Scheduling has various objectives: throughput, latency, scalability, fairness
- Hard to achieve them all given the variety of modern workloads
  - Latency-sensitive interactive vs. batch tasks
- CFS provides fair, scalable, and efficient scheduling tailored for general-purpose Linux systems
- Focuses on:
  - Fair CPU distribution
  - Low latency for interactive tasks
  - Simplicity and determinism via vruntime

???

- To sum up, scheduling in operating system kernels has various key objectives: throughput, latency, scalability and fairness
- It is difficult to achieve them all given the high heterogeneity of modern workloads, having both latency sensitive and background tasks
- The Completely Fair Scheduler tries to do a good job towards these objectives, given the general purpose nature of Linux
- It focuses on fair CPU distribution, low latency for interactive tasks, and maximised throughput for background jobs