COMP26020p1 - Case Study: Operating System Kernel

class: center, middle

### COMP26020 Programming Languages and Paradigms Part 1: C Programming
***
# Case Study: Operating System Kernel

---
# System Call and Context Switch

- C is the default language to write OS kernels
- Let's study the implementation of:
  - **System calls handling**
  - **Context switch** (replacement of a task running on the CPU by another one)

- Theory first, then implementation in the Linux kernel

???

- Because of its proximity with the hardware and its good performance, C is heavily used to implement OS kernels
- Let's have a look at how two of the key OS operations, **system call handling** and **context switches**, are implemented
- We already talked about invoking system calls from user spacein the previous video regarding the C standard library
- Here we'll get an overview of how system calls are handled on the kernel side
- Regarding context switches, it corresponds to the action of replacing a task running on the CPU by another one
- We'll first study the general principles regarding these two operations, then we'll have a look about how a popular operating system, the Linux kernel, implements them

---
class: center, middle, inverse
# System Calls in a Nutshell

???

- Let's see what happens when a system call is made

---
# System Call in a Nutshell
<div style="text-align:center"><img src="include/syscall-detailed1.svg" width=800 /></div>

???
- Let's assume that a task is executing on the processor
- It is a normal user space program which means the CPU runs in a mode of execution named **user mode**
- In user mode the CPU cannot execute privileged operations such as installing a new page table or shutting itself down
- This is for security reasons: only the kernel should be able to perform these tasks
- So at a given point during this program's execution, the **CPU state** for this task consists in the values present in the processor registers.
- We have general purpose registers used for computations, but also special-purpose ones such as the stack and instruction pointers.
- The stack pointer references a contiguous area of memory named the stack, used to store various things such as local variables.
- The instruction pointer references the instruction currently executed by the CPU, located inside the program code in memory.

---
# System Call in a Nutshell
<div style="text-align:center"><img src="include/syscall-detailed2.svg" width=800 /></div>

???

- When the `syscall` instruction is issued, the CPU switches to a mode of execution named **supervisor mode**.
- In this mode all CPU functions are enabled, including privileged operations.
- The CPU also jumps to a predefined handler in the kernel, and switches to a kernel stack.

---
# System Call in a Nutshell
<div style="text-align:center"><img src="include/syscall-detailed3.svg" width=800 /></div>

???

- The kernel starts by saving the application state by pushing all registers on the kernel stack.
- We'll restore it later when the application resumes: doing so the interruption that corresponds to handling the system call will have been completely transparent to the application's execution.
---
# System Call in a Nutshell
<div style="text-align:center"><img src="include/syscall-detailed4.svg" width=800 /></div>

???
- At that stage the kernel determines which system call is invoked, and starts processing it.

---
# System Call in a Nutshell
<div style="text-align:center"><img src="include/syscall-detailed5.svg" width=800 /></div>

???

- And when it is done, the kernel restores the task state by popping all the registers from the kernel stack.

---
# System Call in a Nutshell
<div style="text-align:center"><img src="include/syscall-detailed6.svg" width=800 /></div>

???

- And then user space execution resumes: the CPU switches to user mode and resumes the task's execution at the next instruction following the `syscall`.

---
class: center, middle, inverse
# Context Switches in a Nutshell

???

- And now let's see how things work for a context switch

---
# Context Switch in a Nutshell

???

- Same as with the previous example, let's assume a task running on the CPU, we'll call it task A
- We also assume a task B that is currently not running
- Each task has its code and stack
- We also have the kernel code and a kernel stack for each task A and B

---
# Context Switch in a Nutshell

???

- Context switching involves privileged operations and as such it can only be done by the kernel, so assume that at some point there is a trap to the kernel, following for example a system call
- The user CPU state of task A is saved on the corresponding kernel stack, as we've seen previously
- Because it invoked the kernel through the system call, task A is here defined as the **current task**.
- Another thign we say is that the kernel runs **in the context of task A**.

---
# Context Switch in a Nutshell

???
- Now let's assume that at this point the scheduler decides that task B should replace A on the CPU, for example because B is ready to run and has a higher priority than A
- The kernel needs to context switch from A to B
- The context switch operations starts by saving the state of the task getting scheduled out of the CPU, pushing all registers value on its stack.
- Note that what is saved here is the state of kernel execution for A, which is different from the state of user execution that was saved upon system call or interrupt

---
# Context Switch in a Nutshell

???
- Next the state of kernel execution for the task B getting scheduled in is restored from its own stack
- That state was saved there the last time B was scheduled out of the CPU.

---

# Context Switch in a Nutshell

???

- B is now the current task, and its execution can resume, in kernel mode first

---

# Context Switch in a Nutshell

???

- And finally the kernel returns to user space, and B resumes

---

class: middle, inverse, center
# System Call & Context Switch Implementations in a Real OS 🐧

???

- So we have seen from a theoretical perspective how the system call handling and context switch operation look like
- Now let's have a look at some code
---
# The Linux Kernel

.leftcol[
- **Most widespread OS kernel:**
  - Dominates the mobile (Android), server, supercomputers markets
  - Widespread in embedded systems
]

.rightcol[
<div style="text-align:center"><img src="include/Tux.png" width=220 /></div>
]

???

- We'll study the Linux kernel code
- It is arguably the most widespread operating system today, dominating many markets such has server computers, supercomputers, and of course the mobile market as Android uses Linux as kernel
- Linux also runs on a plethora of embedded devices

.leftcol[
- **Open Source**
  - Majority of contributors are from the industry
  - Will browse Linux's code using Elixir: https://elixir.bootlin.com/
]

.rightcol[
<div style="text-align:center"><img src="include/linux-contributors.png" width=450 /></div>
.center[Linux v6.6 top contributors, source: https://lwn.net/Articles/948970/]
]

???

- It's open source, and today the majority of contributors to the Linux kernel code are industry players
- To browse the code we will use a source code cross referencer called Elixir
- It is very practical to explore the sources of Linux, for example to do things like quickly search for a function definition
- A suggestion here is that you keep an Elixir window openned while watching this video, so that you can navigate the code by yourself alongside my comments
- One last things before we start digging into the code
- The Linux kernel is probably one of the most complex pieces of software out there
- Don't be intimidated with the code you'll be seeing on the screen, most of it you don't need to understand
- We will only concentrate on the key operations that I have highlighted in the theoretical examples we just seen
- I just want you to get a good high level idea of how these things work, and how they can be implemented

---
# Context Switch & System Call in Linux

- **Context switch & system call handling can only be done by the kernel**
- Example with `nanosleep` system call (system call + context switch)
  - System call invoked by the standard C library when the program calls `sleep`

???
- So we are going to study the kernel code implementing system call handling and context switching
- These are privileged operations that cannot be performed by a standard user program, they need to be done by the kernel
- The particular scenario we are going to look at is the execution in the kernel of the `nanosleep` system call
- This is the system call that is invoked by the standard C library when a program calls the `sleep` function

???

- So imagine the following scenario
- We have a user program, called task A that runs, executing code on the CPU
- At some point it invokes the `nanosleep` system call
- The kernel then takes over and start to run, processing this system call
- Seeign that task A wants to be put to sleep it finds another task that is currently not running but ready to run on the CPU, task B
- It performs the context switch, removes A from the CPU and prepares B for execution
- And then the kernel returns to user space: task B starts to run, it resume at the stage it was last interrupted
- Let's look at the code executed by the kernel for this scenario, that contains both the handling of the `nanosleep` system call and the context switching between tasks A and B

---
# System Call Handling in Linux

- System call entry point is in [entry_SYSCALL_64](https://elixir.free-electrons.com/linux/v6.10.3/source/arch/x86/entry/entry_64.S#L87) in `arch/x86/entry/entry_64.S`:

```xml
SYM_CODE_START(entry_SYSCALL_64)
    // ...
    movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)        // save user stack pointer
    // ...
    movq	PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp // load kernel stack pointer
    // ...
    PUSH_AND_CLEAR_REGS rax=$-ENOSYS                       // save user registers
    movq	%rsp, %rdi                                     // prepare do_syscall_64 arguments
    movslq	%eax, %rsi
    // ...
    call	do_syscall_64                                  // switch to c Code
```

???

- When the `syscall` instruction is executed by a task wishing to invoke the `nanosleep` system call, the CPU switches to supervisor mode, and jumps to the system call entry point.
- On Linux the kernel defines the system call entry point as the label `entry_SYSCALL_64` in an assembly file.
- The assembly code starts by saving the user stack pointer in a scratch space and switching the stack pointer register (`%rsp`) to a kernel stack
- Next, the values of many CPU registers that correspond to task A's **user state** when it invoked `syscall` are pushed on the kernel stack.
- The user CPU state needs to be saved so that it can be restored later when the application resumes: by doing so the interruption that corresponds to handling the system call will be completely transparent to the application's execution.
- Most of that state saving job is realised with the macro `PUSH_AND_CLEAR_REGS`which expands among other things into a series of push operations for the general purpose registers
---------------------------
- After that the system call entry point is going to jump to a function in C code.
- This function takes 2 arguments:
  - The first parameter is a pointer to a data structure of type `pt_regs` containing the user state of the CPU that was just pushed to the stack
  - The second is the identifier for the system call being invoked: in the case of `nanosleep` it will be the number 35
---------------------------
- We need to prepare these parameters
- Remember the Linux calling convention: the first parameter of a function is passed in the `%rdi` register, and the second in `%rsi`.
- So here, the value of the stack pointer, which is a pointer to the task user state `pt_regs` that was just pushed to the stack, is written in `%rdi`.
- Also remember that the system call convention states that the system call identifier should be set in `%rax`: its lower 32 bits (aliased as `%eax`) are moved to `%rsi`.
---------------------------
- The system call entry point finally jumps to C code by calling the `do_syscall_64` function

---
# System Call Handling in Linux

[`do_syscall_64`](https://elixir.free-electrons.com/linux/v6.10.3/source/arch/x86/entry/common.c#L76) gets register state and system call ID as parameters:

```c
__visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr);
```

???
- As mentioned earlier, `do_syscall_64` takes the registers' state and system call identifier as parameters

Then calls [do_syscall_x64](https://elixir.free-electrons.com/linux/v6.10.3/source/arch/x86/entry/common.c#L42) which itself calls [x64_sys_call](https://elixir.free-electrons.com/linux/v6.10.3/source/arch/x86/entry/syscall_64.c#L30):

```c
long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
{
    switch (nr) {
    #include <asm/syscalls_64.h>
    default: return __x64_sys_ni_syscall(regs);
    }
};
```

???

- It calls `do_syscall_x64` which itself calls `x64_sys_call`.
- This function is basically a big switch case on the identifier of the system call that was invoked by the current task.
- The code embedded in the middle with `#include` is a jump table mapping each system call identifier to the corresponding C function handling that system call in the kernel.
- The included code is automatically generated at compile time from the Linux system call table.

Syscall jump table included is generated from the [system call table](https://elixir.free-electrons.com/linux/v6.10.3/source/arch/x86/entry/syscalls/syscall_64.tbl), e.g. for `nanosleep` the entry is:

```
35	common	nanosleep		sys_nanosleep
```

???

- For example for the `nanosleep` system call we have the following entry
- So in the case of that system call being invoked, the switch in `x64_sys_call` will jump to the function `sys_nanosleep`.

---
# System Call Handling in Linux

[`sys_nanosleep`](https://elixir.free-electrons.com/linux/v6.10.3/source/kernel/time/hrtimer.c#L2104) calls [`hrtimer_nanosleep`](https://elixir.free-electrons.com/linux/v6.10.3/source/kernel/time/hrtimer.c#L2069), itself calling [`do_nanosleep`](https://elixir.free-electrons.com/linux/v6.10.3/source/kernel/time/hrtimer.c#L2021) that contains the main logic for `nanosleep`:

```c
static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
{
	struct restart_block *restart;

do {
		set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
		hrtimer_sleeper_start_expires(t, mode);

if (likely(t->task))
*			schedule();

hrtimer_cancel(&t->timer);
		mode = HRTIMER_MODE_ABS;

} while (t->task && !signal_pending(current));

/* ... */
```

???

- `sys_nanosleep` calls `hrtimer_nanosleep`] which in turn calls `do_nanosleep` that contains the core logic of the sleep operation
- The function `schedule` is called in a loop as long as the nanosleep delay is not spent.
- That function is implemented by the Linux scheduler, and will effectively set the current task to sleep and trigger a context switch to another task ready to run.

---
# System Call Handling in Linux

[`schedule`](https://elixir.free-electrons.com/linux/v6.10.3/source/kernel/sched/core.c#L6827) calls [`__schedule_loop`](https://elixir.free-electrons.com/linux/v6.10.3/source/kernel/sched/core.c#L6818), which itself calls the main scheduler function: [`__schedule`](https://elixir.free-electrons.com/linux/v6.10.3/source/kernel/sched/core.c#L6614):

```c
static void __sched notrace __schedule(unsigned int sched_mode) {
    /* ... */
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);
    prev = rq->curr;         // get the task_struct for the current task
    /* ... */
    local_irq_disable();     // disable interrupts
    /* ... */
    next = pick_next_task(rq, prev, &rf); // get the task_struct of the next task to run
    /* ... */
    if (likely(prev != next)) {
        /* ... */
        rq = context_switch(rq, prev, next, &rf); // perform context switch from prev to next
    /* ... */
```

???

- `schedule` calls `__schedule_loop`, which in turn calls the main scheduler function: `__schedule`.
- This function first obtain a data structure of type `task_struct` representing the current task into a variable named `prev`
- Here `cpu` identifies the core executing the code, and `rq` is its runqueue: the list of tasks ready to run on that core.
- Next the scheduler gets the `task_struct` of the next task that should run on this core, into `next`

[`local_irq_disable`](https://elixir.free-electrons.com/linux/v6.10.3/source/include/linux/irqflags.h#L200) ends up calling [`native_irq_disable`](https://github.com/torvalds/linux/blob/v6.10/arch/x86/include/asm/irqflags.h#L35):
```c
static __always_inline void native_irq_disable(void) { asm volatile("cli": : :"memory"); }
```

???

- The `__schedule` function disables interrupts with `local_irq_disable`, which ends up calling `native_irq_disable`. For x86-64 it looks like this.
- Here inline assembly is used to call the `cli` instruction that effectively disables interrupts.
- Let's go back to `__schedule`, at that stage the kernel checks if the next task is different from the previous one
- If `prev` and `next` represent the same task, there is no need for a context switch.
- If the tasks are different, then the kernel needs to context switch to the next task.
- To that aim the `context_switch` function is called.

---
# Context Switch in Linux

- A context switch swaps the CPU state from the execution context of the **previous task** to that of the **next task**
- State includes, among others:
    - General purpose registers
    - Stack pointer (each task has its own kernel stack)
    - Control register holding the root of the page table, (each task has its own address space)

???

- The kernel is now going to perform the context switch, which consists in switching the content of many CPU registers from the state that corresponds to the execution of the previous task A to that of the next task B.
- Some examples of registers which value is switched are:
  - The general purpose registers.
  - The stack pointer (each task has its own kernel stack).
  - The register holding the root of the page table, because each task has its own address space.

---
# Context Switch in Linux

Done by the [`context_switch`](https://elixir.free-electrons.com/linux/v6.10.3/source/kernel/sched/core.c#L5348) function:

```c
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf) {

if (!next->mm) {
        /* switching to a kernel thread, no need to switch page tables ... */
    } else {
        /* ... */
        switch_mm_irqs_off(prev->active_mm, next->mm, next); // switch page tables
        /* ... */
    }
    /* ... */
    switch_to(prev, next, prev);  // switch registers including the stack pointer
    /* ... */
}
```

???
- The function `context_switch` takes as parameters the runqueue for the current core, pointers to the previous and next tasks, as well as to some flags. It starts by checking what type of task we are switching to
- If it is a kernel thread the context switch will not involve a page table switch because the memory corresponding to the kernel is mapped in the address space of all user programs
- If we are switching to a standard user task, the corresponding page table needs to be installed

[`switch_mm_irqs_off`](https://elixir.free-electrons.com/linux/v6.10.3/source/arch/x86/mm/tlb.c#L501) ends up calling [`native_write_cr3`](https://elixir.free-electrons.com/linux/v6.10.3/source/arch/x86/include/asm/special_insns.h#L52):

```c
static inline void native_write_cr3(unsigned long val) {
    asm volatile("mov %0,%%cr3": : "r" (val) : "memory");
}
```

???

- This is done with `switch_mm_irqs_off`,  which executes a bunch of code and in particular switches the content of the `%cr3` x86-64 register, that hodls the root of the page table, with `native_write_cr3`
- Here the inline assembly takes `val` (root of the page table of the next task) and use the `mov` instruction to write it in `%cr3`.
-------------
- When the page table is taken care of, `context_switch` calls `switch_to` to handle the rest of the CPU state that needs to be switched

---
# Context Switch in Linux

[`switch_to`](https://elixir.free-electrons.com/linux/v6.10.3/source/arch/x86/include/asm/switch_to.h#L49) calls [`__switch_to_asm`](https://elixir.free-electrons.com/linux/v6.10.3/source/arch/x86/entry/entry_64.S#L177) that performs the register switch:

```xml
SYM_FUNC_START(__switch_to_asm)
    pushq	%rbp                // save callee-saved registers representing the kernel
    pushq	%rbx                // execution state of the current task (i.e. the task being
    pushq	%r12                // scheduled out) by pushing them to the stack
    pushq	%r13
    pushq	%r14
    pushq	%r15
    movq	%rsp, TASK_threadsp(%rdi)    // save curent task's stack pointer

movq	TASK_threadsp(%rsi), %rsp    // load the next task's stack pointer
    /* ... */
    popq	%r15                // restore the callee-saved registers representing the kernel
    popq	%r14                // execution state of the task being scheduled in
    popq	%r13
    popq	%r12
    popq	%rbx
    popq	%rbp

jmp	__switch_to
```

???
- `switch_to` calls `__switch_to_asm` whcih is implemented in an assembly file and performs the general purpose register switch
- At the beginning of the function the stack pointer (register `%rsp`) still references the previous task's stack.
- The callee-saved registers representing the previous task's CPU kernel state are saved on that stack with a series of `push` instructions.
- Then the previous task's stack pointer is saved on the stack with a `movq` instruction, and the next task's stack pointer is installed with another `movq`.
- Now that the stack pointer references the stack of the next task, with a series of `pop` instructions the values of the callee-saved registers for that task are loaded.
- These values were previously saved on the next task's stack the last time it was scheduled out of the CPU, similarly to what we just described for the previous task.

- [`__switch_to`](https://elixir.free-electrons.com/linux/v6.10.3/source/arch/x86/kernel/process_64.c#L610) restores more state (e.g. floating point registers)
- Then the newly scheduled task takes the return path

???

- Finally, the code goes back to C by jumping to `__switch_to`, which takes care of saving and restoring more CPU state including segment registers, floating point registers, among others.
- When `__switch_to` returns, the next task is now the current task.
- In other words the kernel now executes in the context of the next task B: as you know the return path is defined by the values of return addresses stored on the stack so we take the next task's return path up to `schedule`, which itself returns to whatever part of the kernel code invoked it the last time the next task was scheduled out.

---

# Context Switch in Linux

- Assume the last time the current task was scheduled out it was following a `nanosleep` system call

???

- Let's assume that at this occasion the next task, task B, was scheduled out following an explicit system call to `nanosleep`.
- `schedule` returns back to `do_nanosleep`, which exits its waiting loop and returns to `hrtimer_nanosleep`, which itself return to the top level implementation of `nanosleep`, i.e. `sys_nanosleep`.

- Execution takes the return path up to the system call handler `entry_SYSCALL_64`, [right after the call to `do_syscall_64`](https://elixir.free-electrons.com/linux/v6.10.3/source/arch/x86/entry/entry_64.S#L121):

```xml
SYM_CODE_START(entry_SYSCALL_64)
    /* ... */
    call	do_syscall_64
    /* ... */
    POP_REGS pop_rdi=0      // Restore user space state for general purpose registers
    movq	%rsp, %rdi
    movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
    pushq	RSP-RDI(%rdi)	/* RSP */
    /* ... */
    popq	%rsp           // Restore user space stack pointer from scratch space
    sysretq                // return back to user space
```

???

- The system call handling code returns back from `do_syscall_64` in the assembly system call entry point.
- The user space state of the next task is then popped from the stack where it was saved the last time the task entered the kernel (when it called `nanosleep`)
- The macro `POP_REGS` expands into a series of `popq` instructions restoring the value of general purpose registers.
- The kernel stack pointer is then saved to a scratch space, and from that scratch space the user space stack pointer is popped into `%rsp`
- Finally, we exit kernel code to return to user space with the `sysretq` instruction

???

- Task B resumes its execution

---
# Summary
- We have seen how syscall handling and context switching are implemented
- It's made with a combination of C and assembly
  - Assembly is necessary for low level operations such as saving CPU state or switching tasks
  - It integrates very well with C

----
.center[Feedback form: https://bit.ly/2VD4kfx]
<div style="text-align:center"><img src="include/qr-code.png" height=150 /></div>

???
- And that's it!
- We have seen how the Linux kernel implements these two key operating system mechanisms that are system call handling and context switching
- It's done in majority in C, but with a fair bit of assembly, which integrates very well with C