Case Study: Operating System Kernel
You can access the slides 🖼️ for this lecture. All the code samples given here can be found online, alongside instructions on how to bring up the proper environment to build and execute them here. You can download the entire set of slides and lecture notes in PDF from the home page.
Here we discuss a third and last use case where using C is appropriate: the implementation of an operating system (OS) kernel.
System Calls and Context Switches
Because of its proximity to the hardware and its performance, C is heavily used in OS kernels. Here we will study how two key OS operations are implemented: system calls and context switches. We already talked about how user space programs invoke system calls in the previous lecture regarding the C standard library; here we'll get an overview of how they are handled on the kernel side. As for a context switch, it corresponds to the action of replacing the task currently running on the CPU with another one:
We'll see an overview of how the kernel manages that by studying the Linux kernel code. Linux is used absolutely everywhere: in all Android phones and most embedded or IoT systems, on the majority of server computers, as well as on supercomputers. Linux also happens to be the operating system currently running on my computer.
System Calls in a Nutshell
When a user space task is executing on the processor, the CPU state of this task consists of the values present in the processor registers:
We have general purpose registers used for computations, but also special-purpose ones such as the stack and instruction pointers. As its name suggests, the stack pointer points to the stack, a contiguous memory area used to store various things such as local variables. The instruction pointer (also known as the program counter) points to the instruction currently executed by the CPU, located inside the program code in memory. When running user programs the CPU is in user mode, i.e. for security reasons it cannot perform privileged operations whose execution is reserved to the kernel, such as installing a new page table or shutting down the CPU.
When the system call instruction is issued (e.g. syscall
for x86-64), the CPU switches to supervisor mode, also known as kernel mode, giving it the permission to execute every operation supported by the CPU, including privileged ones.
The CPU also jumps to a predefined system call handler in the kernel, which switches to the kernel stack:
The kernel starts by saving the application state by pushing all registers on the kernel stack:
Then the kernel determines which system call is invoked (e.g. on x86-64 by looking at the system call identifier in %rax
) and starts processing it.
When the system call has been processed, the kernel restores the task state by popping all the registers from the kernel stack:
And then user space execution resumes at the next instruction following the system call invocation instruction:
Context Switches in a Nutshell
Let's assume a single-core CPU with a task running (scheduled) on it; we'll call it task A. We also assume a task B that is currently sleeping, i.e. it is not scheduled. Each task has its code and data (including its stack) in memory. In memory we also have the kernel code, as well as a kernel stack for each of tasks A and B:
A context switch must be performed by the kernel, so let's assume that at some point there is a trap to the kernel, due either to a system call or to a hardware interrupt:
After the interrupt or system call is handled, in many OSes the scheduler will check if there is a ready task with a higher priority than the one currently running. Let's assume task B is ready and that its priority is higher than that of the currently running task A. The scheduler then decides to run B instead of A, and a context switch is needed.
The context switch operation starts by saving the state of the task getting scheduled out of the CPU, A, pushing all of its registers' values on its kernel stack:
Next, the state of the task getting scheduled in, B, is restored from its own kernel stack, where it was saved the last time that task was scheduled out:
Then the execution of B can resume, in kernel mode first:
Then the kernel returns to user space, and B resumes:
System Call and Context Switch Implementations in a Real OS
Let's explore how system call handling and context switching are implemented in a real world operating system: the Linux kernel. Its code is obviously relatively complex, and we will not discuss most of the code present in the files and functions we look at here: we will only focus on the key operations so that you get a good idea of how things work from a high-level point of view.
A great way to browse the Linux kernel source is Elixir. All the references to the code we make below will point to this website.
We are going to study the code path in the kernel when an application calls the sleep
function.
It is implemented by the C standard library which will invoke the nanosleep
system call to ask the kernel to put the calling program to sleep, which will force a context switch on the CPU to schedule in another task ready to run.
So it's an exact illustration of the scenario we have depicted earlier.
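To make this concrete, here is a minimal user space program following that exact path (a small sketch; depending on the libc version, the sleep wrapper may issue nanosleep or the closely related clock_nanosleep system call, which running the program under strace will show):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    puts("going to sleep");
    sleep(1);   /* C library wrapper, ends up in a nanosleep-family system call */
    puts("woke up");
    return 0;
}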
1. System Calls: Entering the Kernel
1.1. System Call Entry Point
We have seen in the previous lecture on the C standard library that a user space program invokes a system call on x86-64 with the syscall
instruction.
When it is executed by the C library wishing to invoke the nanosleep
system call, there is an exception and the user space program's execution is interrupted.
The CPU switches to supervisor mode, and jumps to a predefined piece of kernel code, the system call entry point.
When kernel code runs following a system call or any other form of interruption of user space code execution (e.g. a hardware interrupt received while a user task is running), we say that the kernel is running in the context of the task that was interrupted, and that task is named the current task.
The overwhelming majority of kernel code invocations are made in the context of a user space task.
The kernel defines the system call entry point as the label entry_SYSCALL_64
in an assembly file.
The assembly code starts by saving the user stack pointer in a scratch space and switching the stack pointer register (%rsp
) to a kernel stack, as sharing a stack between user and kernel space would be unsafe:
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2) /* save user stack pointer */
/* ... */
movq PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp /* load kernel stack pointer */
Next, through a series of stack push (pushq
instruction) operations, the values of many CPU registers that correspond to the current task's user state when it invoked syscall
are pushed on the kernel stack.
The user CPU state needs to be saved so that it can be restored later when the application resumes: this way, the interruption corresponding to the system call handling is completely transparent to the application's execution.
Most of that state saving job is realised with the macro PUSH_AND_CLEAR_REGS
which expands among other things into a series of pushq
for the general purpose registers:
PUSH_AND_CLEAR_REGS rax=$-ENOSYS
After that the system call entry point is going to jump to a function in C code.
This function takes 2 arguments.
The first parameter is a pointer to a data structure of type pt_regs
containing the state of the CPU that was just pushed to the stack (register's values at the time the application invoked syscall
).
The second is the identifier for the system call being invoked: in the case of nanosleep
it will be 35.
Remember the calling conventions: the first parameter of a function is passed in the %rdi
register, and the second in %rsi
.
movq %rsp, %rdi
movslq %eax, %rsi
Here, the value of the stack pointer, which is a pointer to the task user state pt_regs
that was just pushed to the stack, is written into %rdi
.
Also remember that the system call convention states that the system call identifier should be set in %rax
: its lower 32 bits (aliased as %eax
, there are only about 400 system calls, so 32 bits is more than enough to identify them all) are moved to %rsi
.
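To illustrate these conventions from the user space side, here is a sketch of how a program could invoke nanosleep directly with the raw syscall instruction instead of going through the C library (x86-64 Linux only; the helper name below is made up for illustration):

#include <stdio.h>
#include <time.h>

static long raw_nanosleep(const struct timespec *req, struct timespec *rem)
{
    long ret;
    /* System call convention: identifier (35 = nanosleep) in %rax,
       arguments in %rdi and %rsi; the return value comes back in %rax.
       The syscall instruction clobbers %rcx and %r11. */
    asm volatile("syscall"
                 : "=a"(ret)
                 : "a"(35L), "D"(req), "S"(rem)
                 : "rcx", "r11", "memory");
    return ret;
}

int main(void)
{
    struct timespec req = { .tv_sec = 1, .tv_nsec = 0 };
    raw_nanosleep(&req, NULL);
    puts("slept for one second");
    return 0;
}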
The system call entry point finally jumps to C code by calling the do_syscall_64
function:
call do_syscall_64
Assembly files can be assembled into object files, which can then be linked with object files resulting from the compilation of C code to produce an executable program. As seen above, symbols (e.g. function names, global variables) can be shared: C functions can be called from assembly, and C code can jump to assembly labels.
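As a small self-contained illustration (a sketch with a made-up function name; in a real project, like in the kernel, the assembly would live in its own file assembled separately), the following program defines a symbol in a top-level assembly block and calls it from C as a regular function (x86-64, GCC/Clang):

#include <stdio.h>

long add_asm(long a, long b);   /* implemented in assembly below */

/* Defines the global symbol add_asm, which follows the standard calling
   convention: arguments in %rdi and %rsi, result in %rax. */
__asm__(
    ".text\n"
    ".globl add_asm\n"
    "add_asm:\n"
    "    movq %rdi, %rax\n"
    "    addq %rsi, %rax\n"
    "    ret\n"
);

int main(void)
{
    printf("%ld\n", add_asm(2, 3));   /* C code calling the assembly symbol */
    return 0;
}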
1.2. System Call C Handler
As mentioned earlier, do_syscall_64
takes the registers' state and system call identifier as parameters:
__visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr);
It calls do_syscall_x64
which itself calls x64_sys_call
.
That function looks like this:
long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
{
        switch (nr) {
        #include <asm/syscalls_64.h>
        default: return __x64_sys_ni_syscall(regs);
        }
};
It's a switch statement on the identifier of the system call that was invoked by the current task.
The code embedded in the middle with #include
is a jump table mapping each system call identifier to the corresponding C function handling that system call in the kernel.
The included code is automatically generated at compile time from the Linux system call table.
For example for the nanosleep
system call we have the following entry:
35 common nanosleep sys_nanosleep
So in the case of that system call being invoked, the switch in x64_sys_call
will jump to sys_nanosleep
.
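To get a feel for this technique outside the kernel, here is a self-contained sketch of the same idea: a switch whose cases are generated from a macro list playing the role of the generated syscalls_64.h (all names below are made up for illustration):

#include <stdio.h>

/* Stand-in for the generated system call table. */
#define SYSCALL_LIST    \
    X(0,  handle_read)  \
    X(1,  handle_write) \
    X(35, handle_nanosleep)

static long handle_read(void)       { return 100; }
static long handle_write(void)      { return 200; }
static long handle_nanosleep(void)  { return 300; }
static long handle_ni_syscall(void) { return -1; }   /* "not implemented" */

static long dispatch(unsigned int nr)
{
    switch (nr) {
    /* Each list entry expands to: case <nr>: return <handler>(); */
#define X(num, fn) case num: return fn();
    SYSCALL_LIST
#undef X
    default: return handle_ni_syscall();
    }
}

int main(void)
{
    printf("%ld\n", dispatch(35));   /* dispatches to handle_nanosleep */
    printf("%ld\n", dispatch(7));    /* unknown identifier */
    return 0;
}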
1.3. Implementation of nanosleep
sys_nanosleep
calls hrtimer_nanosleep
, itself calling do_nanosleep
that contains the core logic of the sleep operation: the function schedule
is called in a loop for as long as the nanosleep delay has not fully elapsed.
That function is implemented by the Linux scheduler, and will effectively set the current task to sleep and trigger a context switch to another task ready to run.
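We cannot run that kernel-side loop directly, but the "sleep again until the delay has fully elapsed" pattern can be sketched in user space with an absolute deadline (an analogy only, not the kernel code):

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_sec += 1;   /* sleep for one second */

    /* clock_nanosleep() may return early (e.g. interrupted by a signal);
       keep sleeping until the absolute deadline has been reached. */
    while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL) != 0)
        ;
    puts("delay fully elapsed");
    return 0;
}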
2. The Scheduler
2.1. Grabbing Task Structs
schedule
calls __schedule_loop
, which itself calls the main scheduler function: __schedule
.
It is called on 2 occasions:
- When a user program explicitly blocks, e.g. our example with nanosleep.
- Upon return from interrupt or system call.
This function first obtains the data structure of type task_struct
representing the current task, and stores it into a variable named prev
:
cpu = smp_processor_id();
rq = cpu_rq(cpu);
prev = rq->curr;
Here cpu
identifies the core executing the code, and rq
is its runqueue: the list of tasks ready to run on that core.
Next the scheduler gets the task_struct
of the next task that should run on this core:
next = pick_next_task(rq, prev, &rf);
There are many rules determining which task should run next, and they depend on the scheduler model used (Linux has several schedulers): the choice can be based on priorities, fair division of CPU time between tasks, etc.
Here just know that next
is the task struct of whatever task the scheduler decided was going to run on the current core next.
In the rest of this document we will refer to the task being scheduled out of the CPU as the previous task (here
prev
), and to the task to be scheduled as the next task (next
).
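To fix ideas, here is a toy, self-contained sketch of the "pick the highest priority ready task from the runqueue" idea; the types and the policy below are made up for illustration and are far simpler than the kernel's:

#include <stdio.h>
#include <stddef.h>

struct task { int pid; int prio; struct task *next; };   /* toy task struct */
struct runqueue { struct task *head; };                   /* toy per-core runqueue */

/* Toy policy: the ready task with the highest priority wins. */
static struct task *pick_next_task(struct runqueue *rq, struct task *prev)
{
    struct task *best = prev;
    for (struct task *t = rq->head; t != NULL; t = t->next)
        if (best == NULL || t->prio > best->prio)
            best = t;
    return best;
}

int main(void)
{
    struct task b = { .pid = 2, .prio = 10, .next = NULL };
    struct task a = { .pid = 1, .prio = 5,  .next = &b };
    struct runqueue rq = { .head = &a };

    struct task *next = pick_next_task(&rq, &a);
    printf("next task: pid %d\n", next->pid);   /* prints: next task: pid 2 */
    return 0;
}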
2.2. Disabling Interrupts
A bit earlier in the __schedule
function, interrupts are disabled with local_irq_disable
, which ends up calling native_irq_disable
, an architecture-specific function.
For x86-64 it looks like this:
static __always_inline void native_irq_disable(void)
{
asm volatile("cli": : :"memory");
}
Here inline assembly is used to call the cli
x86-64 instruction that effectively disables interrupts.
As a side note, we have here what may look like a function implementation within a header file, something we previously said was not OK. In this case it is actually fine, because the function in question is declared as inline.
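The same inline assembly machinery is available to user space programs. As a sketch (assuming x86-64 and GCC/Clang), here is a small helper using the same asm volatile form and "memory" clobber to read the CPU's time-stamp counter:

#include <stdint.h>
#include <stdio.h>

static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    /* rdtsc puts the counter in %eax (low half) and %edx (high half); the
       "memory" clobber prevents the compiler from moving memory accesses
       across the asm statement, as in native_irq_disable() above. */
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    printf("tsc = %llu\n", (unsigned long long)read_tsc());
    return 0;
}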
2.3. Determining if a Context Switch is Needed
Let's go back to __schedule
.
At that stage the kernel checks if the next task is different from the previous one:
if (likely(prev != next)) {
If prev
and next
represent the same task, there is no need for a context switch: the function will return, and soon the kernel will return to the current task for it to resume.
If the tasks are different, then the kernel needs to context switch to the next task.
To that aim the context_switch
function is called.
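A quick side note on the likely() annotation seen in the condition above: it is a kernel macro built on the compiler's branch prediction hint, telling it which outcome to optimize for. A minimal user space equivalent looks like this (a sketch relying on the GCC/Clang builtin):

#include <stdio.h>

/* Hint to the compiler that the condition is expected to be true (or false),
   so it can lay out the common path first. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int main(void)
{
    int prev = 1, next = 2;
    if (likely(prev != next))   /* common case: a context switch is needed */
        puts("context switch needed");
    return 0;
}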
3. Context Switch
As explained earlier, the context switch consists of switching the contents of many CPU registers from the state that corresponds to the execution of the previous task to that of the next task. Notable registers whose values are switched include:
- The general purpose registers.
- The stack pointer (each task has its own kernel stack).
- The register holding the root of the page table, because each task has its own address space.
Note that here we are running kernel code and the context switch happens in kernel context: we'll be switching from the CPU state of the kernel running in the context of the previous task to the CPU state of the kernel running in the context of the next task. This is not to be confused with the user-to-kernel state switching we saw earlier, which happens upon system calls and other forms of interrupts.
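Before looking at the kernel code, here is a user space analogy of the whole operation (a sketch using the POSIX ucontext API; the kernel mechanism is different, but the principle of saving one execution state and restoring another is the same):

#include <stdio.h>
#include <ucontext.h>

static ucontext_t ctx_main, ctx_task;
static char task_stack[64 * 1024];

static void task(void)
{
    puts("task: running");
    swapcontext(&ctx_task, &ctx_main);   /* save task state, restore main */
    puts("task: resumed");
}

int main(void)
{
    getcontext(&ctx_task);
    ctx_task.uc_stack.ss_sp = task_stack;
    ctx_task.uc_stack.ss_size = sizeof(task_stack);
    ctx_task.uc_link = &ctx_main;        /* where to go when task() returns */
    makecontext(&ctx_task, task, 0);

    puts("main: switching to task");
    swapcontext(&ctx_main, &ctx_task);   /* save main state, restore task */
    puts("main: back, switching again");
    swapcontext(&ctx_main, &ctx_task);
    puts("main: done");
    return 0;
}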
3.1. Switching Page Tables
context_switch
takes as parameters the runqueue for the current core, pointers to the previous and next tasks, as well as to some flags.
It starts by checking what type of task we are switching to:
if (!next->mm) {
Tasks can be either standard user space programs, or kernel threads.
Switching to a kernel thread (when next->mm
is 0
) does not require a page table switch because the memory corresponding to the kernel is mapped in the address space of all user programs (although for obvious security reasons not directly accessible from user code): the address space currently in use is that of the previous task, and it will contain all the kernel memory that any kernel thread may need to address.
If we are switching to a standard user space task, we need to switch page tables to install its address space.
This is the job of switch_mm_irqs_off
, which executes a bunch of code and in particular switches the content of the %cr3
x86-64 register holding the root of the page table with native_write_cr3
:
static inline void native_write_cr3(unsigned long val)
{
asm volatile("mov %0,%%cr3": : "r" (val) : "memory");
}
Here the inline assembly takes val
(root of the page table of the next task) and use the mov
instruction to write it in %cr3
.
3.2. Switching Stack Pointer and General Purpose Registers
Now that the page table is taken care of we have loaded the address space of the next task.
The CPU state corresponding to this task now needs to be switched too.
If we go back to context_switch
, at that stage it will call switch_to
to handle that, which itself will call the x86-64 architecture-specific function __switch_to_asm
.
This function is defined in an assembly file, and contains the core logic of the CPU state switch:
pushq %rbp
pushq %rbx
pushq %r12
pushq %r13
pushq %r14
pushq %r15
movq %rsp, TASK_threadsp(%rdi)
movq TASK_threadsp(%rsi), %rsp
popq %r15
popq %r14
popq %r13
popq %r12
popq %rbx
popq %rbp
jmp __switch_to
At the beginning of the function the stack pointer (register %rsp
) still references the previous task's stack.
The callee-saved registers representing the previous task's CPU kernel state are saved on that stack with a series of push
instructions.
Then the previous task's stack pointer is saved into its task struct with a movq
instruction, and the next task's stack pointer is loaded from its own task struct with another movq
.
Now that the stack pointer references the stack of the next task, with a series of pop
instructions the values of the callee-saved registers for that task are loaded.
These values were previously saved on the next task's stack the last time it was scheduled out of the CPU (i.e. it is the next task's CPU kernel state), similarly to what we just described for the previous task.
Finally, the code goes back to C by jumping to __switch_to
.
__switch_to
takes care of saving and restoring more CPU state, such as segment registers and floating point registers.
When __switch_to
returns, the next task is now the current task.
In other words the kernel now executes in the context of the next task: as you know the return path is defined by the values of return addresses stored on the stack, so we take the next task's return path up to schedule
, which itself returns to whatever part of the kernel code invoked it the last time the next task was scheduled out.
4. System Call: Exiting the Kernel
Let's assume that at this occasion the next task was scheduled out following an explicit system call to nanosleep
.
schedule
returns back to do_nanosleep
, which exits its waiting loop and returns to hrtimer_nanosleep
, which itself returns to the top level implementation of nanosleep
, i.e. sys_nanosleep
.
The system call handling code returns back from do_syscall_64
in the assembly system call entry point.
The user space state of the next task is then popped from the stack where it was saved the last time the task entered the kernel (when it called nanosleep
):
POP_REGS pop_rdi=0
POP_REGS
expands into a series of popq
instructions restoring the value for general purpose registers.
The kernel stack pointer is then saved to a scratch space, and from that scratch space the user space stack pointer is popped into %rsp
:
movq %rsp, %rdi
movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
pushq RSP-RDI(%rdi) /* RSP */
/* ... */
popq %rsp
Finally, we exit kernel code to return to user space:
sysretq
The sysretq
instruction switches the processor mode from supervisor to user (i.e. unprivileged) mode, and jumps to the first user space instruction after syscall
: the application resumes its execution.
Conclusion
And that's it! Through this example we have seen why C is a very practical language for writing operating systems: for some very specific operations such as system call handling and context switching, they require close interaction with assembly code (inline as well as in the form of assembly files), as well as full control over the location and layout of data structures in memory.