Linux Kernel Development

PDF


Chapter 1/2: Introduction

  • Everything is a file, except sockets

  • libc doesn’t map 1:1 to system calls, but provides other niceties (buffering, for example)

  • Hardware raises interrupts, interrupting the kernel, which responds by invoking the appropriate interrupt handler in an _interrupt context

  • in Linux, we can generalize that each processor is doing exactly one of three things at any given moment:

    • In user-space, executing user code in a process
    • In kernel-space, in process context, executing on behalf of a specific process
    • In kernel-space, in interrupt context, not associated with a process, handling an interrupt
  • Linux requires an MMU, but specialized versions (embedded?) can run without one

  • Linux is monolithic, and so is a single static binary that uses a single address space, but can still load modules dynamically

  • Windows NT and Mach are microkernels, but run all “servers” in kernel-mode so as to reduce context-switching overheads

  • Processes can be preempted, even when[ executing in kernel-mode during a syscall

  • Linux takes an interesting approach]() to thread support: It does not differentiate between threads and normal processes.To the kernel, all processes are the same— some just happen to share resources.

  • Kernel memory can’t be swapped to disk (no page faults/etc.)

  • Versioning

    • 300
    • Odd minor versions are development kernels, even minor versions are production kernels
  • The kernel source is typically installed in /usr/src/linux

  • 500

  • Configuration

    • Kernel features are configured by CONFIG_ directives
    • Build flags are booleans or tristates (true, false, module), allowing drivers to be either compiled into the kernel (true) or to be enabled as modules (module)
    • make config to interactively set configuration flags
    • make menuconfig is a curses alternative (or gconfig for GTK)
    • make defconfig uses the defaults for the current architecture
    • .config contains a full list of concrete config flags
    • make oldconfig to validate .config
    • The currently running kernel can be induced to writing its configuration to /proc/config.gz
    • And then just make to build
  • Kernel restrictions vs. regular C code

    • No libc
      • Kernel code can’t access libc or use standard C headers
      • Common APIs are reimplemented, like linux/string.h, located at include/ in the source tree. No printf, use printk instead
    • GNU C, not ANSI C
      • Includes GNU extensions, but (at the time of writing) only GCC could compile the kernel; the relevant extensions are:
        • Inline functions
        • Inline asm
        • likely()/unlikely() to annotate branches
    • No memory protection
      • No SIGSEGV, etc.
      • Kernel memory isn’t paged, everything has to sit in physical memory
    • Floating point
    • Small stack
      • Configurable at compile time, either 4KB or 8KB
      • Not resizable at runtime

Chapter 3: Process Management

  • fork is implemented via the clone syscall
  • Parent processes wait on children using wait; completed children are in a zombie state until waited on
  • The kernel uses a circular doubly linked list to store the list of processes, each entry is a task_struct (appears to be defined in sched.h)
    • Instances of task_struct are 1.7kB on 32-bit machines

    • Allocated by the slab allocator (multiple free lists, some containing pre-warmed copies of commonly allocated structs)

    • Prior to 2.6, task_struct was stored at the end of the kernel stack for each process, so its location could be determined with just the stack pointer

    • Now that it’s dynamically allocated (and kernel memory can’t use VM) the end of the stack now contains a thread_info struct that points to the task_struct

    • 400

    • Pids are typically an int, but are limited to 32768 by default, and be configured up to a limit of 4 million:

      /*
       * This controls the default maximum pid allocated to a process
       */
      #define PID_MAX_DEFAULT (CONFIG_BASE_SMALL ? 0x1000 : 0x8000)
      
      /*
       * A maximum of 4 million PIDs should be enough for a while.
       * [NOTE: PID/TIDs are limited to 2^30 ~= 1 billion, see FUTEX_TID_MASK.]
       */
      #define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
        (sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))
      
    • Each architecture implements the current macro to point to the task_struct of the currently running process

      • Some use a register to store the pointer, like PowerPC
      • Others (like x86, which apparently doesn’t have registers to spare) use the thread_info struct at the end of the kernel stack
      • 600
  • Process states
    • TASK_RUNNING: running or waiting to run

    • TASK_INTERRUPTIBLE: blocked/sleeping, wakes up when unblocked or a signal is received

    • TASK_UNINTERRUPTIBLE: blocked/sleeping, wakes up only when unblocked

      • These show up with the D state in ps and can’t be killed with kill
    • __TASK_TRACED: task is being traced by a debugger

    • __TASK_STOPPED: execution has stopped, in response to SIGSTOP/etc.

    • ps process states

      Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display to describe the state of a process:
         D    uninterruptible sleep (usually IO)
         I    Idle kernel thread
         R    running or runnable (on run queue)
         S    interruptible sleep (waiting for an event to complete)
         T    stopped by job control signal
         t    stopped by debugger during the tracing
         W    paging (not valid since the 2.6.xx kernel)
         X    dead (should never be seen)
         Z    defunct ("zombie") process, terminated but not reaped by its parent
      
    • set_current_state to change the state of the currently running process

  • Process hierarchy
    • All processes descend from pid 1, which is typically init

      ❯ file /sbin/init
      /sbin/init: symbolic link to ../lib/systemd/systemd
      
    • The kernel starts the init process as the last step of te boot process

    • Each task_struct is linked to that process' parent & children

    • It looks like (on v5.18) task_struct maintains a list of siblings too

    • init’s task_struct is statically allocated as init_task

    • Use (task_struct *)->tasks to iterate all tasks in the system, tasks is a flat doubly linked list that each new task is added to.

  • Process creation
    • Forked processes have a unique PID and a different “parent PID” and don’t inherit pending signals, but are otherwise identical to the forking process
    • All shared resources are backed by CoW pages so forking is fast
    • Page tables and the task_struct have to be actually copied during a fork
    • Forking uses the clone syscall, which allows specifying the specific resources that the parent and child should share
    • In kernel space, clone starts off at kernel_clone in kernel/fork.c (as of 5.18, the book is out of date)
      • It calls copy_process, which
      • Duplicates task_struct and sets a number of flags
      • Allocates a new PID
      • Either copies or shares (based on the flags to clone) open fds, signal handlers, address space, etc.
    • vfork is a variant that doesn’t copy the parent’s page table entries; the parent is blocked until the child returns or calls exec
    • Threads
      • The kernel has no notion of threads (specifically); they’re just processes that share resources with other processes

      • Every thread has its own task_struct

      • Threads are created with clone too, and are identical to a fork except that they share an address space, open fds, signal handlers, and “fs info” (not sure what this is yet). These flags are defined in sched.h, and are presumably grouped in this way in pthreads:

        clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
        
      • These flags make reference to a TID (thread id?) and TLS (thread-local storage)

      • Kernel threads don’t have a VM address space, but are otherwise identical to normal threads

      • ps -ef to see kernel threads

    • do_exit in kernel/exit.c is called when a task exits, and handles things like setting the error code in the process' task_struct and signaling the parent. The process is a zombie at this point until waited on by the parent.
      • When the process is finally reaped, the task_struct is deallocated, the process is removed from the task list, and the process' kernel stack and thread_info struct are deallocated too.
      • If the parent exits before the child, the child is reparented onto either another process in the current “thread group” or init

Chapter 4: Process Scheduling

  • Processes that block on IO are put on a wait queue, and are put back on a runqueue when the IO is done
  • Two types of multitasking OSes
    • Cooperative: processes yield control explicitly
    • Preemptive: processes can be interrupted at any point
  • O(1) scheduler introduced in the 2.5 series, constant-time algorithm
    • Better than the predecessor, worked well for server workloads, but didn’t work well with interactive applications
    • Replaced by the Completely Fair Scheduler in 2.6.23
  • Low latency vs. high througput
    • Favors IO-bound processes, delivering low latency (time to first response)
  • Priority-based scheduling
    • Processes with higher priority run before processes with lower priority
    • Round robin for processes at the same priority level
    • This sounds like the MLFQ from
      Book
      Operating Systems - Three Easy Pieces
    • Two separate priority ranges:
      • nice value: -20 to +19, default is zero, larger values are lower-priority
        • Controls the proportion of timeslice a process receives
      • real-time priority: 0 to 99, higher values are higher-priority
        • All real-time processes are at a higher priority level than normal/niced processes
  • Timeslices
    • Smaller timeslices favor IO-bound processes, but context switching overhead may dominate
    • Larger timeslices favor CPU-bound processes
    • CFS doesn’t assign a concrete timeslice, but a “proportion of the processor”, weighted by the nice value
    • CPU-bound processes are free to use more than their proportion of the processor
    • When IO-bound processes are awoken, they immediately preempt running CPU-bound processes if they haven’t used their proportion of the processor yet
    • Is this “proportion of the processor from boot time until the current instant” or is time divided into epochs/etc.?
  • Scheduler algorithms are pluggable, and can even coexist by being assigned to different groups of processes each
    • Each algorithm has a priority, highest-priority algorithm with a runnable process wins
    • CFS is the default frror normal processes (SCHED_NORMAL)
  • Not ideal to link a nice value directly to a timeslice duration
    • Background jobs are typically given higher nice values and interactive jobs are given lower nice values
      • But higher nice values will be given smaller timeslices and lower nice values will be given larger timeslices
      • Which is exactly the opposite of what you want
    • Nice values must be linked to timer ticks
    • Constant switching rate but variable fairness
      • As opposed to CFS, which yields a variable switching rate but constant fairness
  • CFS high-level operation
    • Pick a duration in which all runnable tasks are expected to run at least once (this is 20ms by default), call this targeted latency

    • With N runnable tasks, each is executed for (20/N)ms, assuming all at the same priority level (regardless of the specific priority they have)

    • If the tasks have different priorities, the % is weighted by the priority, but all runnable tasks still execute in the 20ms window

    • With the one caveat that a single task cannot execute for less than 1ms (to account for context switching costs). This is configurable and is called the minimum granularity

    • So if the targeted latency is 20ms and there are more than 20 runnable tasks, some of them won’t run in the 20ms window

    • Are the tasks that didn’t run in this scenario prioritized in the next 20ms window?

    • It looks like these numbers are a bit different in 5.18, and vary based on core count, and don’t appear to be configurable without CONFIG_SCHED_DEBUG set:

      /*
       * Targeted preemption latency for CPU-bound tasks:
       *
       * NOTE: this latency value is not the same as the concept of
       * 'timeslice length' - timeslices in CFS are of variable length
       * and have no persistent notion like in traditional, time-slice
       * based scheduling concepts.
       *
       * (to see the precise effective timeslice length of your workload,
       *  run vmstat and monitor the context-switches (cs) field)
       *
       * (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds)
       */
      unsigned int sysctl_sched_latency			= 6000000ULL;
      static unsigned int normalized_sysctl_sched_latency	= 6000000ULL;
      
      /*
       * Minimal preemption granularity for CPU-bound tasks:
       *
       * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
       */
      unsigned int sysctl_sched_min_granularity			= 750000ULL;
      static unsigned int normalized_sysctl_sched_min_granularity	= 750000ULL;
      
    • Latency is 24ms and granularity is 3ms on my machine:

      ❯ zcat /proc/config.gz | grep SCHED_DEBUG
      CONFIG_SCHED_DEBUG=y
      ❯ sudo cat /sys/kernel/debug/sched/min_granularity_ns
      3000000
      ❯ sudo cat /sys/kernel/debug/sched/latency_ns
      24000000
      ❯ sudo ls /sys/kernel/debug/sched/
      debug	features		 latency_ns	  latency_warn_once  min_granularity_ns  numa_balancing  tunable_scaling  wakeup_granularity_ns
      domains  idle_min_granularity_ns  latency_warn_ms  migration_cost_ns  nr_migrate	 preempt	 verbose
      
      • This info used to be available in /proc but isn’t there anymore. Here’s some relevant documentation for /sys/kernel/debug: *

        Debugfs exists as a simple way for kernel developers to make information available to user space. Unlike /proc, which is only meant for information about a process, or sysfs, which has strict one-value-per-file rules, debugfs has no rules at all. Developers can put any information they want there. The debugfs filesystem is also intended to not serve as a stable ABI to user space; in theory, there are no stability constraints placed on files exported there.

    • For processes that don’t execute any syscalls, their runtime is only accounted for by the timer interrupt

      • The timer interval can be as large as 10ms, so a CPU-bound process may not be preempted for 10ms regardless of the targeted latency/timeslice
      • Chapter 11 for more details
  • CFS Implementation
    • struct sched_entity to track how long each process has executed, embedded in task_struct
    • Some of this data is visible at /proc/<pid>/sched
    • vruntime: actual runtime weighted by the number of processes, nice level, etc.
    • sum_exec_runtime: actual runtime
    • This bookkeeping data is updated periodically by the system timer, but also when a process becomes runnable or blocks
    • The basic scheduling mechanic is: the process with the lowest vruntime is scheduled to be run next
    • Uses a red-black tree (rbtree), which enables efficient (logn) searching
      • The node with the smallest vruntime is the leftmost leaf node in the tree
      • If we’re only searching for the process with the smallest vruntime, why not a min-heap instead?
      • How does this implementation fare wrt cache locality?
    • Blocked processes are removed from the rbtree and are placed on a wait queue, and schedule() (kernel/sched/core.c) is called to pick a new process to execute
    • When blocked processes are unblocked, they wake up and preempt currently running processes if their priority is higher
    • Spurious wake up: blocked processes wake up via signal, before the thing they’re blocking on is ready
    • The need_resched flag (within thread_info) is set when the kernel is supposed to preempt the currently-running process (by the scheduler or when a higher-pri blocked process wakes up). The flag is checked:
      • When returning to userspace (from a system call or interrupt handler)
      • When a kernel thread is holding zero locks
      • When a kernel thread blocks or calls schedule()
  • Real-time scheduling
    • Policies: SCHED_NORMAL (managed by CFS), SCHED_FIFO and SCHED_RR (managed by the real-time scheduler)
    • SCHED_FIFO tasks always have priority over SCHED_NORMAL tasks, and run to completion (potentially blocking and/or yielding along the way)
    • SCHED_RR is similar but with timeslices
    • RT-specific priority levels, round robin between RT tasks at the same priority levels
    • Soft-realtime guarantees, but timing perf is good in practice
  • Scheduler syscalls
    • 500
    • Affinity: the set of CPUs a process is allowed to run on, taskset is a utility to manage this externally
    • Forked processes inherit their parents' affinity mask (stored in task_struct)

Chapter 5: System Calls

  • System calls wrapped by a C library/API to avoid coupling userspace code directly to the kernel interface
  • libc writes errors into a global errno variable
  • getpid returns the thread group ID, so all threads in a process get the same PID
    • Does each thread have a separate pid/tid internally?
  • Kernel syscall handlers are sys_<syscall_name>
  • Every syscall has a syscall number
  • Raise a software interrupt to invoke a syscall, which is a specific hardware-defined interrupt type
    • INT $0x80 on x86
    • x86 more recently has the sysenter instruction to invoke a syscall without an interrupt
  • Pass the syscall number and args in registers (defined by the arch’s calling convention)
    • eax for the syscall number and the return value on x86
  • copy_from_user & copy_to_user to copy memory from/to userspace
    • Is there no reasonable way (apart from io_uring) to do this without copying?
  • Syscalls can’t accept args directly, lots of verification
    • Make sure pointers are valid in the process' address space
    • Pointers point to memory that is readable, writable, and/or executable when necessary
    • PIDs must be valid
    • FDs must be valid
    • The process/user must have permission to do the thing ← uses capabilities for this kind of thing
  • The kernel is in the process context when executing a syscall, current points to the process that invoked the syscall
    • This allows syscalls to be preempted, unlike raw interrupt handlers
  • Use the __syscallN family of macros to invoke a syscall without library support
  • Adding a syscall to the kernel
    • Update the listing in syscall64.tbl:

      --- a/arch/x86/entry/syscalls/syscall_64.tbl
      +++ b/arch/x86/entry/syscalls/syscall_64.tbl
       448	common	process_mrelease	sys_process_mrelease
       449	common	futex_waitv		sys_futex_waitv
       450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
      +451	common	foo	sys_foo
      
    • Add the syscall somewhere in kernel/:

      --- a/kernel/sys.c
      +++ b/kernel/sys.c
        return task_tgid_vnr(current);
       }
      
      +SYSCALL_DEFINE0(foo)
      +{
      +	printk(KERN_INFO "HELLO\n");
      +	return 0;
      +}
      +
       /* Thread ID - the internal kernel "pid" */
       SYSCALL_DEFINE0(gettid)
      
    • And run it with:

      #include <unistd.h>
      #include <stdio.h>
      
      int main(int argc, char **argv) {
        long res = syscall(451);
        printf("System call returned %ld.\n", res);
        return res;
      }
      
    • And it works!

      $ ./out && dmesg | tail -n1
      System call returned 0.
      [  215.002429] HELLO
      

Chapter 6: Kernel Data Structures

  • Linked lists

    • Linux’s implementation is a circular doubly-linked list
    • Embed a list node inside a data struct, don’t add pointers to the data struct directly:
    struct fox {
      unsigned long tail_length;
      unsigned long weight;
      bool is_fantastic;
    
      struct list_head list;
    };
    
    • fox.list.next points to the list_head of the next fox
    • Use list_entry to get to the next fox itself, which uses offsetof and pointer arithmetic to get to the containing struct
  • Queues

    • Generic queue implementation: kfifo
    • Fixed length, enqueues don’t work (blocking? probably not) when the queue is full
  • Maps

    • Hash tables are common, but can be implemented with balancing binary trees too
    • Hash tables have better lookup on average (O(1) vs O(logn)) but worse lookup in the worst case (O(n) vs O(logn) - all elements could be hashed to the same bucket)
    • The kernel provides a single map data structure (idr) used to map a unique ID (number) to a pointer
  • Trees

    • Primary implementation is a red-black tree (rbtree)
    • Users are expected to write their own search and insert routines
    • What is included, then?
  • The kernel also includes a few less-often-used data structures: radix trees & bitmaps

Chapter 7: Interrupts and Interrupt Handlers

  • How is the kernel to know when a hardware operation is done?

    • Polling (good for tasks that are done quickly) vs. interrupts (good for tasks that take a while/variable length)
    • Interrupts are physical signals directed to pins on an interrupt controller
      • Which multiplexes many interrupt lines into a single interrupt line for the processor
    • The processor stops executing and notifies the OS that an interrupt has occurred
    • Each interrupt type has a numeric value attached, called an IRQ (interrupt request) line
    • Interrupts are asynchronous wrt the process clock, exceptions are synchonous wrt the clock
    • The linux kernel uses the same infrastructure to handle interrupts and exceptions
    • Invoking a system call uses an exception, but is otherwise treated just like an interrupt
  • Each device is associated with an interrupt handler, defined in the device driver, and are invoked in the interrupt context

    • Interrupts can happen at any time, so interrupt handlers must execute quickly
    • The kernel can’t invoke anything that may sleep or block in interrupt context (including malloc)
    • Can multiple invocations of an interrupt handler for the same device run in parallel?
      • By default, an interrupt handler runs with that handler’s interrupt type disabled, but all other interrupt types enabled
      • It used to be possible to register an interrupt handler to disable all other interrupts when it’s invoked (IRQF_DISABLED), but this has been removed in favor of disabling all interrupts all the time: https://lwn.net/Articles/380931/
    • Interrupt handlers may also be required to perform a lot of work (network handlers have to DMA incoming data into memory)
    • These goals are at odds, so execution is split:
      • Top half: acknowledge interrupt + any other work that is time sensitive
      • Bottom half: deferred work that happens later
    • For example, a network card’s top half is typically just copying data to main memory, and its bottom half is to process/unwrap packets/etc.
  • Call request_irq to register an interrupt handler

  • Shared interrupt handlers

    • Multiple handlers can share the same IRQ line
    • Use a dev parameter to disambiguate invocations (and removal)
    • Does an invoked interrupt handler disable the entire line or just its own type?
  • The setup of an interrupt handler’s stacks is a configuration option. Historically, interrupt handlers did not receive their own stacks. Instead, they would share the stack of the process that they interrupted.

    • Does this mean interrupt handlers potentially have carte blanche over stack data for any process on the system?
    • In 2.6, interrupt handlers were given their own stack, one per processor, one page in size
  • 700

  • /proc/interrupts contains interrupt statistics

    • First column is the IRQ line number, next N columns are the number of interrupts received per CPU
    • Next is the interrupt controller, followed by the device
  • An interrupt handler can disable interrupts completely (for the current processor) or mask out a given line (for the entire machine)

Chapter 8: Bottom Halves & Deferring Work

  • Interrupt handlers need to run quickly because:

    • They block all other interrupt handlers
    • Hardware may require that the interrupt is handled in a given time window
  • Divide interrupt handling into two halves:

    • Urgent/time-sensitive operations are executed in-band
    • Slower/less-urgent work is performed offline
    • Blocking code can only be performed offline
  • Mechanisms

    • The original bottom half (BH) mechanism
      • 32 static pre-registered bottom halves
      • Top half sets one bit in a 32-bit integer to pick the bottom half to run
      • No two BHs could run at the same time (globally, not per IRQ/etc.)
    • Task queues
      • Eventual replacement for the BH mechanism, never really came to fruition
      • Pre-defined set of queues, top halves add functions to a given queue (linked list)
      • Each queue is executed at some point (different frequencies)
      • Too inflexible, eventually supplanted by…
    • softirqs & tasklets
      • softirqs are static bottom halves that can run concurrently (even two of the same type)
      • tasklets are dynamic bottom halves built on top of softirqs - can only run one instance of a given type of tasklet at once
      • In general tasklets are easier to use but less performant
    • work queues
      • Queue work to be performed later in process context
      • This is a third way to schedule bottom half execution
    • Use softirqs, tasklets, or work queues to defer work with no guarantees about actual execution time, use kernel timers to defer work to a specific instant in the future
    • 400
  • Softirqs

    • “Statically allocated at compile time” <- Kernel compile time or program compile time?

      • Looks like the former
      • 32 slots, 9 used at the time of writing, 10 today:
        ❯ cat /proc/softirqs
                        CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
              HI:      80792         20         14     343194         14         11          6         12
           TIMER:     516101     213184      95841     140075     249923     104994      91705     214533
          NET_TX:          6          3          0          8          7         10          1         13
          NET_RX:      20759      19622      19696      27211      24015     367052      22167   22064693
           BLOCK:      53864     466082       1018       1349       1037        690        805        798
        IRQ_POLL:          3          0          0          2          0          2          0          0
         TASKLET:     130047       2349       3275     543642       4004       3351       2698       4472
           SCHED:    2300923    1944712    1827287    1804593    1791739    1727666    1737655    1833934
         HRTIMER:          5          6          3          5          6       6931          2    4469550
             RCU:    1046053    1314967    1384321    1378346    1189830    1378833    1385584    1399476
        
    • The network and block device subsystems are the only two that are critical enough to warrant having dedicated softirqs

    • An interrupt handler must mark or raise a softirq to have it execute in the future

    • The kernel looks for raised IRQs to run every time a hardware interrupt completes, but also in the ksoftirqd kernel thread

    • It looks like both halves of interrupt handlers are not necessarily handled by the same CPU that issued the IO:

      # Both /proc/interrupts and /proc/softirqs show activity on CPU5 regardless of the CPU affinity of the process
      seq 500 | xargs -P16 -I{} taskset 0x1 curl -s -o /tmp/{}.txt https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
      
    • Adding a new softirq is fairly straightforward:

      diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
      index a92bce40b04b..73b7b0bfdfc6 100644
      --- a/include/linux/interrupt.h
      +++ b/include/linux/interrupt.h
      @@ -556,6 +556,7 @@ enum
         TASKLET_SOFTIRQ,
         SCHED_SOFTIRQ,
         HRTIMER_SOFTIRQ,
      +	TIM_IRQ_SOFTIRQ,
         RCU_SOFTIRQ,    /* Preferable RCU should always be the last softirq */
      
         NR_SOFTIRQS
      diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
      index eeceafaaea4c..6451d6b7163d 100644
      --- a/include/trace/events/irq.h
      +++ b/include/trace/events/irq.h
      @@ -20,6 +20,7 @@ struct softirq_action;
             softirq_name(TASKLET)		\
             softirq_name(SCHED)		\
             softirq_name(HRTIMER)		\
      +			softirq_name(TIM_IRQ)		\
             softirq_name_end(RCU)
      
       #undef softirq_name
      diff --git a/kernel/softirq.c b/kernel/softirq.c
      index 9f0aef8aa9ff..b84d125cfd5a 100644
      --- a/kernel/softirq.c
      +++ b/kernel/softirq.c
      @@ -62,7 +62,7 @@ DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
      
       const char * const softirq_to_name[NR_SOFTIRQS] = {
         "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL",
      -	"TASKLET", "SCHED", "HRTIMER", "RCU"
      +	"TASKLET", "SCHED", "HRTIMER", "TIM_IRQ", "RCU"
       };
      
       /*
      
    • And that gets it into /proc/softirqs:

        $ cat /proc/softirqs
                        CPU0       CPU1       
              HI:          0          0
           TIMER:        171        403
          NET_TX:          1          0
          NET_RX:         14          0
           BLOCK:        777         60
        IRQ_POLL:          0          0
         TASKLET:          1          0
           SCHED:        286        284
         HRTIMER:          0          0
         TIM_IRQ:          0          0
             RCU:       1285       1436
      
    • The list of softirqs in interrupt.h is ordered by priority:

    • softirq handlers are similar to interrupt handlers and can’t sleep/block

    • An interrupt handler raises a softirq with (how is data passed to the softirq handler?):

      raise_softirq(NET_TX_SOFTIRQ);
      
  • Tasklets

    • Deprecated as of 5.19 in favor of threaded IRQs (which I think are just workqueues, covered below)
    • Probably the bottom-half impl you want in most cases
    • Implemented via two softirqs (high & low pri): HI_SOFTIRQ and TASKLET_SOFTIRQ
    • Tasklets are scheduled, softirqs are raised (semantically equivalent)
    • Tasklets are scheduled via tasklet_struct instances, which contain function pointers and args (why don’t softirqs have this?)
      • Two linked lists store all the pending (and running) tasklet_struct instances (one for each priority)
      • Each instance has a count that’s used to ensure that only one handler for that tasklet is running at a given time
    • Tasklets can’t sleep either (wasn’t one of the motivations between splitting to top/bottom halves the ability to sleep/block?)
    • Tasklets run with interrupts enabled, and so must disable IRQ lines manually if data is shared between the halves
    • Scheduling a scheduled tasklet runs it once, not twice
  • ksoftirqd

    • softirqs are typically processed when returning from an interrupt handler and manually in critical code, like the networking subsystem
    • A large volume of softirqs can cause userspace to starve, especially when a softirq handler re-raises itself
    • Under heavy interrupt load, the ksoftirq family of threads (one pinned to each processor) is used to process softirqs at a high nice level (19), which lets the scheduler handle balancing this with userspace needs
  • Work Queues

    • Work is deferred into a kernel thread that runs in process context, so sleeping/blocking is allowed.

    • Use a bottom half if blocking isn’t required, or a work queue otherwise

    • One default worker thread per process, but the API allows creating task-specific worker threads (discouraged):

      ❯ ps fxao pid,ppid,cmd | grep -E 'kworker.*events'
            7       2  \_ [kworker/0:0H-events_highpri]
           28       2  \_ [kworker/1:0H-events_highpri]
           42       2  \_ [kworker/3:0H-events_highpri]
           49       2  \_ [kworker/4:0H-events_highpri]
           63       2  \_ [kworker/6:0H-events_highpri]
           70       2  \_ [kworker/7:0H-events_highpri]
       176580       2  \_ [kworker/1:2-events]
       318425       2  \_ [kworker/4:0-events]
       318628       2  \_ [kworker/7:0-events]
       322172       2  \_ [kworker/3:2-events]
       322808       2  \_ [kworker/2:0-events]
       323685       2  \_ [kworker/u16:3-events_power_efficient]
       327630       2  \_ [kworker/3:1-events]
       332768       2  \_ [kworker/u16:1-events_power_efficient]
       333081       2  \_ [kworker/4:3-events]
       334371       2  \_ [kworker/0:1-events]
       334823       2  \_ [kworker/5:2-events]
       334824       2  \_ [kworker/1:1-events]
       335382       2  \_ [kworker/6:1-events]
       335796       2  \_ [kworker/2:2-events]
       336648       2  \_ [kworker/u16:4-events_power_efficient]
       337523       2  \_ [kworker/7:1-events]
       337853       2  \_ [kworker/4:1-events]
       337854       2  \_ [kworker/5:0-events]
       338057       2  \_ [kworker/6:3-events]
       339395       2  \_ [kworker/0:0-events]
      
    • Work can either be scheduled with (run asap) or without (wait at least X ticks in the future) a delay

  • Kernel threads all have ppid 2, which is kthreadd

  • Is printk debugging in top/bottom halves impossible?

    • Use trace_printk instead, which is non-blocking

Chapter 9: Kernel Synchronization

  • Linux uses advisory locks: processes are expected to know what they’re doing, and the kernel doesn’t perform any verification that locks are held
  • Concurrency in the kernel
  • Not sufficient for userspace code and interrupt handlers to share a lock, because if userspace code is interrupted while holding a lock and then the interrupt handler attempts to acquire that same lock, deadlock! Acquiring the lock in userspace must be preceded by disabling interrupts.
  • As of Linux 2.6, kernel threads can be preempted by the scheduler
  • The kernel doesn’t provide recursive locks
  • Coordinated or nested locks must always be acquired in the same order

Chapter 10: Kernel Synchronization Methods

  • Linux provides a basic atomics API (ints and bitmasks) across all architectures

    • atomic_t is an atomic 32-bit integer, there’s also atomic64_t
      • atomic_read to convert an atomic_t to an int
        • Why is this necessary? Is it possible for a single integer to change out from under you during a read? I’m not really sure if the act of pulling a cache line is atomic, but reading a single word should be? 🤔
    • Atomic bitmasks use (void *) pointers and don’t have an associated atomic type
  • Spinlocks

    • Busy wait on a lock
    • Hold spin locks for as little time as possible
    • Only really makes sense when you expect to acquire the lock almost immediately
    • Or in places where you can’t sleep, like interrupt handlers/bottom halves
    • spin_lock_irqsave disables interrupts and acquires a spinlock
    • spin_lock_bh disabled bottom halves and acquires a spinlock
    • read-write spinlocks are available too for things like task lists
  • Semaphores

    • No need to hold up a CPU while locked, but more overhead (list of pending tasks, etc)

    • semaphores are not optimal for locks that are held for short periods because the overhead of sleeping, maintaining the wait queue, and waking back up can easily outweigh the total lock hold time.

    • Ok to sleep (or be interrupted) when holding a semaphore because any other process that attempts to acquire the semaphore will sleep instead of deadlocking

    • Binary (one lock holder) vs counting (configured number of lock holders) semaphores

    • Binary sempahore == mutex

    • Can attempt to acquire a kernel semaphore either in interruptible or uninterruptible sleep mode

    • read-write semaphores are available too

  • Mutexes

    • Logically equivalent to a semaphore with count == 1, but with a simpler interface and better perf
    • “A process cannot exit while holding a mutex” Does this mean that this scenario is disallowed? *

      Perhaps the most useful aspect of the new struct mutex is that, via a special debugging mode, the kernel can programmatically check for and warn about violations of these constraints.

  • Completion variables

  • Big Kernel Lock (BKL)

    • Created to help transition from Linux’s original SMP impl to a version with finer-grained locking
    • Sleeping while holding the BKL safe from deadlock, it’s automatically dropped when a task is preempted and reacquired when rescheduled
    • Recursive, ok for a process to reacquire while already held
    • Only usable in process context
    • Is a spinlock
    • Was removed in 2.6.39 (2011-05-18)
  • Sequential locks

    • Optimistic locking for read-heavy scenarios
    • Writes grab a lock and increment a counter both before and after the write is done
    • Reads only look at the counter before and after the read, and accept the read only the value is the same
  • Kernel code that needs to lock disables preemption instead

  • Barriers

    • Reads and writes to memory can be reordered by the compiler or even the processor
    • Intel x86 processors never reorder writes
    • volatile typically guards against this in userspace
    • rmb() ← read barrier, wmb() ← write barrier, mb() ← rw barrier
    • Reads or writes before the barrier will not be reordered to after the barrier

Chapter 11: Timers and Time Management

  • Periodic events are handled by the system timer

    • This is a piece of HW that fires an interrupt at a regular interval
    • Timer interrupts handle this and do things like update the system time
    • The kernel provides a high-resolution uptime figure to userspace that’s updated via the timer interrupt
    • Timer freq is 100Hz by default on x86, so once every 10ms (this feels slow!); set to 300 on my machine
      ❯ zcat /proc/config.gz | grep CONFIG_HZ=
      CONFIG_HZ=300
      
  • Why set a high HZ?

    • Error bars are (tick_interval / 2), so for a once-every-10ms timer, something that runs on every tick might have to wait up to 5ms if the timing is unlucky
    • The scheduler uses the timer interrupt to preempt tasks that hit their timeslice allocation (timeslice must be >= timer interval?)
  • Why set a low HZ?

    • More frequent interrupts, overhead
  • Tickless operation

    • If CONFIG_HZ is set, reschedule the timer interrupt based on pending timers
    • If no pending timers are set to go off for a duration > the tick interval, wait that long instead
    • Power/overhead savings on idle systems
  • Jiffies

    • Counter that’s incremented on every tick from boot
    • 32-bit overflow in 497 days for 100Hz, 40.7 days for 1000Hz
    • 64-bit overflow is reasonably far in the future even at 1000Hz
  • Use the real-time clock (non-volatile, hardware clock that uses the CMOS battery) to initialize wall-clock time on boot

  • Load averages are calculated on every tick

  • Scheduling: a process running during the timer interrupt is credited as having run for the entire duration since the last timer interrupt

    • A running process might’ve been preempted multiple times in that duration
    • All other processes that ran in the interval but are not running at interrupt time are unaccounted for
    • But the kernel can’t get more accurate than this
    • All processes face this unfairness so fairness is preserved
  • The timer API allows creating a timer with an expiration time (dynamic timers)

    • When the current tick count (jiffies) >= the expiration value, the timer fires
    • Working with the timer API to modify existing timers is unsafe, because a timer run can preempt the code modifying the timerA
    • Timers run in bottom-half context as softirqs after the timer interrupt completes

Chapter 12: Memory Management

  • The MMU typically deals with memory at page-level granularity, so the kernel does too
  • The kernel does use virtual memory, but kernel threads don’t, at least not directly
    • The kernel has a virtual address space just like any other process.
    • All physical memory is mapped into the kernel address space except for “high memory” that isn’t mappable
      • Some archs can virtually map less memory than the physical memory they allow
      • Memory that can’t be placed into kernel space because of this limitation is called “high memory”
      • High memory can only be mapped dynamically
      • https://www.kernel.org/doc/html/latest/vm/highmem.html
    • Is the kernel address space mapped 1:1 with physical memory?
    • Parts of this address space (what section?) are mapped into the initial VM range of each process' address space.
    • This mapping stays constant regardless of which process is running
    • This allows the kernel to preserve that VM region across context switches, so the TLB stays warm
    • Kernel threads (specifically threads, not all kernel code/processes) don’t have their own page tables, but just use the page table of the process that was running before them. They only use the initial address range that stays the same across context switches, so any page table is effectively identical to them
    • Kernel threads use kernel virtual memory, but without maintaining their own page tables
  • Each architecture defines its own page size (does this mean the type of MMU is always fixed on a Z299 board, for example?)
  • struct page in linux/mm_types.h
    • flags stores things like dirty status and whether the page is locked in memory (does :wthis mean there’s no backing store or there is one but it’s unavailable/full?), defined in page-flags.h
    • _count stores the ref count to this page
    • virtual is the kernel virtual address for the page, unless the page lives in the high memory area, in which case this is NULL
    • Every instance represents a physical page, mapped to the kernel’s address space
    • ~20 bytes each, so ~80MB for 16GB memory with 4KB pages
    • Are all struct page entries created eagerly?
  • Zones
    • Need to split the kernel address space into zones at all because:
      • Some devices can only perform DMA to some virtual addresses (smaller word size)
      • Some architectures can physically address a lot more memory than they can virtually address (like x86-32)
    • Four zones, defined per-architecture:
      • ZONE_DMA: Pages that can be used for DMA
      • ZONE_DMA32: Pages that can be used for DMA by 32-bit devices
      • ZONE_HIGHMEM: Addresses that can’t be addressed virtually
      • ZONE_NORMAL: Everything else
    • Allocations cannot cross zone boundaries, but zones are not strict
    • Normal allocations can come from ZONE_DMA or ZONE_NORMAL but not both
    • x86-64 doesn’t have a ZONE_HIGHMEM because the entire physical address space is virtually adressable
  • Allocating memory
    • Use functions in gfp.h (get free pages) to allocate memory in page-sized chunks

      • alloc_pages allocates contiguous physical pages
      • Use page_address to calculate the virtual address for a struct page
    • And kmalloc for more granular allocations

      • Flags to control behavior, like disabling sleep (for interrupt handlers), or picking a zone, or even priority levels
      • 600
      • GFP_KERNEL is the most common flag
      • Allocations are contiguous in physical memory
      • kfree to free kernel memory
    • vmalloc allocates memory that is contiguous in virtual memory *

      It does this by allocating potentially noncontiguous chunks of physical memory and “fixing up” the page tables to map the memory into a contiguous chunk of the logical address space.

      • Most kernel code uses kmalloc for performance; vmalloc has the overhead of modifying page table entries
      • vmalloc is used for larger allocations, like loading kernel modules
      • May sleep, can’t be called from interrupt context
      • vfree to free memory
  • Slab Allocator
    • “Generic data structure caching layer”
    • Frequently used data structures are allocated (and freed) often, so cache these centrally
    • These caches are allocated in contiguous physical memory to avoid fragmentation
    • Allocations can be made without locking with per-processor caches
    • The allocator can serve requests from the same NUMA zone
    • The linux slab allocator has one “cache” per object type, including the task_struct and inode structs
      • Each cache is divided into slabs, which is a group of pages contiguous in physical memory
      • Each slab contains a number of objects: instances of the data structure being cached
      • 500
    • Slabs are either full, partial, or empty, (instances of) the kmem_cache structure tracks these separately
    • Flags
      • SLAB_HWCACHE_ALIGN: align objects to cache lines
        • As far as I can tell this is about aligning objects smaller than a cache line to the start of a cache line, and ensuring that no other objects are present on the cache line
        • Not sure why this is important; is reading the first two bytes of a cache line cheaper than reading bytes 3-4? I think not!
        • Or maybe if the first object in a cache line is primarily used by a different core than the second, that creates a lot of HW cache thrashing?
        • This can create fragmentation, only use for high-perf code
      • SLAB_POISON: Fill the slab with a known garbage value to help catch memory access bugs
      • SLAB_RED_ZONE: Insert red zones around the allocated slab to catch buffer overflows
        • Are these just pages with all access perms disabled?A
  • High Memory
    • kmap and kunmap to map high memory (if all addressable memory is used by DMA & normal memory, what region is this mapped to?)
    • kmap_atomic and kunmap_atomic to temporarily map high memory into reserved slots in the addressable region
      • doesn’t sleep, so usable from interrupt context
      • only valid until the next temporary mapping, so this overwrites previous mappings if all slots are full

Chapter 13: VFS

  • All disk IO goes through the VFS, system call implementations only need to work against the VFS interface
  • 550
  • Directories are regular files that point to all the files they contain
  • “dentries” are path components, and are not the same as directories
  • File metadata is stored in an inode (except filename, which is stored in the parent directory)
  • The VFS is implemented in an object-oriented fashion in C, with “objects” and “methods” backed by structs
    • Four object types: superblock, inode, dentry, file
    • Each with an “operations” object that contains methods (function pointers)
  • Each “method” has to be explicitly passed a “this”, because C has no notion of an object
  • The superblock object describes a filesystem (struct super_block), and allows
    • Create/update/delete inodes, reacting to dirtied inodes (write journal entry)
    • Syncing in-memory metadata with the filesystem on disk
    • Mounting/remounting/unmounting
  • An inode object describes a file, and allows
    • Creating hardlinks, symlinks, and unlinking
    • Creating/deleting directories
    • Renaming
    • Truncating files
    • Setting file perms / attrs / extended attrs
  • A dentry object represents a path component, which is either a directory or a file
    • Each dentry is linked to an inode
    • dentries are either used (the inode’s refcount is positive), unused (the inode’s refcount is zero), or negative (the inode has been deleted)
    • dentries are cached to make path traversal fast (which is the whole point, otherwise inodes would be sufficient) via three caches
      • Each inode stores all the dentries that point to it
      • A global doubly linked LRU list of unused and negative dentries
      • A global hash table of dentries mapped by path
    • Negative dentries are kept around so that negative lookups aren’t slow
    • The dentry cache pins the associated inodes in memory, so they’re effectively cached too
  • A file object represents an open file
    • No “dirty” flag, not directly linked to on-disk data
    • A file object is linked to its dentry, which is linked to its inode, which is associated with on-disk data and does have a dirty flag
    • Stores things like the current offset within the file, the access mode, etc.
    • Operations: seek, read, aio_read, write, readdir, mmap, open, flush, fsync, flock
  • Per-process accounting
    • files_struct to keep track of open FDs
    • fs_struct to keep track of the cwd, root dir, etc.
    • namespace if a process is forked with a different namespace
    • Clone with CLONE_FILES or CLONE_FS to share the first two structs with the child
    • Clone with CLONE_NEWNS to not share the namespace struct with the child

Chapter 14: The Block IO Layer

  • Block devices: access fixed-size chunks of data
    • Hard disks, flash memory, CDROMs
    • As opposed to character devices, which use bytestreams, like keyboards
    • NICs are probably their own thing and aren’t block or character devices
    • Block devices support seeking
  • The smallest addressable unit on a block device is a sector
    • Devices can potentially operate on multiple sectors at a time, but not on sub-sectors
    • This is a hardware-specific limitation
  • The smallest addressable unit in software is a block
    • Can be no smaller than a sector, and needs to be an integer multiple of the sector size
    • The Linux kernel requires that the block size is a power of two
    • And that the block size is <= the page size
    • What if the sector size is something weird like 7KB?
  • Each block is fronted by an in-memory structure called a buffer
    • With a buffer_head to store metadata
    • Multiple buffers (potentially) per page
    • Buffers can be dirtied, locked, waiting for allocation or waiting on IO
  • struct bio represents an inflight IO operation for a given device
    • Buffers are split into segments, and segments can be non-contiguous in physical memory (why are buffers not granular enough?)
    • bio_vec instances track the location of each physical segment
    • 500
    • This allows combining disparate memory into single IO operations
    • Better than using buffer_head directly because it doesn’t need to split down to block granularity if not necessary
    • A single bio instance must deal with contiguous disk blocks but not (necessarily) contiguous memory
  • Request queues
    • All pending block IO requests to a device are put in a queue
    • Each request uses one or more bio structs
  • IO scheduling to optimize IO based on seek time/etc.
    • Multiple available schedulers, but generic wrt specific devices
    • In general, schedulers perform sorting (put requests near other requests to the same area) and merging (multiple requests to the same area can be combined)
    • Linus elevator
      • Every request is checked against every pending request for a possible merge
      • If this fails, insert into the queue at an optimal position
      • Possible for requests near the tail of the queue to starve
    • Deadline scheduler
      • It’s more likely that reads block userspace vs. writes, which are usually (big asterisk here!) async
      • Each request is associated with an expiration time, 500ms for reads, 5000ms for writes
      • Three queues, a sorted queue, and a FIFO queue each for reads and writes
      • Entries live on two queues
      • Pull from the sorted queue by default, but favor the FIFO queues if the entries at their respective heads have expired
      • Reads expire much quicker than writes, and so this works better for read-heavy workloads
    • Anticipatory scheduler
      • Deadline scheduler + the “anticipation heuristic”
      • The deadline scheduler prioritizes reads, so a lot of time is wasted servicing individual reads during a write-heavy workload
      • To mitigate this, when a seek (effectively) preempts heavy writing, service the read as before, but don’t seek back to the write location and start writing immediately
      • Instead, wait at the read location in anticipation of more reads (6ms by default)
    • Complete fair queueing IO scheduler
      • Per-process queues, sorting and merging per queue
      • Queues are serviced round-robin
      • Each process receives a fair slice of the available IO bandwidth
    • No-op scheduler
      • No sorting, only merging
      • Intended for block devices that support random access
      • Is this still the best scheduler for SSDs, or is there something better?
  • The complete-fair scheduler is used for block devices by default, overridable with boot-time kernel flags

Chapter 15: The Process Address Space

  • Every process is given a flat 32-bit or 64-bit address space, no segments

  • Threads are just processes that share an address space

  • The address space is grouped into areas that have different access perms (rwx) like:

    • The text section is an area with the current executable’s code, and the data section contains the executable’s global variables
    • A zero page is used to cheaply set uninitialized variables to zero
    • Text, data, and zero page sections for each dynamic/shared lib
    • Memory mapped files, shared memory segments, and/or anonymous malloc’d memory
  • struct mm_struct represents an address space

    • Two data structures representing the address spaces' memory areas: a linked list and a red-black tree
      • The nodes aren’t copied, so effectively this is a red-black tree with a linked list connecting its nodes
      • Allows for easy iteration and searching
    • This struct needs to be copied on fork
    • Kernel threads have no mm_struct because they only need to use kernel memory, which is mapped to every userspace address space
      • So any user address space works just as well for a kernel thread
      • It simply uses the mm_struct of the last executing process
  • struct vm_area_struct represents a memory area

    • Contiguous in virtual memory, and shares properties like rwx permissions
    • Each area can be set to grow downwards or upwards (or both?)
    • Areas can be set to disallow copying on fork
    • Areas can be set to never be swapped out
    • There’s a flag for areas with hugetlb pages
    • VM_SHARED for sharing (shared mapping if set, private mapping if not), VM_SHM for shared memory (IPC)
    • VM_IO for areas that are mapping of a device’s IO space
    • VM_SEQ_READ to hint to the kernel that the process is going to perform sequential reads in the area (prefetching, etc)
      • Set this via the madvise syscall
      • madvise POSIX flags: normal, random, sequential, will need (expect access in the near future), don’t need
      • madvise linux flags:
        • remove (zero out pages & backing store)
        • make available to child processes after forking (or don’t)
        • hw poison (handle subsequent references to this area like a hw memory corruption, used to test memory error-handling code)
        • mergeable/non-mergeable (mark this area as eligible for “Kernel Samepage Merging”, a mechanism to scan memory areas and merge areas with identical content and replace them with a single CoW page)
        • take region offline (also used in memory error-handling code)
        • enable disable huge pages
        • include/exclude area from core dumps
        • free region (area no longer necessary, but ok to wait until memory pressure to actually reclaim these pages, auto-cancel if a write to this area occurs in the interim)
        • wipe on fork
        • deactivate pages (more likely to be reclaimed under memory pressure)
        • pageout (reclaim pages, including flushing dirty pages)
  • columns in /proc/<pid>/maps output are:

    start-end permission offset major:minor inode file
    
    • for example:

      ❯ cat /proc/1942176/maps
      55ce50f8a000-55ce50f8b000 r--p 00000000 103:02 9176058                   /home/tim/dev/mem
      55ce50f8b000-55ce50f8c000 r-xp 00001000 103:02 9176058                   /home/tim/dev/mem
      55ce50f8c000-55ce50f8d000 r--p 00002000 103:02 9176058                   /home/tim/dev/mem
      55ce50f8d000-55ce50f8e000 r--p 00002000 103:02 9176058                   /home/tim/dev/mem
      55ce50f8e000-55ce50f8f000 rw-p 00003000 103:02 9176058                   /home/tim/dev/mem
      55ce51f77000-55ce51f98000 rw-p 00000000 00:00 0                          [heap]
      7f02ca18a000-7f02ca18c000 rw-p 00000000 00:00 0 
      7f02ca18c000-7f02ca1b4000 r--p 00000000 103:02 13372809                  /usr/lib/libc.so.6
      7f02ca1b4000-7f02ca32e000 r-xp 00028000 103:02 13372809                  /usr/lib/libc.so.6
      7f02ca32e000-7f02ca386000 r--p 001a2000 103:02 13372809                  /usr/lib/libc.so.6
      7f02ca386000-7f02ca38a000 r--p 001f9000 103:02 13372809                  /usr/lib/libc.so.6
      7f02ca38a000-7f02ca38c000 rw-p 001fd000 103:02 13372809                  /usr/lib/libc.so.6
      7f02ca38c000-7f02ca39b000 rw-p 00000000 00:00 0 
      7f02ca3c1000-7f02ca3c3000 r--p 00000000 103:02 13372799                  /usr/lib/ld-linux-x86-64.so.2
      7f02ca3c3000-7f02ca3ea000 r-xp 00002000 103:02 13372799                  /usr/lib/ld-linux-x86-64.so.2
      7f02ca3ea000-7f02ca3f5000 r--p 00029000 103:02 13372799                  /usr/lib/ld-linux-x86-64.so.2
      7f02ca3f6000-7f02ca3f8000 r--p 00034000 103:02 13372799                  /usr/lib/ld-linux-x86-64.so.2
      7f02ca3f8000-7f02ca3fa000 rw-p 00036000 103:02 13372799                  /usr/lib/ld-linux-x86-64.so.2
      7ffe1079c000-7ffe107bd000 rw-p 00000000 00:00 0                          [stack]
      7ffe107f1000-7ffe107f5000 r--p 00000000 00:00 0                          [vvar]
      7ffe107f5000-7ffe107f7000 r-xp 00000000 00:00 0                          [vdso]
      ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
      
    • pmap -x includes RSS for each area (how much of this area is fronted by physical memory)

  • Page tables

    • Three-level page table allows sparse address spaces to exist without much overhead
      • Level 1: page global directory
      • Level 2: page middle directory
      • Level 3: page table entries
      • 600
    • TLBs for faster lookups
    • As of 2.6, page tables were copied on fork, but there were plans to CoW page tables too (was this done?)

Chapter 16: The Page Cache and Page Writeback

  • Write to /proc/sys/vm/drop_caches to empty the page cache
  • Write strategies
    • No-write: writes aren’t cached. Writes invalidate cache entries and are applied to disk in-band
    • Write-through: writaes update the cache and the disk in-band
    • Write-back: writes only update the cache, dirty pages are flushed to disk out-of-band
  • The linux page cache is a write-back cache
  • Only clean pages can be evicted, a flush may be forced if not enough clean pages are present (so memory pressure could directly translate to IO pressure)
  • Eviction uses a modified version of LRU
    • Two lists, active and inactive
    • Pages on the inactive list are available for eviction, pages on the active list aren’t
    • Pages first go onto the inactive list, and only move to the active list when they are accessed while on the inactive list
    • Each list is a queue: new pages added on one end, evictions happen on the other
    • If the active list grows too large, elements are “evicted” back onto the inactive list
    • (Sounds like) No real LRU except when moving between the two lists - multiple accesses don’t move the list around in the queue
  • A page cache entry contains of multiple (possibly) noncontiguous disk blocks
    • So indexing can’t be done by block ID
    • The linux page cache wants to cache mmap’ed regions, so inode isn’t sufficient either
    • Use a struct address_space instead (badly named), one instance of which exists for a given file
      • Contains an operations struct so each filesystem type can implement its own interactions with the page cache
      • The page cache is indexed by (address_space, offset into the file in pages)
      • address_space includes a radix tree to find a physical page by file offset
        • this isn’t an array lookup because the page cache doesn’t necessarily cache a file in contiguous physical memory
      • How are address_space instances looked up? By file path?
  • Buffer cache
    • Buffers are in-memory representations of single disk blocks
    • The page cache also caches disk blocks (reads and writes)
    • Unified into the main page cache - this was previously a separate “buffer cache” that duplicated data in the page cache
    • Buffers now simply contain a block→page mapping, which targets a page in the page cache
    • The buffer cache sits between the filesystem and the device driver, and the page cache sits between the VFS and the filesystem
    • Does a write create a dirty page cache page AND a dirty buffer entry, or does the latter only happen when the page cache is flushed?
    • Why do we need a block cache at all?
  • Flushing dirty data
    • When free memory drops below a threshold, when dirty data grows older than a threshold, or on an explicit fsync
    • Flusher threads are woken up either when memory is low, but also on a timer.
    • 500
    • Multiple threads to allow saturating multiple disks (or queues, like for SSDs) in parallel

Chapter 17: Devices & Modules

  • Three devices types: block, character, network
  • Modules
    • C programs that use module_init and module_exit to set up entry/exit points

    • Using headers in linux/

    • Hello world module

      • Write module and Makefile to compile against running kernel

      • This creates a .ko file, compress with xz hello.ko

      • Copy it to the right place: sudo cp hello.ko.xz /usr/lib/modules/5.17.9-arch1-1/extramodules/

      • Reload: sudo depmod

      • Load the module: sudo modprobe hello

      • And unload: sudo modprobe -r hello

      • Look in dmesg for the output

      • Here’s the module itself:

        #include <linux/module.h>
        #include <linux/kernel.h>
        
        MODULE_LICENSE("GPL");
        
        static int hello_init(void) {
            printk("HELLO!\n");
            return 0;
        }
        
        static void hello_cleanup(void) {
            printk("BYE!\n");
        }
        
        module_init(hello_init);
        module_exit(hello_cleanup);
        
      • And the Makefile:

        obj-m += hello.o
        
        all:
        	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
        
        clean:
        	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
        
    • Modules can be loaded conditionally based on config flags, and can also be parameterized

    • Other fun modules incl. the rickroll module

  • sysfs
Edit