Linux Memory Management at Scale

https://www.youtube.com/watch?v=cSJFLBJusVY
Many types of memory from the OS PoV:
- anonymous: no backing store
- mmap’ed files
- file caches & buffers: unified page cache
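A rough way to see these categories on a live system is to group the counters in /proc/meminfo. The sketch below only parses text in that format; the function names and the grouping are illustrative assumptions, not something from the talk. Note that the anon LRU counters also include shmem/tmpfs pages, since those are swap-backed.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {name: number} (most values are in kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_breakdown(info):
    """Group counters by page type (hypothetical grouping, values in kB)."""
    return {
        # Pages with no backing file (includes shmem/tmpfs).
        "anon_lru_kb": info.get("Active(anon)", 0) + info.get("Inactive(anon)", 0),
        # Page cache: cached file data, including mmap'ed file pages.
        "file_lru_kb": info.get("Active(file)", 0) + info.get("Inactive(file)", 0),
        # Pages currently mapped into process address spaces.
        "mapped_kb": info.get("Mapped", 0),
    }
```

On Linux, the input would be the contents of /proc/meminfo, e.g. `memory_breakdown(parse_meminfo(open("/proc/meminfo").read()))`.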
cgroupv2
- Limiting one resource can cause surprising effects
- Limiting memory can cause IO spikes as pages are evicted
- mmap_sem is a per-address-space kernel lock that must be held for virtual memory operations on that address space; if one process holds it for longer (due to a cgroup limit, say), everything else waiting on that lock, including other processes that need to inspect its memory, is delayed too
- ext4 journal entries are flushed to disk in batches, and a batch that contains IO from both high- and low-priority groups will only go as fast as the lower-priority group allows
RSS (resident set size) is not a great measurement because it ignores caches entirely
- cgroupv2 limits all types of memory
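To see what an RSS-only view leaves out, one can split a cgroup's total charge into anonymous memory, file cache, and everything else. This is a minimal sketch that assumes the cgroup v2 memory.current and memory.stat file formats; the helper names and the "other" bucket are my own simplification.

```python
def parse_memory_stat(text):
    """Parse cgroup v2 memory.stat text ("key value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def charge_breakdown(memory_current, stats):
    """Split the total charge (bytes) into anon, file cache, and the remainder."""
    anon = stats.get("anon", 0)
    file_cache = stats.get("file", 0)
    # Remainder: kernel memory, socket buffers, etc. Clamped because the
    # two files are read at slightly different times.
    other = max(memory_current - anon - file_cache, 0)
    return {"anon": anon, "file": file_cache, "other": other}
```

The inputs would come from a cgroup directory under /sys/fs/cgroup; the path of a specific cgroup depends on the system.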
Swap has an (unfairly) bad reputation: it isn’t “emergency RAM” but a backing store for anon pages
- https://bit.ly/whyswap
- Anonymous pages can’t be evicted if there’s no swap, so other (possibly hotter) pages are evicted instead
- More gradual path to running out of memory; without swap it’s a lot more binary
- mlock locks pages in RAM and prevents them from being swapped out
“Memory access information is hidden behind the MMU”
- Is this only true for TLB hits or in general?
- Linux has no idea when you’re about to run out of memory, only when it tries to reclaim a page and fails
- The OS knows how many pages are resident in memory, but not how many could be reclaimed right away if necessary
- Linux could track this but the overhead would make it too expensive
OOM killer
- Reactive to reclaim failure, invoked too late
- Kills the wrong thing
Reclamation
- kswapd reclaim: background kernel thread that frees memory when the number of free pages drops below a threshold (watermark)
- direct reclaim: initiated when a process requests memory and no pages are available (the allocating process is blocked while it reclaims pages itself)
- Overhead: dirty pages need to be flushed, anonymous pages have to be written to swap, etc.
What metric best approximates memory usage?
- free memory: not good enough, because we want to use as much of the available memory as possible, and a lot of the pages counted as used could be reclaimed
- free memory - buff/cache: not all buffers/caches are reclaimable, so this could swing too low
- page scan rate: the rate at which the OS scans pages looking for reclaimable ones; a high rate means the OS has to scan quite far down its LRU lists to find reclaimable pages
  - hard to disambiguate oversubscription vs. efficient memory use
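The page scan rate can be approximated by sampling /proc/vmstat twice and dividing the counter deltas by the interval. A minimal sketch, assuming the pgscan_kswapd and pgscan_direct counters are present (their exact set varies by kernel version); the function names are illustrative.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style text ("name value" per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def scan_rates(before, after, interval_s):
    """Pages scanned per second by kswapd and by direct reclaim."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates = {}
    for key in ("pgscan_kswapd", "pgscan_direct"):
        rates[key] = (after.get(key, 0) - before.get(key, 0)) / interval_s
    return rates
```

Separating the two counters helps a little with interpretation: background scanning by kswapd is routine, while a sustained direct reclaim scan rate means allocating processes are stalling. It still does not tell oversubscription apart from efficient memory use on its own.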
PSI (pressure stall information) metrics
- kernel/cgroupv2 feature to measure the % of time threads in a cgroup are blocked due to a lack of memory
- counted stalls include refaulting (faulting a recently evicted page back in), waiting for a kernel memory lock, waiting for a reclaim, waiting for a dirty page flush, etc.
- this is exposed as memory.pressure, and it approximates well the % speedup you’re likely to see if you had more memory
- https://facebookmicrosites.github.io/cgroup2/docs/pressure-metrics.html
- https://facebookmicrosites.github.io/psi/docs/overview.html#pressure-metric-definitions
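A pressure file has one line per stall type ("some" and "full"), each with running averages over 10, 60, and 300 seconds plus a cumulative stall time in microseconds. A minimal parser sketch for that format; the function name is my own.

```python
def parse_pressure(text):
    """Parse PSI text, e.g. the contents of a cgroup's memory.pressure file.

    Returns {"some": {"avg10": float, ..., "total": int}, "full": {...}}.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        values = {}
        for pair in pairs:
            name, _, raw = pair.partition("=")
            # "total" is cumulative stall time in microseconds; the avg*
            # fields are percentages of wall-clock time.
            values[name] = int(raw) if name == "total" else float(raw)
        result[kind] = values
    return result
```

For example, `parse_pressure(text)["full"]["avg10"]` is the share of the last 10 seconds in which all non-idle tasks were stalled on memory at once.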
oomd
- userspace OOM killer with more granular priority definitions
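The idea of a pressure-driven, priority-aware kill policy can be sketched as below. This is not oomd's actual logic or configuration: the `CgroupSample` type, the `priority` field, and the 20% threshold are all hypothetical, chosen only to illustrate the approach.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CgroupSample:
    name: str
    priority: int        # hypothetical: higher means more important
    full_avg10: float    # "full" memory pressure over the last 10s, in %
    memory_current: int  # bytes charged to the cgroup


def pick_victim(samples: Sequence[CgroupSample],
                pressure_threshold: float = 20.0) -> Optional[CgroupSample]:
    """Return the cgroup to kill, or None if no cgroup is under enough pressure."""
    if not samples or max(s.full_avg10 for s in samples) < pressure_threshold:
        return None
    # Act on pressure rather than on reclaim failure, and prefer the least
    # important cgroup, breaking ties by largest memory usage.
    return min(samples, key=lambda s: (s.priority, -s.memory_current))
```

The point of the sketch is that the victim need not be the cgroup that is suffering: a high-priority service under pressure leads to a low-priority batch job being killed.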
Prefer protection (lower bounds for memory) over limits (upper bounds for memory)
- Limits are difficult to predefine, sometimes heavily specific to workload
- No easy setting for variable/multimodal workloads
- While a cgroup’s memory usage stays below its lower bound, no reclamation occurs for threads in that cgroup
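In cgroup v2, a lower bound is set through memory.low (best effort) or memory.min (hard), and an upper bound through memory.max. The sketch below writes a protection with no hard limit; the helper names are assumptions, and the cgroup directory is supplied by the caller.

```python
import os


def protection_settings(low_bytes, max_bytes=None):
    """Build cgroup v2 memory settings: a protected floor, optionally a ceiling."""
    if low_bytes < 0:
        raise ValueError("low_bytes must be non-negative")
    return {
        "memory.low": str(low_bytes),
        # The literal string "max" means "no limit" in cgroup v2.
        "memory.max": "max" if max_bytes is None else str(max_bytes),
    }


def apply_settings(cgroup_dir, settings):
    """Write each setting to its interface file inside cgroup_dir."""
    for filename, value in settings.items():
        with open(os.path.join(cgroup_dir, filename), "w") as f:
            f.write(value + "\n")
```

For instance, `apply_settings(cgroup_dir, protection_settings(2 * 1024**3))` asks the kernel to avoid reclaiming the first 2 GiB of that cgroup while leaving it free to grow beyond that when memory is available.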
Kernel tunables
- Writeback throttling: flushing dirty pages can’t be stopped once started, so don’t invoke it as often under contention
- vm.swappiness
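Both tunables can be inspected through the filesystem: sysctls such as vm.swappiness live under /proc/sys, and the writeback throttling latency target is a per-device file, wbt_lat_usec, under /sys/block. A minimal read-only sketch; the helper names are assumptions, and the device name is whatever block device the system has.

```python
import posixpath


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file, e.g. vm.swappiness."""
    return posixpath.join(root, *name.split("."))


def wbt_latency_path(device, root="/sys/block"):
    """Path of the writeback throttling latency target for a block device."""
    return posixpath.join(root, device, "queue", "wbt_lat_usec")


def read_int(path):
    """Read a single integer value from a tunable file."""
    with open(path) as f:
        return int(f.read().strip())
```

On Linux, `read_int(sysctl_path("vm.swappiness"))` returns the current swappiness value.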
https://facebookmicrosites.github.io/cgroup2/docs/fbtax-results.htm