Limiting one resource can cause surprising effects
Limiting memory can cause IO spikes as pages are evicted
mmap_sem is a per-process kernel lock that must be held for operations on that process's address space; if one thread holds it for a long time (because a cgroup limit stalls it mid-operation, say), every other thread waiting to touch that address space is delayed too
ext4 journal entries are flushed to disk in batches, and a batch containing IO from both high- and low-priority groups will only complete as fast as the lowest-priority writer allows
RSS (resident set size) is not a great measurement because it ignores caches entirely
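One way to see what RSS does and doesn't count: a small sketch (Linux-specific; `rss_kb` is an illustrative helper, not a real API) that reads VmRSS from /proc/self/status. Page-cache pages the process benefits from never show up in this number.

```python
import os

def rss_kb():
    """Return this process's resident set size in kB, or None off-Linux."""
    if not os.path.exists("/proc/self/status"):
        return None
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # units are kB
    return None

print(rss_kb())
```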
cgroupv2 limits all types of memory
Swap has an (unfairly) bad reputation; it isn’t “emergency RAM” but a backing store for anonymous pages
Anonymous pages can’t be evicted if there’s no swap, so other (possibly hotter) pages are evicted instead
Swap gives a more gradual path to running out of memory; without it, the failure mode is much more binary
mlock locks pages in RAM and prevents them from being swapped out
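A minimal sketch of mlock(2) via ctypes, assuming Linux with glibc: pin one page of a buffer so it can't be swapped out, then release it. On Linux, mlock rounds the address down to a page boundary itself; the call can fail (ENOMEM/EPERM) if RLIMIT_MEMLOCK would be exceeded, so the sketch checks the return value rather than assuming success.

```python
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6", use_errno=True)
# Declare argument types so the 64-bit address isn't truncated to a C int.
libc.mlock.argtypes = libc.munlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

buf = ctypes.create_string_buffer(4096)   # one page of data to pin
addr = ctypes.addressof(buf)

ret = libc.mlock(addr, len(buf))          # 0 on success, -1 on failure
if ret == 0:
    libc.munlock(addr, len(buf))          # release the pin when done
else:
    print("mlock failed, errno", ctypes.get_errno())
```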
“Memory access information is hidden behind the MMU”
Is this only true for TLB hits or in general?
Linux has no idea when you’re about to run out of memory; it only finds out when it tries to reclaim a page and fails
The OS knows how many pages are resident in memory, but not how many of them could be reclaimed quickly if needed
Linux could track this but the overhead would make it too expensive
OOM killer
Reactive to reclaim failure, invoked too late
Kills the wrong thing
Reclamation
kswapd reclaim: background kernel thread that frees memory when the number of free pages drops below a watermark
direct reclaim: initiated when a process requests memory and no free pages are available (the requesting process is blocked while reclaim runs)
Overhead: dirty pages need to be flushed, anonymous pages have to be written to swap, etc.
What metric best approximates memory usage?
free memory: not good enough, because we want to use as much memory as is available; free will look low even when many of the in-use pages are easily reclaimable
free memory - buff/cache: not all buffers/caches are reclaimable, so this could swing too low
page scan rate: the rate at which the OS is scanning pages to find reclaimable ones; a high rate means the OS is having to scan far down the LRU lists to find something to reclaim
hard to disambiguate oversubscription vs. efficient memory use
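To make the first two metrics concrete, here's a sketch that parses a /proc/meminfo-style snippet (the sample values are made up) and compares raw free memory against the optimistic estimate that treats all buffers/caches as reclaimable:

```python
# Sample in /proc/meminfo format; the numbers are fabricated (values in kB).
sample = """\
MemTotal:       16384000 kB
MemFree:          512000 kB
Buffers:          256000 kB
Cached:          4096000 kB
"""

meminfo = {}
for line in sample.splitlines():
    key, rest = line.split(":")
    meminfo[key] = int(rest.split()[0])  # drop the "kB" suffix

free_kb = meminfo["MemFree"]
# Optimistic estimate: count all buffers/caches as free-able. In reality
# not all of them are reclaimable (dirty pages, tmpfs, ...), so this can
# overshoot, just as raw "free" undershoots.
optimistic_kb = free_kb + meminfo["Buffers"] + meminfo["Cached"]

print(free_kb, optimistic_kb)
```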
psi metrics
kernel/cgroupv2 feature to measure the % of time threads in a cgroup are blocked due to a lack of memory
time spent refaulting (reading back a page that was only recently evicted), waiting on a kernel memory lock, waiting for reclaim, waiting for dirty page flushes, etc.
this is exposed as memory.pressure, and approximates well the % speedup you’d likely see if you had more memory
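The memory.pressure file (and /proc/pressure/memory) uses a simple line format that's easy to parse. A sketch, with fabricated sample values; "some" means at least one task was stalled on memory during the averaging window, "full" means all non-idle tasks were:

```python
# Sample in the PSI file format; the values below are made up.
sample = """\
some avg10=1.23 avg60=0.80 avg300=0.40 total=123456
full avg10=0.50 avg60=0.30 avg300=0.10 total=45678
"""

pressure = {}
for line in sample.splitlines():
    kind, *fields = line.split()
    pressure[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}

# avg10 is the % of the last 10 seconds spent stalled; total is the
# cumulative stall time in microseconds.
print(pressure["some"]["avg10"])   # → 1.23
```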