Linux Memory Management at Scale

  • Many types of memory from the OS PoV:
    • anonymous: no backing store
    • mmap’ed files
    • file caches & buffers: unified page cache
  • cgroupv2
    • Limiting one resource can cause surprising effects
    • Limiting memory can cause IO spikes as pages are evicted
    • mmap_sem is a kernel lock protecting a process's address space; it must be held for virtual memory operations on that address space, so if one holder is slowed down (due to a cgroup limit, say), every other task waiting on the same lock is delayed too
    • ext4 journal entries are flushed to disk in batches, and a batch that contains IO from both high- and low-priority groups will only go as fast as the lower-priority group allows
  • RSS (resident set size) is not a great measurement because it ignores caches entirely
    • cgroupv2 limits all types of memory
  • Swap has an (unfairly) bad reputation: it isn’t “emergency RAM” but a backing store for anon pages
    • https://bit.ly/whyswap
    • Anonymous pages can’t be evicted if there’s no swap, so other (possibly hotter) pages are evicted instead
    • Swap gives a more gradual path to running out of memory; without swap it’s a lot more binary
    • mlock locks pages in RAM and prevents them from being swapped out
  • “Memory access information is hidden behind the MMU”
    • Is this only true for TLB hits or in general?
    • Linux has no idea when you’re about to run out of memory, only when it tries to reclaim a page and fails
    • The OS knows how many pages are resident in memory, but not how many could easily be reclaimed if necessary, right away
    • Linux could track this but the overhead would make it too expensive
  • OOM killer
    • Reactive to reclaim failure, invoked too late
    • Kills the wrong thing
  • Reclamation
    • kswapd reclaim: background kernel thread that frees memory when the number of free pages drops below a threshold
    • direct reclaim: initiated when a process requests memory and no pages are available (the process stalls until reclaim finds it a page)
    • Overhead: dirty pages need to be flushed, anonymous pages have to be written to swap, etc.
  • What metric best approximates memory usage?
    • free memory: not good enough, because we want to use as much memory as is available (low free memory is normal), and many of the pages in use could be reclaimed
    • free memory - buff/cache: not all buffers/caches are reclaimable, so this could swing too low
    • page scan rate: the rate at which the OS is scanning pages to find reclaimable ones; a high rate means the OS has to scan a long way through its page lists before it finds something reclaimable
      • hard to disambiguate oversubscription vs. efficient memory use
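
A minimal sketch of how the three candidate metrics above could be computed from kernel counters. It assumes the `MemTotal`, `MemFree`, `Buffers` and `Cached` fields of /proc/meminfo and the `pgscan_kswapd` / `pgscan_direct` counters that recent kernels expose in /proc/vmstat; the function names are made up for illustration, and "used minus buff/cache" is only one reading of the second bullet.

```python
def parse_counters(text):
    """Parse 'Key: 123 kB' (/proc/meminfo) or 'key 123' (/proc/vmstat) lines."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            counters[fields[0].rstrip(":")] = int(fields[1])
        except ValueError:
            continue  # skip lines whose second field is not a number
    return counters


def memory_views(meminfo):
    """Return two candidate 'how full is memory' figures, in kB."""
    cache = meminfo["Buffers"] + meminfo["Cached"]
    used = meminfo["MemTotal"] - meminfo["MemFree"]
    return {
        "free_kb": meminfo["MemFree"],
        # Treats every cached page as free, which is too optimistic.
        "used_minus_cache_kb": used - cache,
    }


def scan_rate(prev, curr, seconds):
    """Pages scanned per second between two /proc/vmstat samples."""
    keys = ("pgscan_kswapd", "pgscan_direct")  # assumed counter names
    scanned = sum(curr.get(k, 0) - prev.get(k, 0) for k in keys)
    return scanned / seconds
```

On a Linux host the inputs would come from `open("/proc/meminfo").read()` and two reads of /proc/vmstat taken some seconds apart. None of the three numbers separates "busy but healthy" from "oversubscribed", which is the point of the bullets above.
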
  • psi metrics
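
For reference, PSI (pressure stall information) is exposed as text in /proc/pressure/memory, and per cgroup in the cgroup v2 `memory.pressure` file, with one `some` line and one `full` line. The parser below is a small sketch of reading that format; `parse_psi` is a made-up helper name, not a kernel or library API.

```python
def parse_psi(text):
    """Parse PSI text such as
    'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'
    into {'some': {'avg10': 0.0, ...}, 'full': {...}}.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        # avgN values are percentages of wall time spent stalled;
        # total is cumulative stall time in microseconds.
        result[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return result
```

`some` counts time in which at least one task was stalled on memory, `full` counts time in which all non-idle tasks were stalled at once, so a rising `full avg10` is the stronger signal of real memory contention.
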
  • oomd
    • userspace OOM killer with more granular priority definitions
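
To make "more granular priority definitions" concrete, here is a purely illustrative sketch of a userspace kill decision driven by memory pressure. It is not oomd's actual configuration format or algorithm: the `Group` type, the `priority` field, the 10% threshold and the tie-breaking rule are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    priority: int      # hypothetical: higher means more important
    usage_bytes: int   # current memory usage of the cgroup


def pick_victim(groups, full_avg10, threshold=10.0):
    """Return the group to kill, or None if pressure is acceptable.

    Acts only when the 'full' memory pressure (avg10, in percent)
    reaches the threshold, then picks the least important group,
    breaking ties in favour of the largest memory user.
    """
    if full_avg10 < threshold or not groups:
        return None
    return min(groups, key=lambda g: (g.priority, -g.usage_bytes))
```

Unlike the kernel OOM killer, a policy of this shape can act before reclaim fails outright and can be told which workloads are expendable.
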
  • Prefer protection (lower bounds for memory) over limits (upper bounds for memory)
    • Limits are difficult to predefine, sometimes heavily specific to workload
    • No easy setting for variable/multimodal workloads
    • If memory usage dips below the lower bound, no reclamation occurs for processes in that cgroup
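
In cgroup v2 the protection bounds are set through the `memory.low` (and stricter `memory.min`) files, while limits use `memory.high` and `memory.max`. The sketch below shows writing a protection value; the helper names are invented for illustration, and `demo` uses a temporary directory in place of a real cgroup so it can run without root.

```python
import tempfile
from pathlib import Path


def set_memory_low(cgroup_dir, low_bytes):
    """Write a protection (lower bound) to a cgroup's memory.low file.

    cgroup_dir is the cgroup's directory, typically somewhere under
    /sys/fs/cgroup on a host with cgroup v2 mounted there (assumption).
    """
    path = Path(cgroup_dir) / "memory.low"
    path.write_text(f"{low_bytes}\n")
    return path


def demo():
    """Exercise set_memory_low against a scratch directory."""
    with tempfile.TemporaryDirectory() as scratch:
        path = set_memory_low(scratch, 1 << 30)  # protect 1 GiB
        return path.read_text()
```

On a real system the call would look like `set_memory_low("/sys/fs/cgroup/<group>", 1 << 30)`, where `<group>` stands for whatever cgroup holds the workload to protect.
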
  • Kernel tunables
    • Writeback throttling: flushing dirty pages can’t be stopped once started, so don’t invoke it as often under contention
    • vm.swappiness
  • https://facebookmicrosites.github.io/cgroup2/docs/fbtax-results.htm