cgroups


  • cgroups (control groups) limit, account for, and isolate the resources used by groups of processes
    • resource types
      • blkio: sets limits on input/output access to and from block devices such as physical drives (disk, solid state, or USB).
      • cpu: uses the scheduler to provide cgroup processes access to the CPU.
      • cpuacct: generates automatic reports on CPU resources used by processes in a cgroup.
      • cpuset: assigns individual CPUs (on a multicore system) and memory nodes to processes in a cgroup.
      • devices: allows or denies access to devices by processes in a cgroup.
      • freezer: suspends or resumes processes in a cgroup.
      • memory: sets limits on memory use by processes in a cgroup and generates automatic reports on memory resources used by those processes.
    • cgcreate creates cgroups; cgexec runs a program in a given cgroup (both are libcgroup tools)
    • cgroups are hierarchical, both for limits (a child is also bound by its ancestors' limits) and for permissions
    • all processes in a cgroup share its resources
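For a rough picture of what cgcreate and cgexec amount to on a cgroup v2 hierarchy, here is a minimal Python sketch. It assumes the kernel's v2 file interface (a cgroup is a directory, and a process joins by having its PID written to cgroup.procs). The function names, the root parameter, and the cgroup name in the usage note are made up for illustration, and this is not a description of how libcgroup itself is implemented.

```python
import os
from pathlib import Path

# Assumption: cgroup v2 is mounted here; pass another root to experiment safely.
DEFAULT_ROOT = "/sys/fs/cgroup"


def create_cgroup(name, root=DEFAULT_ROOT):
    """Create a (possibly nested) cgroup directory, similar in spirit to cgcreate."""
    path = Path(root) / name
    path.mkdir(parents=True, exist_ok=True)
    return path


def join_cgroup(cgroup_path, pid=None):
    """Move a process into the cgroup by writing its PID to cgroup.procs."""
    pid = os.getpid() if pid is None else pid
    (Path(cgroup_path) / "cgroup.procs").write_text(f"{pid}\n")
    return pid


def exec_in_cgroup(cgroup_path, argv):
    """Join the cgroup, then replace this process with argv (similar in spirit to cgexec)."""
    join_cgroup(cgroup_path)
    os.execvp(argv[0], argv)
```

Something like exec_in_cgroup(create_cgroup("demo/worker"), ["sleep", "60"]) would need enough privileges to write under the real mount; pointing root at a scratch directory is enough to see which files get touched.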
  • cgroupv1 vs cgroupv2
    • hierarchy
      • v1: /sys/fs/cgroup//<cgroup_name>/<sub_cgroup_name>/

      • v2: /sys/fs/cgroup/<cgroup_name>/<sub_cgroup_name>/

      • v1 required a separate hierarchy per resource type; v2 uses a single hierarchy of cgroups, and the cgroup.subtree_control file enables or disables resource types for a subtree

    • v1 allowed different threads of one process to be in different cgroups; v2 works at the thread-group (process) level instead
    • for writes, v1 charged the memory access that dirtied a page to the correct cgroup, but always charged the later writeback to disk to the root cgroup, so the IO was misattributed
      • v2 fixes this by charging each page to the cgroup that first placed it into the page cache (also covered in Linux Page Cache Mini Book)
      • this wasn’t possible in cgroupv1 because memory and io used different cgroups, so a dirty page could not be traced through to its writeback
  • a process can be in only one cgroup at a time (in v1, one per hierarchy)
  • cgroup.controllers lists the controller types available to a cgroup; cgroup.subtree_control lists (and, when written to, changes) the controller types enabled for its children
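A small sketch of working with these two files, assuming the v2 formats: cgroup.controllers is a space-separated list of names, and cgroup.subtree_control accepts names prefixed with + or -. The helper names are hypothetical.

```python
from pathlib import Path


def available_controllers(cgroup_dir):
    """Return the controller names listed in <cgroup_dir>/cgroup.controllers."""
    return Path(cgroup_dir, "cgroup.controllers").read_text().split()


def subtree_control_request(enable=(), disable=()):
    """Build the string to write to cgroup.subtree_control, e.g. '+memory -io'."""
    return " ".join([f"+{c}" for c in enable] + [f"-{c}" for c in disable])


def enable_for_children(cgroup_dir, controllers):
    """Enable controllers for child cgroups, refusing names the parent lacks."""
    missing = sorted(set(controllers) - set(available_controllers(cgroup_dir)))
    if missing:
        raise ValueError(f"not available in this cgroup: {missing}")
    request = subtree_control_request(enable=controllers)
    Path(cgroup_dir, "cgroup.subtree_control").write_text(request)
    return request
```

Checking against cgroup.controllers first gives a clearer error than the failed write the kernel would otherwise return.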
  • Pages in the page cache are not necessarily evictable, so RSS (which skips the page/buffer/swap caches) is not a reliable measure of a cgroup's memory use
    • memory.current is more accurate because it includes everything charged to the cgroup: page cache, slab pages, socket memory, etc. The trade-off is that the value can be spiky.
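To see what memory.current is made of, one option is to read it alongside memory.stat. This sketch assumes the v2 formats (memory.current is a single byte count; memory.stat is one "key value" pair per line) and picks the anon, file, slab, and sock keys as a rough match for the categories above. The function names are hypothetical.

```python
from pathlib import Path


def read_memory_current(cgroup_dir):
    """Total bytes currently charged to the cgroup."""
    return int(Path(cgroup_dir, "memory.current").read_text())


def read_memory_stat(cgroup_dir):
    """Parse memory.stat into a dict of integer counters."""
    stats = {}
    for line in Path(cgroup_dir, "memory.stat").read_text().splitlines():
        if not line.strip():
            continue
        key, value = line.split()
        stats[key] = int(value)
    return stats


def memory_breakdown(cgroup_dir, keys=("anon", "file", "slab", "sock")):
    """Pick out a few counters; keys missing from memory.stat count as 0."""
    stats = read_memory_stat(cgroup_dir)
    return {key: stats.get(key, 0) for key in keys}
```

The selected counters will not necessarily sum to memory.current, since memory.stat has more categories than the four chosen here.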
  • How do you go about picking good numbers for memory limits?
    • senpai: 200
    • lower memory.high (which causes throttling when exceeded, not OOM) until memory pressure (PSI) starts going up, then back off a bit
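The tuning loop described above could be sketched as follows. This is a simplified illustration, not senpai's actual algorithm: the threshold and step percentages are arbitrary assumptions, and the function names are made up. It assumes the v2 memory.pressure format ("some avg10=... avg60=... avg300=... total=...") and that memory.high holds either a byte count or the string "max".

```python
from pathlib import Path


def parse_pressure(text):
    """Parse a PSI file into {'some': {'avg10': ..., ...}, 'full': {...}}."""
    pressure = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        pressure[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return pressure


def next_memory_high(current_high, some_avg10, threshold=1.0, shrink_pct=5, grow_pct=10):
    """Back off when pressure exceeds the threshold, otherwise keep lowering."""
    if some_avg10 > threshold:
        return current_high * (100 + grow_pct) // 100
    return current_high * (100 - shrink_pct) // 100


def tune_once(cgroup_dir, threshold=1.0):
    """Run one adjustment step against the files in cgroup_dir."""
    cg = Path(cgroup_dir)
    raw_high = (cg / "memory.high").read_text().strip()
    if raw_high == "max":
        # No limit set yet: start from current usage.
        current_high = int((cg / "memory.current").read_text())
    else:
        current_high = int(raw_high)
    pressure = parse_pressure((cg / "memory.pressure").read_text())
    new_high = next_memory_high(current_high, pressure["some"]["avg10"], threshold)
    (cg / "memory.high").write_text(f"{new_high}\n")
    return new_high
```

In practice tune_once would run on a timer, with an interval long enough for the avg10 window to reflect the previous change.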
  • io.latency is a better IO knob than io.max: a fixed io.max cap can limit things too far on an uncontended system, or be made irrelevant by the actual workload
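The difference between the two knobs shows up in what gets written to each file. A minimal sketch, assuming the v2 formats ("MAJ:MIN rbps=... wbps=... riops=... wiops=..." for io.max, and "MAJ:MIN target=<microseconds>" for io.latency); the helper names are hypothetical.

```python
def io_max_line(major, minor, rbps=None, wbps=None, riops=None, wiops=None):
    """Format an io.max entry: absolute caps on bytes/s and IOs/s for one device."""
    limits = {"rbps": rbps, "wbps": wbps, "riops": riops, "wiops": wiops}
    parts = [f"{key}={value}" for key, value in limits.items() if value is not None]
    return " ".join([f"{major}:{minor}"] + parts)


def io_latency_line(major, minor, target_us):
    """Format an io.latency entry: a latency target in microseconds for one device."""
    return f"{major}:{minor} target={target_us}"
```

io.max needs concrete throughput numbers chosen up front, while io.latency only states the latency the cgroup should be protected at, which is why it adapts better to what the device is actually doing.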