cgroups
- https://facebookmicrosites.github.io/cgroup2
- LISA21 - 5 Years of Cgroup v2: The Future of Linux Resource Control: https://www.youtube.com/watch?v=kPMZYoRxtmg
- Linux Page Cache Mini Book
- cgroups provide isolated resources
- resource types
- blkio — this subsystem sets limits on input/output access to and from block devices such as physical drives (disk, solid state, or USB).
- cpu — this subsystem uses the scheduler to provide cgroup processes access to the CPU.
- cpuacct — this subsystem generates automatic reports on CPU resources used by processes in a cgroup.
- cpuset — this subsystem assigns individual CPUs (on a multicore system) and memory nodes to processes in a cgroup.
- devices — this subsystem allows or denies access to devices by processes in a cgroup.
- freezer — this subsystem suspends or resumes processes in a cgroup.
- memory — this subsystem sets limits on memory use by processes in a cgroup and generates automatic reports on memory resources used by those processes.
- `cgcreate` to create cgroups, `cgexec` to run a program in a cgroup
- hierarchical, both for limits and permissions
- all processes in a cgroup share its resources
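
The create-and-run steps above can be sketched directly against the cgroup filesystem. This is a minimal illustration, assuming a cgroup v2 mount at `/sys/fs/cgroup` and enough privileges to write there; it is not how `cgcreate` or `cgexec` are implemented, and the helper names (`create_cgroup`, `add_process`, `run_in_cgroup`) are hypothetical.

```python
import os
from pathlib import Path
from typing import Sequence

# Assumption: cgroup v2 is mounted at the conventional location.
CGROUP_ROOT = Path("/sys/fs/cgroup")


def create_cgroup(name: str, root: Path = CGROUP_ROOT) -> Path:
    """Create a cgroup directory; a nested name like "a/b" gives a sub-cgroup."""
    path = root / name
    path.mkdir(parents=True, exist_ok=True)
    return path


def add_process(cgroup: Path, pid: int) -> None:
    """Move a process into the cgroup by writing its PID to cgroup.procs."""
    (cgroup / "cgroup.procs").write_text(f"{pid}\n")


def run_in_cgroup(cgroup: Path, argv: Sequence[str]) -> None:
    """Join the cgroup, then replace the current process with the program."""
    add_process(cgroup, os.getpid())
    os.execvp(argv[0], list(argv))
```

For example, `run_in_cgroup(create_cgroup("demo"), ["sleep", "60"])` would start `sleep` inside a cgroup named `demo` (a made-up name), provided the caller may write to the mount.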
- cgroupv1 vs cgroupv2
- hierarchy
- v1: /sys/fs/cgroup/ /<cgroup_name>/<sub_cgroup_name>/
- v2: /sys/fs/cgroup/<cgroup_name>/<sub_cgroup_name>/
- v1 required a hierarchy per resource (for example, the memory hierarchy is conventionally mounted under /sys/fs/cgroup/memory/); v2 uses a single unified hierarchy of cgroups, and uses the `cgroup.subtree_control` file to enable/disable resource types for a subtree
- v1 allowed threads of the same process to be in different cgroups; v2 works at the thread-group (process) level instead
- for writes, v1 charged the correct cgroup for the memory access to dirty a page, but the root cgroup was always charged for the writeback to disk, which is obviously incorrect
- v2 fixes this by having each page be charged to the cgroup that first placed it into the page cache (also covered in Linux Page Cache Mini Book)
- This wasn’t possible in cgroupv1: memory and io were tracked in separate hierarchies, so a dirtied page could not be followed through to its writeback under the same cgroup
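
The `subtree_control` mechanism mentioned above can be sketched as follows. This is an illustrative sketch under assumptions: the function names are hypothetical, and the `+name` syntax for enabling a controller follows the kernel's cgroup v2 documentation. On a real system the write also requires suitable permissions, and the kernel may reject it for other reasons (for instance when the cgroup has processes attached directly).

```python
from pathlib import Path
from typing import Iterable, List


def available_controllers(cgroup: Path) -> List[str]:
    """Controllers this cgroup can pass down, as listed in cgroup.controllers."""
    return (cgroup / "cgroup.controllers").read_text().split()


def enable_for_children(cgroup: Path, wanted: Iterable[str]) -> List[str]:
    """Enable the wanted controllers for child cgroups; return those requested."""
    available = set(available_controllers(cgroup))
    usable = [name for name in wanted if name in available]
    if usable:
        # Space-separated names prefixed with "+" enable a controller
        # ("-" would disable it), e.g. "+memory +io".
        request = " ".join(f"+{name}" for name in usable)
        (cgroup / "cgroup.subtree_control").write_text(request + "\n")
    return usable
```

Controllers that are not listed in `cgroup.controllers` are skipped rather than requested, since the kernel would refuse them anyway.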
- hierarchy
- a process can be in one cgroup at a time
- `cgroup.controllers` to list available controller types, `cgroup.subtree_control` to list (and modify) available controller types for children
- Pages in the page cache are not necessarily evictable, so RSS (which skips the page/buffer/swap caches) isn’t really useful as a metric
- `memory.current` for a cgroup is more accurate, and includes everything: page cache, slab pages, socket memory, etc. The reading can be spiky, though.
- How do you go about picking good numbers for memory limits?
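
As a starting point for looking at these numbers, here is a minimal sketch that reads `memory.current` and splits out part of `memory.stat` for one cgroup. The function names are hypothetical, and the keys shown (`anon`, `file`, `slab`, `sock`) are an assumption about what the running kernel reports; the set of keys varies between kernel versions, so missing keys default to 0.

```python
from pathlib import Path
from typing import Dict, Sequence


def read_memory_current(cgroup: Path) -> int:
    """Bytes currently charged to the cgroup and its descendants."""
    return int((cgroup / "memory.current").read_text())


def read_memory_stat(cgroup: Path) -> Dict[str, int]:
    """Parse memory.stat, a flat file of "<key> <value>" lines."""
    stats = {}
    for line in (cgroup / "memory.stat").read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        key, value = line.split()
        stats[key] = int(value)
    return stats


def breakdown(cgroup: Path,
              keys: Sequence[str] = ("anon", "file", "slab", "sock")) -> Dict[str, int]:
    """Pick a few memory.stat counters; keys absent on this kernel become 0."""
    stats = read_memory_stat(cgroup)
    return {key: stats.get(key, 0) for key in keys}
```

Sampling `read_memory_current` repeatedly over time, rather than once, is one way to see how spiky the value is for a given workload.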
- `io.latency` is a better IO knob than `io.max`, which can limit things too far on an uncontended system, or be made irrelevant based on the actual workload
- `io.latency`:
> Quality of service mechanism to guarantee a cgroup’s level of IO completion latency. Specifies the number of milliseconds a process can wait before IO from other processes is given to it. If the average completion latency is longer than the target set here, other processes are throttled to provide more IO, effectively prioritizing the job with the lowest `io.latency` setting.
- Only sibling cgroups compete for IO via `io.latency`
- Can cause thrashing when multiple important workloads are on the same machine
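
To make the two knobs concrete, the sketch below builds the lines that would be written to `io.latency` and `io.max`. It assumes the formats given in the kernel's cgroup v2 documentation: `MAJOR:MINOR target=<microseconds>` for `io.latency` (so the helper converts from milliseconds), and `MAJOR:MINOR` followed by any of `rbps`, `wbps`, `riops`, `wiops` for `io.max`. The function names and the device number `8:0` used in the usage note are illustrative only.

```python
from pathlib import Path

IO_MAX_KEYS = ("rbps", "wbps", "riops", "wiops")


def io_latency_line(major: int, minor: int, target_ms: float) -> str:
    """Build an io.latency entry, converting the target to microseconds."""
    target_us = int(round(target_ms * 1000))
    return f"{major}:{minor} target={target_us}"


def io_max_line(major: int, minor: int, **limits) -> str:
    """Build an io.max entry; each limit is a number or the string "max"."""
    unknown = set(limits) - set(IO_MAX_KEYS)
    if unknown:
        raise ValueError(f"unknown io.max keys: {sorted(unknown)}")
    # Emit keys in a fixed order so the output is predictable.
    parts = [f"{key}={limits[key]}" for key in IO_MAX_KEYS if key in limits]
    return " ".join([f"{major}:{minor}"] + parts)


def set_io_latency(cgroup: Path, major: int, minor: int, target_ms: float) -> None:
    """Write the latency target for one block device into the cgroup."""
    line = io_latency_line(major, minor, target_ms)
    (cgroup / "io.latency").write_text(line + "\n")
```

For instance, `io_latency_line(8, 0, 10)` yields `8:0 target=10000`, and `io_max_line(8, 0, rbps=1048576, wiops="max")` yields `8:0 rbps=1048576 wiops=max`.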