Hastening process cleanup with process_mrelease()

https://lwn.net/Articles/864184/

  • During high memory contention, the kernel might kill a process to make some memory available
  • The killed process is responsible for cleaning itself up, though, which could be slow, or in some cases blocked:

If, however, the killed process finds itself blocked in an uninterruptible sleep, that cleanup work could be delayed indefinitely. There are other factors that can slow down the freeing of memory, including how busy the relevant CPU is and whether that CPU is running in a slow, low-power state.

  • The OOM reaper was created in 2015 as a solution to this problem; it is a kernel thread that frees the killed process’s memory without waiting for the process to do it.

  • The big issue with the OOM killer in general is that it’s fairly arbitrary about picking the process that should be killed. Priority can be set with /proc/<pid>/oom_score_adj (or the older, deprecated /proc/<pid>/oom_adj), but this is presumably not good enough.

  • Facebook’s oomd is a userspace OOM killer that allows for this sort of customization before the kernel OOM killer ever kicks in.

  • Userspace OOM killers can’t use the OOM reaper though, and we’re back in the situation of waiting for the killed process to clean its memory up.

  • This new syscall has been proposed to fix this problem:

    int process_mrelease(int pidfd, unsigned int flags);
    
  • This call invokes the OOM reaper on the process represented by pidfd

  • Reaping isn’t performed by a kernel thread though; it happens in the context of the process that invokes this syscall (not necessarily the process that killed the target).

  • An alternative that was discussed with an earlier attempt to solve this problem was to just unconditionally reap the memory of a process when it is killed, without requiring a separate system call to make that happen. In that case, though, the work would be done in the context of the process sending the signal, which might not be welcome. A process that kills a lot of other ones — a killall command, for example — could be significantly slowed if that policy were to be adopted. Adding a separate system call gives user space more control over when and how that work is done.

  • From HN: earlyoom is an alternative to oomd
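
A rough sketch of how a userspace OOM killer might use the call described above: kill the target through its pidfd, then ask the kernel to reap its memory in the caller's context. The helper names open_pidfd() and kill_and_reap() are made up for these notes. This assumes a kernel that provides process_mrelease() (it was merged for Linux 5.15); the fallback syscall numbers are the x86-64/arm64 values and may differ on other architectures. The call is expected to fail unless the target already has a SIGKILL pending, hence the ordering.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older libc headers. Assumption: these are the
 * x86-64/arm64 numbers; check your architecture's syscall table. */
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_mrelease
#define SYS_process_mrelease 448
#endif

/* Hypothetical helper: get a pidfd for an existing process.
 * Returns the fd, or -1 with errno set. */
static int open_pidfd(pid_t pid)
{
    return (int)syscall(SYS_pidfd_open, pid, 0);
}

/* Hypothetical helper: SIGKILL the process behind pidfd, then reap its
 * memory in the context of the calling thread.
 * Returns 0 on success, -1 with errno set on failure. */
static int kill_and_reap(int pidfd)
{
    /* The target must already be dying before it can be reaped. */
    if (syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) < 0)
        return -1;

    /* flags must currently be 0. A failure here is not necessarily a
     * problem: the victim may simply have exited on its own already. */
    if (syscall(SYS_process_mrelease, pidfd, 0) < 0)
        return -1;

    return 0;
}
```

Usage would be something like `int fd = open_pidfd(pid);` followed by `kill_and_reap(fd)` and `close(fd)`. Since the reaping runs in the caller, a real daemon might hand it to a dedicated thread so the monitoring loop isn't stalled.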
