Prevent system freeze/unresponsiveness due to swapping caused by runaway memory usage

If a process demands a lot of memory, the system moves all other processes to swap, including, it seems, necessary processes like the X11 server and the terminal.

So if a process keeps allocating without limit, everything becomes unresponsive until that process is killed by the OOM killer. My laptop seems to be especially sensitive and reacts extremely badly. I just spent an ENTIRE HOUR waiting for the process to be terminated, during which not even the mouse cursor could be moved.

How can this be avoided?

1) Disable the swap => I often start a lot of processes that then become inactive. The inactive ones should be moved to the swap.

2) Get an SSD => too expensive

3) set a maximum memory ulimit => but then it fails in cases where a program needs a reasonably large amount of memory. The problem is not that it uses too much, but that it crowds out the other processes

4) keep important programs (X11, bash, kill, top, ...) in memory and never swap those => can this be done? How? Perhaps only swap large programs?

5) ?


TL;DR

Short answer

  • Easiest: Use a newer Linux release with systemd-oomd, which became part of systemd in the second half of 2020.
    • Otherwise, have a smaller swap partition and avoid the kernel trying to live up to the lie that there is no memory limit by running processes from slow storage instead of RAM.
    • With a big swap, the traditional kernel-space OOM (out-of-memory) killer didn't take action soon enough. Typically, it accounts according to virtual memory and, in my past experience (with Ubuntu 16.04), it didn't kill things until the entire swap got filled up. Hence the thrashing and crawling system...
  • Need a big swap for hibernate and stuck with an older version of systemd without systemd-oomd baked in?
    • To test: try adding oomd separately on systems with cgroup2 and PSI available (e.g. Ubuntu 20.04, which lacks systemd-oomd). oomd is "a userspace out-of-memory killer" that looks at physical memory pressure via the newer Pressure Stall Information (PSI) metrics. Hence it should act more quickly than the kernel OOM killer, before the system is overextended into extreme swap usage.
  • Stuck on something old without cgroup2?
    • Attempted/problematic: Set some ulimits (e.g. check ulimit -v, and maybe set a hard or soft limit using the as option in limits.conf). This used to work well enough, but thanks to WebKit introducing gigacage, many gnome apps now expect unlimited address spaces and fail to run!
    • Attempted/problematic: The overcommit policy and ratio are another way to try to manage and mitigate this (e.g. sysctl vm.overcommit_memory, sysctl vm.overcommit_ratio), but this approach didn't work out for me.
    • Difficult/complicated: Try applying a cgroup priority to the most important processes (e.g. ssh), but this currently seems cumbersome with cgroup v1 (cgroup v2 is meant to make it easier)...

I also found:

  • Another stack exchange entry that corroborates the above advice for smaller swap spaces.
  • Other than systemd-oomd, you could try something like thrash-protect as an alternative.

Long answer and background

By 2021, several Linux kernel improvements (cgroups v2, PSI, better page writeback throttling), as well as systemd-oomd, have made their way into newer distro releases.

Note that many of the cgroup resource control improvements, and the systemd support for them, are not leveraged by default; for desktop use cases they require manual end-user tuning.

Traditional Linux distro companies with full-time dev teams, e.g. Red Hat (CentOS/Fedora), SUSE, or Ubuntu (Debian), likely treat server-side use cases as a higher priority than desktop responsiveness. Perhaps the Android ecosystem is a more likely place to pull Linux towards the responsiveness needs of end-user computing devices, but it won't help much given that the init system used on Android is not systemd, so beyond the kernel there's not much shared benefit. Also, the kernel packaged for Android is likely configured and built differently.

Some patches of interest:

  • Make background writeback not suck (background writeback used to be an issue)
    • The block: hook up writeback throttling patch helps avoid IO contention between background and foreground IO demand.
  • mm, oom: introduce oom reaper

So it's not just bad user-space code and distro config/defaults that are at fault - older kernels could have handled this better, and newer kernels do, although kernel OOM is still not ideal.

Besides the effort to improve kernel OOM, there are also userspace out-of-memory manager developments in the form of systemd-oomd. Unfortunately, Ubuntu 20.04 LTS didn't include it, as it shipped with an older version of systemd. See Systemd 247 Merges Systemd-OOMD For Improving Low-Memory/Out-Of-Memory Handling.

Comments on options already considered

  1. Disable the swap

Providing at least a small swap partition is recommended (Do we really need swap on modern systems?). Disabling swap not only prevents swapping out unused pages, but it might also affect the kernel's default heuristic overcommit strategy for allocating memory (What does heuristics in Overcommit_memory =0 mean?), as that heuristic does count on swap pages. Without swap, overcommit can still probably work in the heuristic (0) or always (1) modes, but the combination of no swap and the never (2) overcommit strategy is likely a terrible idea. So in most cases, no swap will likely hurt performance.

E.g., think about a long-running process that initially touches memory for once-off work, but then fails to release that memory and keeps running in the background. The kernel will have to use RAM for that until the process ends. Without any swap, the kernel can't page it out for something else that actually wants to actively use RAM. Also think about how many devs are lazy and don't explicitly free up memory after use.

  3. Set a maximum memory ulimit

It only applies per process, and it's probably a reasonable assumption that a process shouldn't request more memory than a system physically has! So it's probably useful, when set generously, to stop a lone crazy process from triggering thrashing.

  4. Keep important programs (X11, bash, kill, top, ...) in memory and never swap those

Nice idea, but then those programs will hog memory they're not actively using. It may be acceptable if the program only requests a modest amount of memory.

The systemd 232 release added some options that make this possible: I think one could use 'MemorySwapMax=0' to prevent a unit (service) like ssh from having any of its memory swapped out.
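
A minimal sketch of what that could look like as a drop-in, assuming the unit is named ssh.service (some distros call it sshd.service) and that the unified cgroup v2 hierarchy is in use, since MemorySwapMax= only applies there:

    # /etc/systemd/system/ssh.service.d/no-swap.conf
    [Service]
    MemorySwapMax=0

Then sudo systemctl daemon-reload && sudo systemctl restart ssh to apply it.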

Nonetheless, being able to prioritise memory access would be better.

OOM, ulimit and trading off integrity for responsiveness

Swap thrashing (when the working set of memory, i.e. the pages being read and written to in a given short time-frame, exceeds the physical RAM) will always saturate storage I/O - no kernel wizardry can save a system from this without killing a process or two...

There were hopes that Linux kernel OOM tweaks coming along in more recent kernels would recognise that the working set memory pressure exceeds physical memory, and in this situation, identify and kill the most appropriate process, i.e. the one greedily using the most memory. "Working set" memory is the active/in-use memory that processes depend on and access frequently.

The traditional kernel OOM did not recognise the situation when the working set exceeded the physical memory (RAM), and so the thrashing problem happened. A big swap partition made the problem worse, as it could look as if the system still had virtual memory headroom while the kernel merrily overcommitted and served more memory requests; the working set could spill over into swap, effectively trying to treat storage as if it were RAM.

On servers, kernel OOM seemed to accept the performance penalty of thrashing as a determined, slow, don't-lose-data trade-off. On desktops, the trade-off is different, and users would prefer a bit of data loss (process sacrifice) to keep things responsive.

This was a nice comical analogy about OOM: oom_pardon, aka don't kill my xlock

Incidentally, OOMScoreAdjust is another systemd option to help weight processes and avoid OOM-killing those considered more important.
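
A hedged sketch of using it via a drop-in (the unit name and value are only illustrative; negative values make the kernel OOM killer less likely to pick the process, with -1000 disabling it entirely):

    # /etc/systemd/system/ssh.service.d/oom.conf
    [Service]
    OOMScoreAdjust=-500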

OOMD, now baked into systemd as systemd-oomd, can move this problem out of kernel space and into user space.
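
On a release that ships it (systemd 247 or later), it's easy to check whether it's running and what it's watching; oomctl is the companion tool installed alongside systemd-oomd:

systemctl status systemd-oomd
oomctl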

buffered writeback

"Make background writeback not suck" explained how the Linux "Page Cache" (block storage buffering) would cause some issues when the cache was being flushed to storage. A process requesting too much RAM could force the need to free memory from the page cache triggering a large page write-back from memory to storage. Unrestrained write-back would get too much priority on IO and cause IO wait (blocking) for other processes. And if the kernel ran out of page cache to sacrifice, swapping out other process memory (more write to disk) would continue to compete for IO, continuing the IO contention and starving other processes from storage access. It's not the cause thrashing problem itself, but it does add to the overall degradation in responsiveness.

Thankfully, with newer kernels, block: hook up writeback throttling should limit the impact of page cache being written back to storage. I'd assume this also applies to background process memory being swapped out to storage.

ulimits limitation

One problem with ulimits is that the accounting and limit apply to the virtual memory address space (which implies combining both physical and swap space). As per man limits.conf:

       rss
          maximum resident set size (KB) (Ignored in Linux 2.4.30 and
          higher)

So setting a ulimit to apply just to physical RAM usage doesn't look usable anymore. Hence

      as
          address space limit (KB)

seems to be the only respected tunable.

Unfortunately, as detailed more by the example of WebKit/Gnome, some applications can't run if virtual address space allocation is limited.

cgroups can help

Currently, it seems cumbersome, but possible, to enable some kernel cgroup flags (cgroup_enable=memory swapaccount=1, e.g. in the grub config) and then try to use the cgroup memory controller to limit memory use.
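
A sketch of what that grub change might look like on an Ubuntu-style system (keep whatever options your existing GRUB_CMDLINE_LINUX already has; these flags are just appended):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"

Then sudo update-grub and reboot for the flags to take effect.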

cgroups have more advanced memory limit features than the 'ulimit' options. CGroup v2 notes hint at attempts to improve on how ulimits worked:

The combined memory+swap accounting and limiting is replaced by real control over swap space.

CGroup options can be set via systemd resource control options (a combined sketch follows the lists below). E.g.:

  • MemoryHigh
  • MemoryMax

Other useful options might be

  • IOWeight
  • CPUShares
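
A minimal sketch of a drop-in combining these (the unit name and values are purely illustrative; MemoryHigh= needs the unified cgroup v2 hierarchy, and CPUShares= is the older cgroup v1 name, with CPUWeight= as the v2 equivalent):

    # /etc/systemd/system/some-heavy-app.service.d/resources.conf
    [Service]
    MemoryHigh=2G
    MemoryMax=4G
    IOWeight=100
    CPUShares=512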

These have some drawbacks:

  1. Overhead. Docker documentation briefly mentioned 1% extra memory use and 10% performance degradation (probably with regard to memory allocation operations - it doesn't really specify).
  2. Cgroup/systemd stuff was heavily re-worked a while ago, so the flux upstream implied Linux distro vendors were waiting for it to settle first.

In CGroup v2, the docs suggest that memory.high should be a good option to throttle and manage memory use by a process group. However, this quote from 2015 suggested that monitoring memory pressure situations needed more work:

A measure of memory pressure - how much the workload is being impacted due to lack of memory - is necessary to determine whether a workload needs more memory; unfortunately, memory pressure monitoring mechanism isn't implemented yet.

In 2021, it seems that work has been completed. See these references, where Facebook did some work and analysis:

  • cgroup2: Memory Controller
  • cgroup2: PSI Pressure Metrics
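
On a kernel with PSI enabled (4.20+ built with CONFIG_PSI), the pressure metrics can be read directly; each file reports avg10/avg60/avg300 percentages of time tasks were stalled waiting on that resource:

cat /proc/pressure/memory
cat /proc/pressure/io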

Given that the systemd and cgroup user-space tools are complex, in the past there was no simple way to set something appropriate, and so I didn't take it further. The cgroup and systemd documentation for Ubuntu isn't great.

Finally, newer desktop editions will likely leverage cgroups and systemd-oomd so that thrashing doesn't start under high memory pressure. However, even more future work could be done to ensure that ssh and the X server/window manager components get higher-priority access to CPU, physical RAM and storage IO, and avoid competing with less important processes. The kernel's CPU and I/O priority features have been around for a while; it seems to be priority access to physical RAM that was lacking, but that has now been addressed with cgroups v2.

For Ubuntu 20.04, the CPU and IO priorities are not set at all. When I checked the systemd cgroup limits, CPU shares, etc., as far as I could tell, neither Ubuntu 16.04 nor even 20.04 has baked in any pre-defined prioritisations. E.g. I ran:

systemctl show dev-mapper-Ubuntu\x2dswap.swap

I compared that to the same output for ssh, samba, gdm and nginx. Important things like the GUI and remote admin console have to fight equally with all other processes if thrashing happens.

I ran:

grep -r MemoryMax /etc/systemd/system /usr/lib/systemd/system/

And found that no unit files apply any MemoryMax limits by default. So nothing is limited out of the box, and the systemd cgroup memory limit config needs to be set explicitly by the sysadmin.

Example memory limits I have on a 16GB RAM system

I wanted to enable hibernate, so I needed a big swap partition. Hence my attempts to mitigate with ulimits, etc.

ulimit

I put * hard as 16777216 in /etc/security/limits.d/mem.conf such that no single process would be allowed to request more memory than is physically available. It won't prevent thrashing altogether, but without it, just a single process with greedy memory use, or a memory leak, can cause thrashing. E.g. I've seen gnome-contacts suck up 8GB+ of memory when doing mundane things like updating the global address list from an Exchange server...

(Screenshot: Gnome contacts chewing up RAM)
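
Concretely, the line added to /etc/security/limits.d/mem.conf is (the "as" item is the address space limit in KB, so 16777216 KB = 16GB):

    *    hard    as    16777216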

As seen with ulimit -S -v, many distros have this hard and soft limit set to 'unlimited', given that, in theory, a process could end up requesting lots of memory but only actively using a subset, and run happily thinking it's been given, say, 24GB of RAM while the system only has 16GB. The above hard limit will cause processes that might otherwise have run fine to abort when the kernel denies their greedy speculative memory requests.

However, it also catches insane things like gnome-contacts, and instead of losing my desktop responsiveness, I get a "not enough free memory" error:

(Screenshot: gnome-contacts "not enough free memory" error)

Complications setting ulimit for address space (virtual memory)

Unfortunately, some developers like to pretend virtual memory is an infinite resource, and setting a ulimit on virtual memory can break some apps. E.g. WebKit (which some gnome apps depend on for integrating web content) added a gigacage security feature which tries to allocate insane amounts of virtual memory, and FATAL: Could not allocate gigacage memory errors happen, with the cheeky hint Make sure you have not set a virtual memory limit. The work-around, GIGACAGE_ENABLED=no, forgoes the security benefits; but likewise, not being allowed to limit virtual memory allocation also forgoes a security feature (e.g. resource control that can prevent denial of service). Ironically, between the gigacage and gnome devs, they seem to forget that limiting memory allocation is itself a security control. And sadly, I noticed the gnome apps that rely on gigacage don't bother to explicitly request a higher limit, so even a soft limit breaks things in this case. As per Debian webkit-team NEWS:

webkit-based applications may not run if their maximum virtual memory size is restricted (e.g. using ulimit -v)
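
For reference, the workaround mentioned above is just an environment variable set when launching the affected application (the application name here is a placeholder):

GIGACAGE_ENABLED=no some-webkit-app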

To be fair, if the kernel did a better job of being able to deny memory allocation based on resident memory use instead of virtual memory, then pretending virtual memory is unlimited would be less dangerous.

overcommit

If you prefer applications to be denied memory requests and want to stop overcommitting, use the commands below to test how your system behaves under high memory pressure.

In my case, the default commit ratio was:

$ sysctl vm.overcommit_ratio
vm.overcommit_ratio = 50

But it only comes into full effect when changing the policy to disable overcommitting and apply the ratio:

sudo sysctl -w vm.overcommit_memory=2

The ratio implied only 24GB of memory could be allocated overall (16GB RAM * 0.5 + 16GB swap). So I'd probably never see OOM show up, and would effectively be less likely to have processes constantly accessing memory in swap. But I'd also likely sacrifice overall system efficiency.
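
The kernel's view of this can be checked in /proc/meminfo, where CommitLimit should roughly equal swap + RAM × overcommit_ratio and Committed_AS shows how much has been promised so far:

grep -i commit /proc/meminfo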

This will cause many applications to crash, given it's common for devs not to gracefully handle the OS declining a memory allocation request. It trades the occasional risk of a drawn-out lockup due to thrashing (losing all your work after a hard reset) for a more frequent risk of various apps crashing. In my testing, it didn't help much because the desktop itself crashed when the system was under memory pressure and couldn't allocate memory. However, at least consoles and SSH still worked.

How does VM overcommit memory work has more info.

I chose to revert this to the default, sudo sysctl -w vm.overcommit_memory=0, given that the whole desktop graphical stack and the applications in it crash nonetheless.

Conclusion

  • The newer systemd-oomd should make things better and kill a greedy process before physical memory pressure gets too high.
  • Traditional kernel OOM didn't work well because it acted only once physical and swap memory space were exhausted.
  • ulimits and other ways of limiting overcommit don't work well for desktop software, because many apps request excessively large memory spaces (e.g. apps using WebKit) or fail to gracefully handle out-of-memory errors when the kernel denies a memory request.