Avoid Linux out-of-memory application teardown

By default Linux has a somewhat brain-damaged concept of memory management: it lets you allocate more memory than your system has, then terminates a process, seemingly at random, when it gets in trouble. (The actual semantics of what gets killed are more complex than that; Google "Linux OOM Killer" for lots of details and arguments about whether it's a good or bad thing.)

To restore some semblance of sanity to your memory management:

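  1. Disable the OOM Killer (put vm.oom-kill = 0 in /etc/sysctl.conf). This key only exists on some older vendor kernels; mainline kernels do not have it, in which case skip this step and rely on step 2.
  2. Disable memory overcommit (put vm.overcommit_memory = 2 in /etc/sysctl.conf).
    Note that this is a three-valued setting: 0 = "estimate whether we have enough RAM", 1 = "always say yes", 2 = "say no if we don't have the memory", where the limit is swap plus vm.overcommit_ratio percent of RAM.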

These settings make Linux behave in the traditional way: if a process requests more memory than is available, malloc() fails, and the process requesting the memory is expected to cope with that failure.
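To get a feel for how much memory mode 2 will actually hand out, here is a rough sketch of the commit-limit arithmetic. It assumes the limit is swap plus vm.overcommit_ratio percent of RAM (the ratio defaults to 50) and ignores huge pages and vm.overcommit_kbytes. commit_limit_kb is a made-up helper name, not a system command.

```shell
# Hypothetical helper: estimate the mode-2 commit limit in kB.
#   $1 = RAM in kB, $2 = swap in kB, $3 = vm.overcommit_ratio (default 50)
# Assumption: limit = swap + RAM * ratio / 100 (huge pages ignored).
commit_limit_kb() {
    local ram_kb=$1 swap_kb=$2 ratio=${3:-50}
    echo $(( swap_kb + ram_kb * ratio / 100 ))
}
```

With 16 GiB of RAM and 2 GiB of swap at the default ratio (commit_limit_kb 16777216 2097152), this gives 10485760 kB, i.e. 10 GiB. On a box with little swap you will probably want to raise vm.overcommit_ratio as well, or a large part of your RAM can never be allocated.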

Reboot your machine to make it reload /etc/sysctl.conf, run sysctl -p as root to reload it immediately, or use the proc file system to apply the setting right away without a reboot:

echo 2 > /proc/sys/vm/overcommit_memory 
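Before and after changing the setting it may help to check how close the system is to its limit. The sketch below reads CommitLimit and Committed_AS from /proc/meminfo and prints the difference. commit_headroom_kb is a hypothetical helper name, and the optional file argument exists only so the function can be tried on a saved copy of meminfo.

```shell
# Hypothetical helper: print CommitLimit - Committed_AS in kB.
# Reads /proc/meminfo unless another file is given as $1.
commit_headroom_kb() {
    awk '/^CommitLimit:/  { limit = $2 }
         /^Committed_AS:/ { used = $2 }
         END { printf "%.0f\n", limit - used }' "${1:-/proc/meminfo}"
}
```

cat /proc/sys/vm/overcommit_memory shows the mode currently in effect. In modes 0 and 1 the number printed by the helper can be negative, since the limit is not enforced there.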

The overcommit settings are described in the kernel's sysctl documentation, see http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt#514

The short answer, for a server, is buy and install more RAM.

If a server experiences OOM (Out-Of-Memory) errors routinely, that is not a good thing, regardless of how the VM (virtual memory) manager's overcommit sysctl option is set in the Linux kernel.

Upping the amount of swap (disk space the kernel's memory manager uses for virtual memory that has been paged out) will help if the current value is low and the usage involves many tasks each using a large amount of memory, rather than one or a few processes each requesting a huge share of the total virtual memory available (RAM + swap).

For many applications, allocating more than two times (2x) the amount of RAM as swap provides diminishing returns. In some large computational simulations, it may still be acceptable if the slow-down is bearable.

With RAM (ECC or not) being quite affordable in modest quantities, e.g. 4-16 GB, I have to admit I haven't experienced this problem myself in a long time.

The basics of looking at memory consumption include free and top (sorted by memory usage), the two most common quick ways to evaluate memory usage patterns. Be sure you understand, at the very least, the meaning of each field in the output of those commands.
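For a quick non-interactive view, something like the following can stand in for top sorted by memory. It is only a sketch: by_rss is a made-up helper that sorts "pid rss command" lines by the rss column, and the ps invocation in the comment assumes the procps version of ps found on most Linux systems.

```shell
# Hypothetical helper: read "pid rss command" lines on stdin and print
# the N largest by resident set size (column 2, in kB). N defaults to 5.
by_rss() {
    sort -k2,2nr | head -n "${1:-5}"
}

# Typical use (the trailing "=" suppresses the ps header line):
#   ps -eo pid=,rss=,comm= | by_rss 10
```

Inside top itself, pressing M (shift+m) sorts the process list by memory, and free -m shows the system totals in MiB.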

With no specifics about the applications (e.g. database, network service server, real-time video processing) or the server's usage (a few power users, hundreds to thousands of user/client connections), I cannot think of any general recommendations for dealing with the OOM problem.

You can use ulimit to cap the amount of memory a process is allowed to claim; allocations beyond the cap fail, which usually makes the process exit. It's very useful if your problem is one or a few runaway processes that crash your server.
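As a minimal sketch of the ulimit approach, assuming bash (where ulimit -v takes the cap in KiB), a small wrapper can apply the limit to a single command without touching the rest of the shell session. run_limited is a hypothetical name chosen for illustration.

```shell
# Hypothetical wrapper: run a command with its virtual memory capped.
#   $1 = cap in KiB (bash's unit for ulimit -v), rest = the command to run.
run_limited() {
    local limit_kib=$1
    shift
    (
        # The subshell keeps the limit from leaking into the calling shell.
        ulimit -v "$limit_kib" || exit 1
        exec "$@"
    )
}
```

For example, run_limited 1048576 some_command (some_command being a placeholder) caps that command at 1 GiB of address space, so a runaway allocation fails inside that process instead of dragging down the whole server.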

If your problem is that you simply don't have enough memory to run the services you need, there are only three solutions:

  1. Reduce the memory used by your services by limiting caches and the like.

  2. Create a larger swap area. It will cost you in performance, but can buy you some time.

  3. Buy more memory.