How to get the Linux OOM killer to not kill my process?
This appears to be a problem caused by a combination of two factors:
- Using a virtual machine.
- A possible kernel bug.
Part of the explanation lies in this line, which describes why this happens:
Mar 7 02:43:11 myhost kernel: memcheck-amd64- invoked oom-killer: gfp_mask=0x24002c2, order=0, oom_score_adj=0
The other line is this:
Mar 7 02:43:11 myhost kernel: 0 pages HighMem/MovableOnly
The first line shows the GFP mask assigned to the allocation. It basically describes what the kernel is allowed/not allowed to do to satisfy this request. The mask indicates a bunch of standard flags; the last digit, '2', however, indicates that the memory allocation should come from the `HighMem` zone.
If you look closely at the OOM output, you'll see that no `HighMem`/`Normal` zone actually exists:
Mar 7 02:43:11 myhost kernel: Node 0 DMA: 20*4kB (UM) 17*8kB (UM) 13*16kB (M) 14*32kB (UM) 8*64kB (UM) 4*128kB (M) 4*256kB (M) 0*512kB 1*1024kB (M) 0*2048kB 0*4096kB = 3944kB
Mar 7 02:43:11 myhost kernel: Node 0 DMA32: 934*4kB (UM) 28*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3960kB
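If you want to confirm which zones the kernel actually set up on a given machine, that information is exported under /proc; a minimal check (standard procfs, nothing distribution-specific assumed) might be:

```
# Per-zone free-page counts by order; this is the same breakdown the
# OOM report prints in its "Node 0 DMA/DMA32" lines:
cat /proc/buddyinfo

# Just the names of the zones that exist on this node:
grep '^Node' /proc/zoneinfo
```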
`HighMem` (generally also called `Normal` on x86_64) covers memory above the standard 896MiB range that the kernel can access directly on 32-bit systems. On x86_64, `HighMem`/`Normal` seems to cover all pages above 3GiB.
`DMA32` is a zone for memory that is accessible to 32-bit DMA devices, that is, memory you can address with 4-byte pointers. I believe `DMA` is for 16-bit DMA devices.
Generally speaking, on low-memory systems `Normal` wouldn't exist, given that `DMA32` already covers all available virtual addresses.
The reason for the OOM kill is that there is a memory allocation request for a `HighMem` zone with 0 pages available. Given that the out-of-memory handler has absolutely no way to give this zone pages to use, whether by swapping, killing other processes or any other trick, the OOM killer just kills the process.
I believe this is caused by the host ballooning the VM during boot. On KVM systems, there are two values you can set:
- The current memory.
- The available memory.
The way this works is that you can hot-add memory to your server up to the available memory. Your system, however, is actually given the current memory.
When a KVM VM boots up, it starts with the maximum amount of memory it could be given (the available memory). During the boot phase, KVM gradually claws this memory back via ballooning, leaving you instead with the current memory setting you have.
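If you have access to the host, you can watch the balloon at work. Assuming a libvirt-managed KVM guest (the domain name here is a placeholder, and exact field names vary a bit by libvirt version), something like this shows the configured maximum versus what the guest currently has:

```
# Maximum and current memory as libvirt sees them for the domain:
virsh dominfo YOUR_GUEST | grep -i memory

# Balloon statistics; "actual" is the current balloon target in KiB:
virsh dommemstat YOUR_GUEST
```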
It's my belief that's what happened here. Linode allows you to expand the memory, giving you much more at system start.
This means that there is a `Normal`/`HighMem` zone at the beginning of the system's lifetime. When the hypervisor balloons it away, the Normal zone rightly disappears from the memory manager. But I suspect that the flag marking that zone as available to allocate from is not cleared when it should be. This leads the kernel to attempt to allocate from a zone that does not exist.
In terms of resolving this, you have two options:
- Bring this up on the kernel mailing lists to see whether this really is a bug, expected behaviour, or nothing at all to do with what I'm saying.
- Request that Linode set the 'available memory' on the system to the same 1GiB assignment as the 'current memory'. Thus the system never balloons and never gets a `Normal` zone at boot, keeping the flag clear. Good luck getting them to do that!
You should be able to test whether this is the case by setting up your own VM in KVM with available memory set to 6GiB and current memory set to 1GiB, then running your test with the same kernel to see if the behaviour you describe above occurs. If it does, change the 'available' setting to equal the 1GiB 'current' and repeat the test.
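As a rough sketch of that test on a libvirt/KVM host (the domain name is a placeholder and the sizes are given in KiB; adjust to your setup):

```
# First run: 6GiB "available" ceiling, 1GiB "current" memory.
virsh setmaxmem testvm 6291456 --config
virsh setmem    testvm 1048576 --config
virsh start testvm        # boot the guest and re-run the failing workload

# Second run: make the ceiling equal to current so the guest never balloons.
virsh shutdown testvm
virsh setmaxmem testvm 1048576 --config
virsh setmem    testvm 1048576 --config
virsh start testvm        # repeat the test
```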
I'm making a bunch of educated guesses here and reading between the lines somewhat to come up with this answer, but what I'm saying seems to fit the facts already outlined.
I suggest testing my hypothesis and letting us all know the outcome.
To answer your headline question: use `oom_score_adj` (kernel >= 2.6.36) or, for earlier kernels (>= 2.6.11), `oom_adj`; see `man proc`:
/proc/[pid]/oom_score_adj (since Linux 2.6.36) This file can be used to adjust the badness heuristic used to select which process gets killed in out-of-memory conditions...
/proc/[pid]/oom_adj (since Linux 2.6.11) This file can be used to adjust the score used to select which process should be killed in an out-of-memory (OOM) situation...
There's lots more to read, but setting `oom_score_adj` to -1000 or `oom_adj` to -17 will achieve what you want.
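For example, as root, with PID being the process you want to protect (same placeholder style as elsewhere on this page):

```
# Kernels >= 2.6.36: -1000 disables OOM killing for this process entirely.
echo -1000 > /proc/PID/oom_score_adj

# Kernels >= 2.6.11: -17 (OOM_DISABLE) is the older equivalent.
echo -17 > /proc/PID/oom_adj
```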
The trouble is that something else will be killed instead. Perhaps it would be better to determine why the OOM killer is being invoked and deal with that.
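The kernel log records every OOM-killer invocation (including the zone dump quoted above), so a quick place to start is:

```
# OOM-related messages from the kernel ring buffer:
dmesg | grep -iE 'oom|killed process'

# On systemd-based systems, the same from the journal:
journalctl -k | grep -i oom
```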
Several thoughts (from my comments above), and links to interesting reads about your situation:
I recommend that you check: 1) that you can address more than 3GB with your current kernel, config and CPU (because if 3GB is a limit for your system and OS, you are exceeding it); 2) that swapping is allowed and the swap subsystem is in place and working (I won't explain how, as it depends on your settings and specifics; search engines will help); and 3) that you are not overflowing a kernel table (number of PIDs, or anything else; some of these can be set at kernel compile time). Good luck.
Check that the whole thing (hardware, or the VM's simulated hardware, etc.) is 64-bit (see for example: https://askubuntu.com/questions/313379/i-installed-a-64-bit-os-in-a-32-bit-processor/313381). The CPU, host OS, VM subsystem and guest OS should all be 64-bit enabled, otherwise you won't have a real 64-bit VM. A quick check for these points is sketched just below.
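A quick sanity check for those points from inside the guest, assuming only standard procps/coreutils tools, might look like this:

```
# Is the running kernel 64-bit? (expect x86_64)
uname -m

# Does the CPU exposed to the guest advertise long mode (64-bit)?
grep -qw lm /proc/cpuinfo && echo "CPU is 64-bit capable"

# Is swap configured and active, and how much memory is really there?
cat /proc/swaps
free -h
```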
Some good reads:
- How to diagnose causes of oom-killer killing processes: on this very site. Gives a nice link, http://linux-mm.org/OOM_Killer, and another user introduces the atop tool, which could help diagnose what happens
- https://www.kernel.org/doc/gorman/html/understand/understand016.html : section 3.12 shows the decision tree for the OOM killer (of an older version, so ymmv; read the source ^^)
- https://lwn.net/Articles/317814/ : "Taming the OOM killer" shows, among other things, how to create an "invincible" group and put your process in it
- http://eloquence.marxmeier.com/sdb/html/linux_limits.html : shows some common kernel limits
- http://win.tue.nl/~aeb/linux/lk/lk-9.html : is a good read (wasted space with some allocation methods, etc.) even though it is really dated (and thus addresses only 32-bit architectures...)
- http://bl0rg.krunch.be/oom-frag.html : shows some reasons why "The OOM killer may be called even when there is still plenty of memory available"
- https://stackoverflow.com/questions/17935873/malloc-fails-when-there-is-still-plenty-of-swap-left : gives a way to check how your system allocates memory
- https://serverfault.com/a/724518/146493 : an excellent answer giving a detailed way to find out what really happened (but... it targeted a 32-bit question, so ymmv. Still an incredible read and answer, though.)
And finally: http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html shows a way to prevent your process from being targeted by the oom killer (`echo -17 > /proc/PROCESSPID/oom_adj`). This could be prone to change, and could be a bad thing (it causes other kinds of failures, as the system now can't simply kill the main offender...), so use with caution. (@iain: note that `oom_adj` is for older kernels and should be replaced by `oom_score_adj` in newer ones. Thanks, Iain!)
Besides lowering `oom_score_adj` for the process in question, as already mentioned (which probably won't help much on its own: it makes that process less likely to be killed first, but as it is the only memory-intensive process, the system probably won't recover until it is finally killed), here are a few things to tweak (example commands follow the list):
- if you set `vm.overcommit_memory=2`, also tweak `vm.overcommit_ratio` to maybe 90 (alternatively, set `vm.overcommit_memory=0`; see the kernel overcommit docs)
- increase `vm.min_free_kbytes` in order to always keep some physical RAM free and thus reduce the chances of the OOM killer needing to kill something (but do not overdo it, as it will OOM instantly)
- increase `vm.swappiness` to 100 (to make the kernel swap more readily)
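A sketch of those tweaks using sysctl; the values are the suggestions from the list above, not universally correct ones, so test before persisting them in /etc/sysctl.conf:

```
# Strict overcommit accounting, allowing ~90% of RAM + swap to be committed...
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=90
# ...or fall back to the default heuristic overcommit instead:
# sysctl -w vm.overcommit_memory=0

# Keep a reserve of free physical RAM (example value, in KiB; don't overdo it):
sysctl -w vm.min_free_kbytes=65536

# Swap as readily as possible:
sysctl -w vm.swappiness=100
```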
Note that if you have too little memory to accomplish the task at hand, even if you do not OOM, it may (or may not) become EXTREMELY slow: in extreme cases a half-hour job (on a system with enough RAM) can easily take several weeks to complete when RAM is replaced with swap, or even hang the whole VM. That is especially the case if swap is on classic rotational disks (as opposed to SSDs), due to the massive random reads/writes which are very expensive on them.
I would try to enable overcommit and see if that helps. Your process seems to fail inside a `fork` call, which requires as much virtual memory as the initial process had. `overcommit_memory=2` doesn't make your process immune to the OOM killer; it just prevents your process from triggering it by allocating too much. Other processes may produce unrelated allocation errors (e.g. getting a contiguous memory block), which still trigger the OOM killer and get your process disposed of.
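If you do go the `overcommit_memory=2` route, you can watch the commit accounting to see how close that `fork` pushes you to the limit (both fields are standard /proc/meminfo entries):

```
# Current overcommit policy (0 = heuristic, 1 = always, 2 = strict):
cat /proc/sys/vm/overcommit_memory

# Committed address space versus the commit limit under strict accounting:
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
```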
Alternatively (and more to the point), as several comments suggest, buy more RAM.