OOM kills a process while there is still a lot of available memory
I am aware that there are many threads like this one, but none of them actually gives a methodology for reading an OOM message.
I have a database server that is monitored through NRPE for available memory, using a shell script check that runs free -m, greps out the available memory, and compares it against a threshold.
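Simplified, the check does roughly the following (the threshold value and the exact parsing here are placeholders, not my real script):

    #!/bin/bash
    # Simplified stand-in for the NRPE check described above; threshold is a placeholder.
    THRESHOLD_MB=1024

    # 7th field of the "Mem:" line of free -m is the "available" column on this system
    AVAILABLE_MB=$(free -m | awk '/^Mem:/ {print $7}')

    if [ "$AVAILABLE_MB" -lt "$THRESHOLD_MB" ]; then
        echo "CRITICAL: only ${AVAILABLE_MB} MB available"
        exit 2
    fi
    echo "OK: ${AVAILABLE_MB} MB available"
    exit 0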
MySQL was killed by the OOM killer without that check ever being triggered. The kernel message was:
[45948878.475805] master invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0
[45948878.483847] master cpuset=/ mems_allowed=0
[45948878.488366] CPU: 9 PID: 1095 Comm: master Not tainted 3.10.0-514.16.1.el7.x86_64 #1
[45948878.496464] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[45948878.506124] ffff880ee42d5e20 0000000076e51669 ffff880ee4b43ae8 ffffffff81686ac3
[45948878.514480] ffff880ee4b43b78 ffffffff81681a6e ffffffff812aedab ffff880f02e96ad0
[45948878.522887] ffff880ee4b43bc0 ffffffff00000206 fffeefff00000000 0000000000000008
[45948878.531320] Call Trace:
[45948878.534171] [<ffffffff81686ac3>] dump_stack+0x19/0x1b
[45948878.539728] [<ffffffff81681a6e>] dump_header+0x8e/0x225
[45948878.545437] [<ffffffff812aedab>] ? cred_has_capability+0x6b/0x120
[45948878.552019] [<ffffffff81184c7e>] oom_kill_process+0x24e/0x3c0
[45948878.558306] [<ffffffff81093c0e>] ? has_capability_noaudit+0x1e/0x30
[45948878.565052] [<ffffffff811854b6>] out_of_memory+0x4b6/0x4f0
[45948878.571022] [<ffffffff81682577>] __alloc_pages_slowpath+0x5d7/0x725
[45948878.577769] [<ffffffff8118b5c5>] __alloc_pages_nodemask+0x405/0x420
[45948878.584562] [<ffffffff8108353d>] copy_process+0x1dd/0x1960
[45948878.590536] [<ffffffff81084e71>] do_fork+0x91/0x2c0
[45948878.595893] [<ffffffff81085126>] SyS_clone+0x16/0x20
[45948878.601405] [<ffffffff816974d9>] stub_clone+0x69/0x90
[45948878.606940] [<ffffffff81697189>] ? system_call_fastpath+0x16/0x1b
[45948878.613511] Mem-Info:
[45948878.616212] active_anon:14681394 inactive_anon:285983 isolated_anon:92
active_file:2380 inactive_file:3203 isolated_file:76
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:94372 slab_unreclaimable:33256
mapped:11621 shmem:768108 pagetables:32080 bounce:0
free:93898 free_pcp:64 free_cma:0
[45948878.652587] Node 0 DMA free:15908kB min:16kB low:20kB high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[45948878.694520] lowmem_reserve[]: 0 2977 60382 60382
[45948878.700271] Node 0 DMA32 free:230644kB min:3328kB low:4160kB high:4992kB active_anon:2476892kB inactive_anon:24492kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129332kB managed:3049456kB mlocked:0kB dirty:0kB writeback:0kB mapped:156kB shmem:72884kB slab_reclaimable:156940kB slab_unreclaimable:34796kB kernel_stack:19120kB pagetables:5252kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[45948878.746051] lowmem_reserve[]: 0 0 57404 57404
[45948878.751495] Node 0 Normal free:63356kB min:64236kB low:80292kB high:96352kB active_anon:56248684kB inactive_anon:1119440kB active_file:11540kB inactive_file:19792kB unevictable:0kB isolated(anon):368kB isolated(file):304kB present:59768832kB managed:58782432kB mlocked:0kB dirty:0kB writeback:0kB mapped:46328kB shmem:2999548kB slab_reclaimable:220548kB slab_unreclaimable:98228kB kernel_stack:25488kB pagetables:123068kB unstable:0kB bounce:0kB free_pcp:5208kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5126 all_unreclaimable? no
[45948878.800106] lowmem_reserve[]: 0 0 0 0
[45948878.804880] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[45948878.821646] Node 0 DMA32: 2212*4kB (UEM) 1679*8kB (UEM) 845*16kB (UEM) 1332*32kB (UEM) 648*64kB (UEM) 330*128kB (UEM) 192*256kB (UEM) 38*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 230744kB
[45948878.841497] Node 0 Normal: 16437*4kB (U) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 65748kB
[45948878.855460] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[45948878.864552] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[45948878.873384] 773511 total pagecache pages
[45948878.877730] 0 pages in swap cache
[45948878.881437] Swap cache stats: add 0, delete 0, find 0/0
[45948878.887060] Free swap = 0kB
[45948878.890329] Total swap = 0kB
[45948878.893592] 15728539 pages RAM
[45948878.897028] 0 pages HighMem/MovableOnly
[45948878.901251] 266590 pages reserved
Am I monitoring memory usage incorrectly? How should I read this message? I don't know how to interpret the memory values in it. What can I check on the system to prevent this from happening again?
Solution 1:
Usually your monitoring interval is far too long to catch issues where a process quickly grabs all available memory. Aggressively configured Icinga or Nagios checks run every minute, so on average there are 30 seconds between a problem occurring and the check running. In the specific case of OOM, the memory pressure itself will probably delay the check by a few seconds, and once the check finally executes, the OOM kill has already happened and the memory has been freed, so the check returns OK.
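If you want monitoring to catch the event itself rather than the already-released memory, it is more reliable to alert on OOM-killer invocations in the kernel log than to poll free memory. A minimal sketch, assuming a systemd host with journalctl (on older setups grep dmesg or /var/log/messages instead); the one-hour window is arbitrary:

    #!/bin/bash
    # Alert if the kernel logged an OOM kill recently; the time window is illustrative.
    if journalctl -k --since "1 hour ago" | grep -q "invoked oom-killer"; then
        echo "CRITICAL: oom-killer was invoked within the last hour"
        exit 2
    fi
    echo "OK: no recent OOM kills"
    exit 0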
If you keep graphs of your monitoring results, it might be a good idea to take a look at them.
To fix the root cause of your issue, you should probably take a look at the following article: How does vm.overcommit_memory work?
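For reference, the knobs that article discusses are ordinary sysctls; a quick sketch of how to inspect and change them (the values shown are purely illustrative, not a recommendation for your workload):

    # Current policy: 0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting
    sysctl vm.overcommit_memory vm.overcommit_ratio

    # Example only: switch to strict accounting so allocations fail up front
    # instead of the OOM killer firing later. Read the article before doing this.
    sysctl -w vm.overcommit_memory=2
    sysctl -w vm.overcommit_ratio=80

    # Persist across reboots
    printf 'vm.overcommit_memory = 2\nvm.overcommit_ratio = 80\n' > /etc/sysctl.d/90-overcommit.conf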
Solution 2:
According to this line
[45948878.751495] Node 0 Normal free:63356kB min:64236kB low:80292kB high:96352kB active_anon:56248684kB inactive_anon:1119440kB active_file:11540kB inactive_file:19792kB unevictable:0kB isolated(anon):368kB isolated(file):304kB present:59768832kB managed:58782432kB mlocked:0kB dirty:0kB writeback:0kB mapped:46328kB shmem:2999548kB slab_reclaimable:220548kB slab_unreclaimable:98228kB kernel_stack:25488kB pagetables:123068kB unstable:0kB bounce:0kB free_pcp:5208kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5126 all_unreclaimable? no
your system's Normal zone has a minimum free-memory watermark of 64236 kB (~64 MB), but only 63356 kB was free in that zone. Because free memory dropped below the minimum watermark, the OOM killer was triggered.
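You can inspect the same watermarks on a running system rather than waiting for an OOM report; a small sketch (the /proc values are in 4 kB pages, so multiply by 4 to compare with the kB figures above; exact layout varies between kernel versions):

    # Per-zone free pages and min/low/high watermarks (values in 4 kB pages)
    grep -E '^Node|^ +(pages free|min|low|high) ' /proc/zoneinfo

    # The global knob the per-zone min watermarks are derived from
    sysctl vm.min_free_kbytes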
In addition, your memory was very heavily fragmented: only 4 kB blocks were free in the Normal zone, while the failed allocation was order=2 (see gfp_mask/order in the first line of the report), i.e. a contiguous 16 kB block. See this line for details:
[45948878.841497] Node 0 Normal: 16437*4kB (U) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 65748kB
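You can watch the same per-order free counts on a live system through /proc/buddyinfo; there is one column per allocation order (4 kB, 8 kB, 16 kB, ...), so an order=2 request needs a non-zero count in the third numeric column or a larger block that can be split. A quick sketch:

    # Free blocks per zone, one column per order (4 kB, 8 kB, 16 kB, ...)
    cat /proc/buddyinfo

    # Order-2 (16 kB) free blocks in the Normal zone ($5..$7 are orders 0..2)
    awk '/Normal/ {print "order-2 free blocks in Normal:", $7}' /proc/buddyinfo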
It seems the applications you use have memory usage patterns that badly fragment your memory. Adding swap to your system may help reduce the fragmentation; note that your report shows Total swap = 0kB, so there is currently none at all.
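If you decide to add swap, a minimal sketch using a swap file (the size and path are placeholders; sizing depends on your workload):

    # Create and enable a 4 GB swap file; the size is purely illustrative
    dd if=/dev/zero of=/swapfile bs=1M count=4096
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile

    # Make it persistent across reboots
    echo '/swapfile swap swap defaults 0 0' >> /etc/fstab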
Why do you think you had a lot of available memory? Are you using some memory statistics application that cannot monitor spikes in memory usage?