CentOS server not using SWAP properly and getting OOM
Recently I've been having some serious memory issues with my server. Just the other day, my server became completely unresponsive, and oom-killer started killing services at random (httpd, php, etc.). I couldn't even SSH into my server, but I was able to ping it.
I did look at the kernel messages log, but there wasn't any clear indication as to what was causing the memory problem. All I could see were the oom-killer messages.
Here is the output of the sar -r command:
03/15/2012
12:00:01 AM kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
12:10:01 AM 2881812 582380 16.81 26652 250192 4192944 0 0.00 0
12:20:01 AM 2883600 580592 16.76 27104 250196 4192944 0 0.00 0
12:30:01 AM 2878576 585616 16.90 27656 250320 4192944 0 0.00 0
12:40:01 AM 2851856 612336 17.68 28312 271540 4192944 0 0.00 0
12:50:01 AM 2843560 620632 17.92 28968 274468 4192944 0 0.00 0
01:00:01 AM 2843892 620300 17.91 29440 274644 4192944 0 0.00 0
01:10:01 AM 22868 3441324 99.34 60764 2947884 4192936 8 0.00 8
01:20:01 AM 13836 3450356 99.60 62064 2882544 4192844 100 0.00 92
01:30:03 AM 14024 3450168 99.60 7820 3040976 4192844 100 0.00 0
01:40:01 AM 18600 3445592 99.46 18720 3039152 4192844 100 0.00 0
01:50:01 AM 25352 3438840 99.27 20048 3034584 4192844 100 0.00 0
02:00:01 AM 22572 3441620 99.35 20872 3036896 4192844 100 0.00 0
02:10:01 AM 21408 3442784 99.38 21776 3038236 4192844 100 0.00 0
02:20:01 AM 23240 3440952 99.33 23168 3032372 4192844 100 0.00 0
02:30:01 AM 72392 3391800 97.91 25100 2981488 4192844 100 0.00 0
02:40:01 AM 70876 3393316 97.95 25824 2981756 4192844 100 0.00 0
02:50:01 AM 74200 3389992 97.86 26464 2981860 4192844 100 0.00 0
03:00:01 AM 64980 3399212 98.12 32616 2982240 4192844 100 0.00 0
03:10:01 AM 63704 3400488 98.16 33564 2984268 4192844 100 0.00 0
03:20:01 AM 59564 3404628 98.28 34592 2988936 4192844 100 0.00 0
03:30:01 AM 53972 3410220 98.44 35740 2992484 4192844 100 0.00 0
03:40:01 AM 89120 3375072 97.43 36472 2956088 4192844 100 0.00 0
03:50:01 AM 88788 3375404 97.44 36920 2956324 4192844 100 0.00 0
04:00:01 AM 78540 3385652 97.73 37740 2964452 4192844 100 0.00 0
04:10:01 AM 21720 3442472 99.37 106636 2892836 4192844 100 0.00 0
04:20:01 AM 22796 3441396 99.34 107172 2890796 4192844 100 0.00 0
04:30:01 AM 30604 3433588 99.12 107812 2884644 4192844 100 0.00 0
04:40:01 AM 32744 3431448 99.05 108568 2875944 4192844 100 0.00 0
Here is top sorted by swapped size:
top - 14:32:01 up 15:37, 1 user, load average: 0.10, 0.10, 0.04
Tasks: 110 total, 3 running, 107 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.5%us, 0.3%sy, 0.0%ni, 98.4%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3464192k total, 2663384k used, 800808k free, 140796k buffers
Swap: 4192944k total, 100k used, 4192844k free, 2073748k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP COMMAND
1975 mysql 15 0 222m 43m 4652 S 0.0 1.3 0:11.82 178m mysqld
1859 named 22 0 161m 5228 1948 S 0.0 0.2 0:00.04 156m named
2144 root 18 0 143m 47m 1072 S 0.0 1.4 0:00.00 95m spamd
2119 root 15 0 143m 49m 2628 S 0.0 1.5 0:01.17 94m spamd
2161 root 15 0 93372 1280 936 S 0.0 0.0 0:00.01 89m pure-ftpd
2163 root 18 0 91016 976 804 S 0.0 0.0 0:00.01 87m pure-authd
20035 root 15 0 91800 3096 2408 S 0.0 0.1 0:00.00 86m sshd
19432 root 15 0 92232 3656 2900 R 0.0 0.1 0:00.00 86m sshd
2377 root 19 0 93268 14m 1940 S 0.0 0.4 0:00.00 76m cpdavd
2380 root 15 0 87824 11m 1520 S 0.0 0.3 0:00.07 74m cpsrvd-ssl
3115 root 15 0 74832 1168 584 S 0.0 0.0 0:00.05 71m crond
18548 root 18 0 73624 3036 236 S 0.0 0.1 0:00.00 68m httpd
19713 nobody 18 0 73760 4460 1584 S 0.0 0.1 0:00.00 67m httpd
19712 nobody 15 0 73760 4484 1584 S 0.0 0.1 0:00.00 67m httpd
19709 nobody 18 0 73624 4460 1584 S 0.0 0.1 0:00.00 67m httpd
19508 nobody 15 0 73760 4600 1680 S 0.0 0.1 0:00.00 67m httpd
19162 nobody 15 0 73756 4640 1708 S 0.0 0.1 0:00.01 67m httpd
19154 nobody 15 0 73756 4656 1728 S 0.0 0.1 0:00.00 67m httpd
19157 nobody 15 0 73756 4696 1740 S 0.0 0.1 0:00.01 67m httpd
19327 nobody 15 0 73756 4700 1740 S 0.0 0.1 0:00.01 67m httpd
19163 nobody 15 0 73756 4768 1836 S 0.0 0.1 0:00.00 67m httpd
19164 nobody 15 0 73756 4788 1856 S 0.0 0.1 0:00.00 67m httpd
2145 root 18 0 73624 5740 2940 S 0.0 0.2 0:00.60 66m httpd
1911 root 20 0 65952 1276 1044 S 0.0 0.0 0:00.01 63m mysqld_safe
For some reason, it says that it's only using 100k SWAP, but that doesn't make any sense. Isn't VIRT the amount of SWAP being used by each process?
* Update *
Here is some more information on the file systems:
# df -T
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/md2 ext3 468924192 17215692 427504176 4% /
/dev/md1 ext3 2030672 58788 1867068 4% /tmp
/dev/md0 ext3 101018 13414 82388 15% /boot
tmpfs tmpfs 1732096 0 1732096 0% /dev/shm
* Update 2 *
Here is the free -m output that I managed to capture yesterday, while the server was in this OOM state:
total used free shared buffers cached
Mem: 3383 3372 10 0 0 6
-/+ buffers/cache: 3365 17
Swap: 4094 4094 0
Solution 1:
I usually sort by memory ("M" in top) to troubleshoot these kinds of things--that shows you the amount of real memory that each process is using (and touching frequently enough to keep it off the least-recently-used queue for being swapped).
VIRT = RES + SWAP
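Taking that identity at face value, the SWAP column in the question's top output can be reproduced from the other two columns. The sketch below is only an illustration: the rows are copied from the question, and the helper names (to_kib, derived_swap_mib) are made up for this example. If the derived values match, the column is just VIRT - RES, not a measurement of what sits on the swap device, which would be consistent with the "Swap: ... 100k used" header line.

```python
def to_kib(field):
    """Convert a top size field such as '222m' or '5228' to KiB."""
    if field.endswith("m"):
        return float(field[:-1]) * 1024
    return float(field)


def derived_swap_mib(virt, res):
    """Return VIRT - RES in MiB, i.e. SWAP according to VIRT = RES + SWAP."""
    return (to_kib(virt) - to_kib(res)) / 1024


# (command, VIRT, RES, SWAP) copied from the top output in the question.
rows = [
    ("mysqld", "222m", "43m", "178m"),
    ("named", "161m", "5228", "156m"),
    ("pure-ftpd", "93372", "1280", "89m"),
]

for command, virt, res, swap in rows:
    derived = derived_swap_mib(virt, res)
    print(f"{command}: VIRT - RES = {derived:.1f}m, SWAP column = {swap}")
```

The derived figures agree with the SWAP column to within about 1 MiB, which is what you would expect given that top has already rounded the displayed values.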
Another thing to check is whether /tmp is a tmpfs file system and if something is writing a lot of data there.
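As a quick way to answer the /tmp question from data that is already posted, here is a small sketch that looks up a mount point in df -T output. The text is the df -T output from the question's update; the function name fs_type is an assumption of this example, not part of any tool.

```python
# df -T output copied from the question's update.
DF_OUTPUT = """\
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/md2 ext3 468924192 17215692 427504176 4% /
/dev/md1 ext3 2030672 58788 1867068 4% /tmp
/dev/md0 ext3 101018 13414 82388 15% /boot
tmpfs tmpfs 1732096 0 1732096 0% /dev/shm
"""


def fs_type(df_text, mount_point):
    """Return (filesystem type, used KiB) for mount_point, or None if absent."""
    for line in df_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 7 and fields[6] == mount_point:
            return fields[1], int(fields[3])
    return None


for mount_point in ("/tmp", "/dev/shm"):
    print(mount_point, fs_type(DF_OUTPUT, mount_point))
```

On that output /tmp is ext3 on /dev/md1 rather than tmpfs, and the one tmpfs mount (/dev/shm) shows 0 KiB used, at least at the moment df was run.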
I am actually a little confused by what I'm seeing. Is this sar output from the interval when your outage occurred, or just the default output? And the top output is from a totally different time, 14:32?
Also, it's not really using swap at the time you took these stats because it doesn't need to: nearly 3G of your memory is currently being used as disk cache ("kbcached"), and you only have kbmemused - kbbuffers - kbcached = 446936 KiB (about 436 MiB) [at 04:40:01] in use by actual processes.
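To make that arithmetic easy to check, here is a sketch that counts both buffers and page cache as reclaimable, the same way the "-/+ buffers/cache" line of free does, and applies it to three rows of the sar -r output from the question. The function name process_memory_kib is just a label chosen for this example.

```python
# (kbmemused, kbbuffers, kbcached) copied from the sar -r output in the question.
SAR_ROWS = {
    "01:00:01": (620300, 29440, 274644),
    "01:10:01": (3441324, 60764, 2947884),
    "04:40:01": (3431448, 108568, 2875944),
}


def process_memory_kib(kbmemused, kbbuffers, kbcached):
    """Memory that is neither buffers nor page cache, in KiB."""
    return kbmemused - kbbuffers - kbcached


for timestamp, (used, buffers, cached) in SAR_ROWS.items():
    in_use = process_memory_kib(used, buffers, cached)
    print(f"{timestamp}: {in_use} KiB (~{in_use / 1024:.0f} MiB) outside "
          f"buffers/cache, {cached} KiB cached")
```

On these figures, the memory outside buffers and cache grows by a little over 100 MiB across the jump between 01:00 and 01:10, while kbcached grows by roughly 2.5 GiB.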
Because no process image is using much memory itself and yet the oom-killer started, I would guess that something started performing a lot of file I/O and dirtying pages faster than they could be written to disk. I'm not really sure that should trigger the oom-killer, though.
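If you want to test the dirty-page guess the next time memory fills up, the Dirty and Writeback fields of /proc/meminfo are the numbers to watch. The following is a minimal sketch, assuming the usual "Name:   value kB" line layout of that file; parse_meminfo and dirty_summary are hypothetical helper names, and the script only reads the file if it exists.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Name:   value kB' lines into {name: value}.

    Most values are in KiB; a few entries (such as the HugePages counters)
    are plain counts and are returned as-is.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def dirty_summary(text):
    """Pick out the fields relevant to pages waiting to be written back."""
    info = parse_meminfo(text)
    return {key: info.get(key) for key in ("Dirty", "Writeback", "Cached")}


if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        print(dirty_summary(handle.read()))
```

Running something like this from cron every minute or so would show whether Dirty climbs sharply right before the machine becomes unresponsive.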
None of these dirty pages would go to swap, because it's about as easy to write the content of the file itself out as it is to write the data to swap.
The obvious guess is that mysqld was doing this, although I would suspect that it would open its files with O_DIRECT, which suggests to the kernel to minimize effects on the cache (with the premise that the DB server is doing its own caching).
Update
Based on your free output from update #2, the answer to the question in your title is that it's using swap just fine; something just used all of it. The other data you provided is normal for a system that has recently booted.
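For reference, this is how I read that free -m output, written as a small sketch. The text is copied from update #2 of the question; parse_free is a made-up helper name for this example.

```python
# free -m output copied from update #2 of the question (values in MiB).
FREE_OUTPUT = """\
total used free shared buffers cached
Mem: 3383 3372 10 0 0 6
-/+ buffers/cache: 3365 17
Swap: 4094 4094 0
"""


def parse_free(free_text):
    """Map each labelled row of `free` output to its list of integer columns."""
    rows = {}
    for line in free_text.splitlines():
        label, sep, rest = line.partition(":")
        if sep:  # the header row has no colon and is skipped
            rows[label.strip()] = [int(value) for value in rest.split()]
    return rows


rows = parse_free(FREE_OUTPUT)
mem_total = rows["Mem"][0]
app_used = rows["-/+ buffers/cache"][0]  # used, excluding buffers and cache
swap_total, swap_used = rows["Swap"][0], rows["Swap"][1]

print(f"RAM used by processes: {100 * app_used / mem_total:.1f}%")
print(f"Swap used: {100 * swap_used / swap_total:.1f}%")
```

With only 6 MiB cached and all 4094 MiB of swap in use, there was nothing left for the kernel to reclaim at that point, which is when the oom-killer steps in.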
Update 2
I mentioned mysql earlier, but I would be surprised if that's the culprit, honestly. I would suspect spamd, the cPanel processes, or web applications running within Apache first.
I have also been assuming that you're running a reasonably current distro without any tweaking of system tunables and that you're current on security patches. There was a BIND exploit in the last few months that resulted in a DoS, but I cannot recall if the exploit triggered memory exhaustion or something else. I have also read of cPanel exploits recently, but I don't know how current those were.