How to diagnose causes of oom-killer killing processes
I have a small virtual private server running CentOS and www/mail/db, which has recently had a couple of incidents where the web server and ssh became unresponsive.
Looking at the logs, I saw that oom-killer had killed these processes, possibly due to running out of memory and swap.
Can anyone give me some pointers on how to diagnose what may have caused the most recent incident? Is the first process killed likely to be the culprit? Where else should I be looking?
Solution 1:
No, the first process killed is not necessarily the culprit; the algorithm is not that simplistic. You can find more information in:
http://linux-mm.org/OOM_Killer
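To get a feel for how the kernel currently ranks processes, you can read the per-process "badness" value from /proc/<pid>/oom_score. The following is only a minimal sketch, assuming a Linux /proc filesystem and the procps version of ps; the function name list_oom_scores is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the N processes with the highest OOM score
# (the ones the kernel would currently consider first as victims).
# Output columns: score, PID, command name.
list_oom_scores() {
    for dir in /proc/[0-9]*; do
        pid="${dir#/proc/}"
        # Processes can exit while we iterate, so skip anything unreadable.
        score="$(cat "$dir/oom_score" 2>/dev/null)" || continue
        name="$(ps -p "$pid" -o comm= 2>/dev/null)" || continue
        printf '%s\t%s\t%s\n' "$score" "$pid" "$name"
    done | sort -rn | head -n "${1:-10}"
}

list_oom_scores 10
```

A high score only means "likely victim", not "cause", which is why the process that gets killed is not always the one that went wrong.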
If you want to track memory usage, I'd recommend running a command like:
ps -e -o pid,user,pcpu,size,rss,cmd --sort -size,-rss | head
It will give you a list of the processes that are using the most memory (and probably causing the OOM situation). Remove the "| head" part if you'd prefer to check all the processes.
Run this from cron every 5 minutes and append the output to a file. Keep at least a couple of days of history, so you can check later what happened.
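A minimal sketch of such a cron job might look like the script below. The log path, the script location in the crontab comment, and the function name snapshot_memory are all assumptions for illustration; on a real server you would probably log under /var/log and let logrotate trim old data.

```shell
#!/bin/sh
# Hypothetical snapshot script. Example crontab entry (path is an assumption):
#   */5 * * * * /usr/local/bin/mem-snapshot.sh
# The default log file below is only a placeholder; override it with LOG_FILE.
LOG_FILE="${LOG_FILE:-/tmp/mem-snapshot.log}"

# Append a timestamped list of the biggest memory users to the given file.
snapshot_memory() {
    {
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        ps -e -o pid,user,pcpu,size,rss,cmd --sort -size,-rss | head -n 15
    } >> "$1"
}

snapshot_memory "$LOG_FILE"
```

After an incident, look at the last few snapshots before the time the oom-killer fired and see which process was growing.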
For critical services like ssh, I'd recommend using monit to restart them automatically in such a situation. It might save you from losing access to the machine if you don't have a remote console.
Best of luck,
João Miguel Neves
Solution 2:
I had a hard time with that recently, because the process(es) that the oom-killer stomps on aren't necessarily the ones that have gone awry. While trying to diagnose that, I learned about one of my now-favorite tools, atop.
This utility is like top on steroids. Over a pre-set time interval, it profiles system information. You can then play it back to see what was going on. It highlights processes that are 80%+ in blue and 90%+ in red. The most useful view is a memory usage table showing how much memory was allocated in the last time period. That's the one that helped me the most.
Fantastic tool -- can't say enough about it.
atop performance monitor
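As a rough sketch of the record-and-replay workflow, assuming atop is installed: the wrapper names and the file path in the comments are hypothetical, while -w, -r and -b are atop's options for writing a raw file, reading one back, and choosing the start time.

```shell
#!/bin/sh
# Hypothetical wrappers around atop; nothing runs until you call them.

# record_atop FILE [INTERVAL_SECONDS]
# Write samples to a raw file until interrupted.
record_atop() {
    atop -w "$1" "${2:-60}"
}

# replay_atop FILE [HH:MM]
# Open a recorded raw file, optionally starting at a given time.
# In the viewer: 'm' shows memory details, 't' steps forward, 'T' steps back.
replay_atop() {
    if [ -n "${2:-}" ]; then
        atop -r "$1" -b "$2"
    else
        atop -r "$1"
    fi
}

# Example (paths and times are assumptions):
#   record_atop /var/tmp/atop.raw 60 &
#   replay_atop /var/tmp/atop.raw 14:00
```

Many distribution packages also start a background atop that records daily files on its own, in which case you only need the replay step.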
Solution 3:
This article on taming oom-killer looks particularly useful. It seems you can set priorities to prevent oom-killer from killing certain processes (sshd would be a good start for a VPS!)
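As a hedged sketch of what that can look like: kernels that provide /proc/<pid>/oom_score_adj accept values from -1000 to 1000, where -1000 exempts the process, while older kernels expose /proc/<pid>/oom_adj instead, where -17 exempts it. The helper name protect_pid is made up for illustration, and lowering a score generally needs root.

```shell
#!/bin/sh
# Hypothetical helper: protect_pid PID [ADJ]
# ADJ defaults to -1000 (never kill). On old kernels that only have
# oom_adj, the fallback ignores ADJ and always writes -17.
protect_pid() {
    pid="$1"
    adj="${2:--1000}"
    if [ -w "/proc/$pid/oom_score_adj" ]; then
        echo "$adj" > "/proc/$pid/oom_score_adj"
    elif [ -w "/proc/$pid/oom_adj" ]; then
        echo -17 > "/proc/$pid/oom_adj"
    else
        echo "cannot adjust OOM settings for PID $pid" >&2
        return 1
    fi
}

# Protect every running sshd process; skip any we are not allowed to change.
for pid in $(pgrep -x sshd); do
    protect_pid "$pid" || echo "skipped PID $pid" >&2
done
```

The setting belongs to the running process, so it has to be reapplied after sshd restarts (for example from its init script).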