I can't ssh into a remote server when it runs out of memory even though the swap isn't fully used
I have a GoDaddy server that has been becoming unresponsive periodically. It was difficult to troubleshoot because I can't ssh into it when it becomes unresponsive. I figured out what was happening by adding a cron job that piped output from "top" to log files every 5 minutes. The next time I power cycled it after it became unresponsive, I checked those logs and found that the RAM was maxed out but the swap was mostly unused.
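Something like this in root's crontab does the logging (the log path is just an example):
*/5 * * * * /usr/bin/top -b -n 1 >> /var/log/top-snapshot.log 2>&1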
I'm working on reducing RAM usage by the two app servers on that machine (it turns out too many connections were being opened; each one used about 30 MB, so once around 40 were open the server ran out of RAM), but what I'd really like to know is how to ensure I can ssh into the machine.
If the swap file isn't full, then I'd think there'd be enough space for the server to respond, even if it did so slowly. Is there any way I can reserve a bit of RAM so that I can always ssh into the machine?
Here is an example of what it looks like when the server is running normally:
top - 15:13:21 up 3:12, 2 users, load average: 0.15, 0.30, 0.33
Tasks: 127 total, 1 running, 126 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.4%us, 1.8%sy, 0.0%ni, 95.7%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 2064980k total, 1611252k used, 453728k free, 45852k buffers
Swap: 2096472k total, 0k used, 2096472k free, 790212k cached
Here is the last top log that got logged before the server stopped running:
top - 14:45:08 up 15:20, 0 users, load average: 0.27, 0.16, 0.10
Tasks: 141 total, 2 running, 139 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.7%us, 1.9%sy, 0.0%ni, 95.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2064980k total, 2007652k used, 57328k free, 60496k buffers
Swap: 2096472k total, 100k used, 2096372k free, 689584k cached
Note that the cron job logging the "top" output also stops running when the server runs out of RAM, so apparently the whole server just grinds to a halt.
I have had similar issues to this before and they can be a nuisance to track down. As you haven't provided much information to go on, I'll have to spell out some things to check, and also what my issue turned out to be.
First, check your logs, most notably the output of dmesg (the kernel ring buffer, where the kernel dumps its log data). This is flushed out periodically to a file (or files) in /var/log, though exactly where depends on your OS; Red Hat, for example, has a /var/log/dmesg file. You are looking for anything unusual, especially relating to the OOM killer. This kills processes when RAM starts getting full in an attempt to keep the server up and responsive. sshd should be protected from this, but how that protection is set is distro specific. The modern way to specify an OOM exemption is to give sshd a score telling the kernel how precious it is to the server as a whole, which should put it far down the list of processes to kill if a critical RAM situation occurs. Your distro should have set this correctly.
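If you want to check how this is set on your box, the per-process knobs live under /proc; a rough sketch (on kernels older than 2.6.36 the file is oom_adj with a -17 to +15 range instead):
# how badly the kernel currently wants to kill the main sshd process
cat /proc/$(pgrep -x sshd | head -n1)/oom_score
# the adjustment; -1000 exempts the process from the OOM killer entirely
cat /proc/$(pgrep -x sshd | head -n1)/oom_score_adj
echo -1000 > /proc/$(pgrep -x sshd | head -n1)/oom_score_adj
Note that child processes inherit the adjustment, so sessions forked by that sshd will carry it too.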
The other thing to check is whether your server has enough entropy, which you can do with the following:
cat /proc/sys/kernel/random/entropy_avail
An OK value is above approximately 1000-1500; below that and you're running low. It only really goes up to around 4000-5000 on my machines (these figures are based on observations of my own servers).
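Since you already log top output from cron, you could record this value alongside it and see whether the entropy pool actually collapses when the box locks up; something like this (the log path is illustrative):
*/5 * * * * echo "$(date) $(cat /proc/sys/kernel/random/entropy_avail)" >> /var/log/entropy.log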
I have had issues logging in to servers where the entropy was so low (and being generated so slowly) that applications would hang waiting for more entropy to become available. There is an infamous Debian Exim bug that highlighted this: Exim on Debian used GnuTLS, which only read from /dev/random and used masses of entropy for each connection. See here. When the entropy was exhausted, Exim would just hang, and other programs that relied on entropy would start refusing connections as well.
As session keys are generated for every session, sshd needs a good source of random numbers. It should fall back to /dev/urandom for pseudo-random numbers if /dev/random is blocking, but I'm not sure whether sshd actually does this.
This issue can be quite severe on virtual systems, as a lot of the random number sources are not passed into the virtual machine. The main source of entropy is disk I/O, but this typically isn't passed into the VM, and hardware random number generators that may be embedded in the physical machine's chipset/CPU are also unlikely to be exposed to it.
This is a pretty good write up on the matter.
I run rngd in the background on my server to feed /dev/random with data from /dev/urandom with this:
rngd -r /dev/urandom -o /dev/random
This isn't a great solution but is a useful hack to keep things together while you look for better random number sources. I'm looking into feeding rngd with data from a different source but haven't had much chance to do so yet.
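If you try this, it's worth confirming the pool actually recovers, and adding the rngd line to whatever boot mechanism your distro uses (an init script, rc.local before its final exit 0, or your service manager) so it survives a reboot:
rngd -r /dev/urandom -o /dev/random
cat /proc/sys/kernel/random/entropy_avail    # should now stay up in the thousands rather than bottoming out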