SSHD Consuming 100% of CPU (Hundreds of processes) - Will not die

Recently I've noticed that sshd on a few systems I administer will start spawning unkillable processes that consume huge amounts of CPU.

The syscall shows that the processes are all 'running' and are not zombie or waiting for their parent to kill them (at least not as far as I can tell).

I've tried everything to kill these processes... the only reliable method I've found to date is to restart the whole server (which isn't ideal). I've tried switching out openssh-server for dropbear, but it doesn't behave the way I need it to for my applications.

I've tried:

killall -9 sshd

Killing each sshd by its PID, plus a few other miscellaneous things (htop + SIGTERM, etc.).

I'd love some ideas either for killing these processes or for solving what causes this.

If those are actually OpenSSH sshd processes, my guess would be that a script somewhere is running them in a broken loop and one of their child processes is hogging all that CPU.

As Marki555 suggested, strace would help you, but you should use strace -f so that strace will follow the child processes. From man strace:

   -f          Trace child processes as they are created by currently
               traced processes as a result of the fork(2) system call.

Because strace generates so much data, it might also be a good idea to use the -e argument as well (for example, to only show open() calls):

   -e expr     A qualifying expression which modifies which  events
               to  trace  or  how to trace them.  The format of the
               expression is:
                         [qualifier=][!]value1[,value2]...
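
Putting the two options together, a minimal sketch of attaching to one of the runaway processes might look like the following. The function name trace_opens, the DRY_RUN switch and the default log path are hypothetical conveniences, not part of strace. openat is listed next to open on the assumption that your libc implements open() through it, as recent glibc versions do.

```shell
# Attach strace to an already-running PID, follow its children (-f),
# and log only file-open syscalls (-e trace=...) to a file (-o).
# trace_opens, DRY_RUN and the default log path are illustrative names.
trace_opens() {
    pid=$1
    log=${2:-/tmp/strace.$pid.log}
    # With DRY_RUN set, print the command instead of running it.
    ${DRY_RUN:+echo} strace -f -e trace=open,openat -o "$log" -p "$pid"
}
```

Running `DRY_RUN=1 trace_opens 1234` only prints the strace command that would be executed, which is a cheap way to check it before attaching. Attaching to a process you do not own normally requires root.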

Another command you can try is ps xaf or pstree -a, which gives an easy-to-understand tree view of processes and their child processes so that you can determine which process started those sshd processes. lsof might also help: it will tell you what files a process has open.
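
If the tree view is too noisy with hundreds of sshd entries, the same parent information can be followed for a single PID. The sketch below assumes rows of "PID PPID COMMAND" on standard input. The function name ancestors is made up for this example, and only the first word of the command name is kept.

```shell
# Print the chain of ancestors of PID $1, one "PID COMMAND" per line,
# reading "PID PPID COMMAND" rows from standard input.
# "ancestors" is an illustrative name, not a standard tool.
ancestors() {
    awk -v pid="$1" '
        { ppid[$1] = $2; cmd[$1] = $3 }       # remember parent and name
        END {
            while (pid in ppid) {             # stop at the first unknown PID
                printf "%s %s\n", pid, cmd[pid]
                if (pid == ppid[pid]) break   # guard against a self-parent row
                pid = ppid[pid]
            }
        }'
}
```

A typical call would be `ps -eo pid=,ppid=,comm= | ancestors 1234`, where 1234 stands in for one of the runaway sshd PIDs. The last lines of the output show which process ultimately launched it.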

And of course, make sure you're using the latest OpenSSH. I'm thinking rsync + big files + ssh on an ancient OpenSSH 3.4p1 would be problematic.

If those are NOT really sshd processes, then an MD5 checksum of the binary on disk might come out correct even though that binary is not the program actually running. The md5sum command itself could also be a backdoored version, modified to report the correct checksum for certain files (like sshd).

You should take a look at /proc/[sshd pid]/exe and make sure it's a symlink to /usr/sbin/sshd (or wherever your sshd is), as well as /proc/[sshd pid]/environ to see what environment variables it's using, and /proc/[sshd pid]/cmdline to see what command actually started it.
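
Those three /proc entries can be dumped in one go with something like the sketch below. inspect_pid and the PROC_ROOT override are hypothetical names, and the override exists only so the function can be pointed at a copy of /proc. cmdline and environ are NUL-separated, so they are translated into something readable.

```shell
# Show what binary a PID is really running, how it was started,
# and with which environment. inspect_pid and PROC_ROOT are
# illustrative names; PROC_ROOT defaults to the real /proc.
inspect_pid() {
    dir="${PROC_ROOT:-/proc}/$1"
    printf 'exe:     %s\n' "$(readlink "$dir/exe")"
    # Arguments are NUL-separated; show them on one line.
    printf 'cmdline: %s\n' "$(tr '\0' ' ' < "$dir/cmdline")"
    # Environment variables are NUL-separated; show one per line.
    printf 'environ:\n'
    tr '\0' '\n' < "$dir/environ"
}
```

Run it as root, for example `inspect_pid 1234`, since the exe link and environ of another user's process are usually not readable otherwise.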

An attacker could, however, have renamed the malicious program to "sshd" and then executed it so that it appears to be sshd. They could even have moved /usr/sbin/sshd to /tmp/sshd and put the malicious sshd at /usr/sbin/sshd to hide it from that type of /proc analysis. But once /tmp/sshd is moved back to /usr/sbin/sshd, the /proc/[sshd pid]/exe symlink of the still-running malicious process will show in ls as:

   lrwxrwxrwx 1 root root 0 May 19 06:47 /proc/[sshd pid]/exe -> /usr/sbin/sshd (deleted)
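
Checking that link by hand for hundreds of processes is tedious, so here is a rough sketch that flags any PID whose exe link is marked as deleted or points somewhere other than the expected path. check_exe and PROC_ROOT are made-up names for illustration. A "(deleted)" marker is not proof of compromise on its own, because a package upgrade that replaces sshd leaves already-running processes in the same state.

```shell
# Usage: check_exe EXPECTED_PATH PID...
# Flags PIDs whose /proc/PID/exe is marked deleted or is not EXPECTED_PATH.
# check_exe and PROC_ROOT are illustrative names.
check_exe() {
    expected=$1
    shift
    for pid in "$@"; do
        target=$(readlink "${PROC_ROOT:-/proc}/$pid/exe" 2>/dev/null) || {
            echo "$pid: unreadable"
            continue
        }
        case $target in
            *"(deleted)"*) echo "$pid: SUSPICIOUS, binary was deleted or replaced: $target" ;;
            "$expected")   echo "$pid: ok" ;;
            *)             echo "$pid: SUSPICIOUS, unexpected binary: $target" ;;
        esac
    done
}
```

For example, `check_exe /usr/sbin/sshd $(pgrep -x sshd)` checks every process named sshd. Keep in mind that on a system with a rootkit, readlink, pgrep and /proc itself may be lying.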

Also, if those sshd processes are doing something to actively prevent normal process analysis, you could try kill -STOP instead of kill -9 to "pause" the process, and kill -CONT to resume it (see http://en.wikipedia.org/wiki/Job_control_%28Unix%29#Implementation).
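
As a small sketch of that pause-and-inspect approach (pause_pids and resume_pids are hypothetical helper names, thin wrappers around kill):

```shell
# Freeze the given PIDs so they stop burning CPU while you inspect them.
# SIGSTOP, like SIGKILL, cannot be caught or ignored by the target.
pause_pids()  { kill -STOP "$@"; }

# Let the same PIDs run again once you are done looking.
resume_pids() { kill -CONT "$@"; }
```

While a process is stopped, `ps -o pid,stat,args -p PID` should show a T in the STAT column. If you are logged in over SSH yourself, leave the sshd processes that serve your own session out of the list, or stopping them will freeze your connection.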

However, if an attacker has root privileges, then a rootkit could be installed that hides from /proc, netstat, ls, etc. If you're really compromised, the best course of action is to take the system offline, mount its partitions on another (clean) system, and do the forensics there (or use one of those live Linux CDs made for forensics).
