ps aux hanging on high cpu/IO with java processes

In general, I've seen this happen because of a stalled read, and your strace output confirms it: the attempt to read the /proc/xxxx/cmdline file hangs while you're running the ps aux command.
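If you want to confirm the stall yourself, re-running ps under strace with timestamps shows exactly where it blocks (the output file name here is arbitrary):

strace -tt -o /tmp/ps.trace ps aux
grep -B1 -A1 cmdline /tmp/ps.trace   # look for a read() on /proc/<pid>/cmdline followed by a long time gap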

The momentary spikes in I/O are starving the system's resources. A load of 90-160 is extremely bad news if it's storage subsystem-related.

For the storage array, can you tell us if there's a hardware RAID controller in place? Is the primary application on the server write-biased? The disks you mention (12 x 4TB) are lower-speed nearline SAS or SATA disks. If there's no form of write caching in front of the drive array, writes are capable of pushing the system load way up. If these are pure SATA drives on a Supermicro backplane, don't discount the possibility of other disk problems (timeouts, failing drive, backplane, etc.) Does this happen on all Hadoop nodes?
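As a quick sanity check for drive or backplane trouble, something along these lines is a start (device names are examples; smartctl assumes the smartmontools package is installed):

dmesg | grep -iE 'error|timeout|reset' | tail -50   # recent SCSI/SATA complaints
smartctl -H /dev/sda                                # per-drive health summary, repeat per disk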

An easy test is to try to run iotop while this is happening. Also, since this is EL6.5, do you have any of the tuned-adm settings enabled? Are write barriers enabled?
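Roughly, and assuming iotop is installed (the barrier check assumes ext4; barriers are on by default in EL6 unless barrier=0 or nobarrier was passed at mount time):

iotop -o                    # only show processes actually doing I/O
tuned-adm active            # which tuned profile, if any, is applied
grep barrier /proc/mounts   # see whether any filesystem was mounted with barrier options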

If you haven't changed the server's I/O elevator, ionice may have an impact. If you've changed it to anything other than CFQ (this server should probably be on deadline), ionice won't make any difference.
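To check which elevator is active and switch it on the fly (sda is just an example device; repeat per data disk, and this does not persist across a reboot):

cat /sys/block/sda/queue/scheduler               # the active elevator is shown in brackets
echo deadline > /sys/block/sda/queue/scheduler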

Edit:

One other weird thing I've seen in production environments. These are Java processes, and I'll assume they're heavily multithreaded. How are you doing on PIDs? What's the sysctl value for kernel.pid_max? I've had situations where I've exhausted PIDs before and had a resulting high load.
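A quick way to compare the ceiling against what's actually in use (every Java thread consumes a PID, so count threads, not processes):

sysctl kernel.pid_max   # default ceiling is 32768
ps -eLf | wc -l         # rough count of threads/PIDs currently in use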

Also, you mention kernel version 2.6.32-358.23.2.el6.x86_64. That's over a year old and part of the CentOS 6.4 release, but the rest of your server is 6.5. Did you blacklist kernel updates in yum.conf? You should probably be on kernel 2.6.32-431.x.x or newer for that system. There could be a transparent hugepages issue with the older kernel you have. If you can't change the kernel, try disabling them with:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
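To check the current setting and make the change survive a reboot on EL6, appending to /etc/rc.local is one common approach (a custom tuned profile would also work):

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled    # the active value is shown in brackets
echo 'echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled' >> /etc/rc.local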


The problem is clearly not a disk-related problem, and the hung strace shows it:

open("/proc/18679/cmdline", O_RDONLY)   = 5
read(5,

/proc is an interface between the kernel and userspace; it does not touch the disk at all. If something hangs while reading a command's arguments, it is usually a kernel-related problem and unlikely to be a storage one. See @kasperd's comment.

The load is just a side effect of the problem, and the high number does not tell the full story: you could have a server with a very high load on which the application behaves without any glitch.

You can gain more information about what is happening with cat /proc/$PID/stack, where $PID is the process ID of the process whose read stalls.
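For example, to find the stuck ps (or anything else sitting in uninterruptible D state) and dump its kernel stack (needs root):

ps -eo pid,stat,comm | awk '$2 ~ /^D/'   # processes in uninterruptible sleep
cat /proc/$PID/stack                     # substitute the stalled PID found above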

In your case I would start with a kernel upgrade.