Can high load cause a server to hang with the error "blocked for more than 120 seconds"?

I am currently running a few VMs and 'bare-metal' servers. Java is running with high CPU usage, over 400% at times. Randomly, the server hangs with errors on the console like "java - blocked for more than 120 seconds" (the same for kjournald and other tasks).

I cannot get dmesg output because for some reason this error is only written to the console, which I don't have access to since the machine is remotely hosted; therefore I cannot copy a full trace.

I have changed the environment this runs in, even moving to a physical server, and it is still happening.

I changed hung_task_timeout_secs to 0 in case this is a false positive, as per http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Technical_Notes/deployment.html .
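
For reference, this is the same knob the console message itself points at; either form below should be equivalent (the sysctl name is assumed from the /proc path):

echo 0 > /proc/sys/kernel/hung_task_timeout_secs
sudo sysctl -w kernel.hung_task_timeout_secs=0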

Also, irqbalance is not installed; perhaps installing it would help?

This is Ubuntu 10.04 64-bit; the same issue occurs with the latest 2.6.38-15-server kernel and with 2.6.36.

Could CPU or memory issues, or running out of swap, cause this?

Here is the console message (transcribed by hand from the console, so some timestamp digits may be inaccurate):

[58Z?Z1.5?Z840] INFO: task java:21547 blocked for more than 120 seconds.
[58Z?Z1.5?Z986] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58Z841.5?Z06Z] INFO: task kjournald:190 blocked for more than 120 seconds.
[58Z841.5?Z336] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58Z841.5?Z600] INFO: task flush-202:0:709 blocked for more than 120 seconds.
[58Z841.5?Z90?] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58Z841.5?3413] INFO: task java:21547 blocked for more than 120 seconds.
[58Z841.5?368Z] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58Z961.5?ZZ36] INFO: task kjournald:60 blocked for more than 120 seconds.
[58Z961.5?Z6Z5] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58Z961.5?31ZZ] INFO: task flush-202:0:709 blocked for more than 120 seconds.
[58Z961.5?3393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Solution 1:

Yes, it could.

What this means is fairly explicit: the task spent more than 120 seconds in uninterruptible sleep, i.e. the kernel could not schedule it for that long. This indicates resource starvation, often around disk access.
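
If you can get a shell while the machine is stalled, a rough way to confirm this is to look for tasks stuck in uninterruptible sleep (state D); a sketch using standard tools, though the exact ps column options may vary by version:

ps -eo state,pid,wchan:30,cmd | awk '$1 == "D"'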

irqbalance might help, but it is not an obvious fix here. Can you provide us with the context surrounding this message in dmesg, in particular the stack trace that follows it?
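
Even without console access, these messages often land in the kernel log file, and you can ask the kernel to dump the stacks of all blocked tasks on demand via magic SysRq; a sketch, assuming the standard Ubuntu log location and SysRq enabled:

grep -A 30 "blocked for more than 120 seconds" /var/log/kern.log
echo w | sudo tee /proc/sysrq-trigger   # dumps blocked (D state) tasks to the kernel log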

Moreover, this is not a false positive. The message does not say that the task is hung forever, and the statement is perfectly correct: the task really was blocked for more than 120 seconds. That doesn't mean it's a problem for you, and you can decide to ignore it if you don't notice any user impact.

This cannot be caused by:

  • a CPU issue (or rather, that would be an insanely improbable hardware failure),
  • a memory issue (a hardware failure is very improbable, and would not keep recurring like this; it is not a lack of RAM, as a process would be OOM-killed instead),
  • a lack of swap (the OOM killer again).

To an extent, you might be able to blame this on a lack of memory, in the sense that depriving your system of data caching in RAM will cause more I/O. But it's not as straightforward as "running out of memory". You can check that theory as shown below.
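
A minimal way to watch for swap traffic and I/O wait while the problem is happening, using standard tools only:

free -m     # overall memory and swap usage
vmstat 5    # si/so columns show swapping, wa shows time spent waiting on I/O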

Solution 2:

sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5

These take effect immediately, but sysctl -w does not survive a reboot. To make them permanent, add the settings to /etc/sysctl.conf (see below) and reload with:

sudo sysctl -p
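
Lowering vm.dirty_ratio and vm.dirty_background_ratio makes the kernel start writing dirty pages back earlier and in smaller batches, which keeps writeback stalls shorter. The lines to add to /etc/sysctl.conf would be:

vm.dirty_ratio = 10
vm.dirty_background_ratio = 5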

This solved it for me.