Log - Server kernel: INFO: task httpd:000000 blocked for more than 120 seconds

Almost every day my server crashes due to high server load, and even restarting Apache or MySQL doesn't solve the problem. I need to reboot the server to fix it, or it crashes again due to the high load.

The system log records something like this when it crashes:

Aug 11 18:33:53 server kernel: INFO: task httpd:20008 blocked for more than 120 seconds.
Aug 11 18:33:53 server kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 11 18:33:53 server kernel: httpd         D ffffffff801538ac     0 20008   5816         20066 19809 (NOTLB)
Aug 11 18:33:53 server kernel:  ffff81025a299dc8 0000000000000082 ffff81033b4c0740 ffffffff80009a14
Aug 11 18:33:53 server kernel:  ffff8101063f8d80 0000000000000009 ffff8100b758f7e0 ffff8101c57187e0
Aug 11 18:33:53 server kernel:  00009436d4100b6c 000000000001d50f ffff8100b758f9c8 000000083b531588
Aug 11 18:33:53 server kernel: Call Trace:
Aug 11 18:33:53 server kernel:  [<ffffffff80009a14>] __link_path_walk+0x173/0xfb9
Aug 11 18:33:53 server kernel:  [<ffffffff8002cc16>] mntput_no_expire+0x19/0x89
Aug 11 18:33:53 server kernel:  [<ffffffff80063c4f>] __mutex_lock_slowpath+0x60/0x9b
Aug 11 18:33:53 server kernel:  [<ffffffff80023908>] __path_lookup_intent_open+0x56/0x97
Aug 11 18:33:53 server kernel:  [<ffffffff80063c99>] .text.lock.mutex+0xf/0x14
Aug 11 18:33:53 server kernel:  [<ffffffff8001b21f>] open_namei+0xea/0x712
Aug 11 18:33:54 server kernel:  [<ffffffff8002768a>] do_filp_open+0x1c/0x38
Aug 11 18:33:54 server kernel: Firewall: *UDP_IN Blocked* IN=eth1 OUT= MAC=ff:ff:ff:ff:ff:ff:00:30:48:9e:6e:99:08:00 SRC=208.43.135.158 DST=255.255.255.255 LEN=151 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=UDP SPT=38354 DPT=6112 LEN=131 
Aug 11 18:33:54 server kernel:  [<ffffffff8001a061>] do_sys_open+0x44/0xbe
Aug 11 18:33:54 server kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0

I googled a lot trying to find a solution, but it looks like the only fix is to update the kernel or the disk driver, things that I don't know how to do.

At this URL, http://bugs.centos.org/view.php?id=4515, a lot of people report similar problems, except that theirs are not related to httpd like mine.

According to one member, one solution would be to add "elevator=noop" to /etc/grub.conf, as in this example:

title CentOS (2.6.18-238.12.1.el5xen)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-238.12.1.el5xen ro root=/dev/VolGroup00/LogVol00 elevator=noop
        initrd /initrd-2.6.18-238.12.1.el5xen.img

Would this really solve the problem? My disks are in a RAID array. Could this change cause problems for my server?

Is there any other solution?


This is because of a mutex lock.

Check the printed stack trace carefully. It reads from the bottom up: the oldest call is at the bottom and the most recent one is at the top. You will find this line:

__mutex_lock_slowpath

It seems there is a resource crunch: the httpd process is waiting on a lock while trying to open a file.
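To make the reading order of the trace easier to follow, here is a minimal Python sketch that pulls the symbol names out of a pasted log and lists them with the outermost caller first. The function name `call_chain` and the regular expression are my own illustration, not part of any tool, and the sample is a shortened copy of the log in the question. Treat the result as a rough reading aid only, since a kernel stack dump may contain entries that are not part of the live call chain.

```python
import re

# Matches a kernel call-trace entry such as
# "[<ffffffff80063c4f>] __mutex_lock_slowpath+0x60/0x9b".
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+([^\s+]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def call_chain(log_text):
    """Return the symbols from a call trace, outermost caller first."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append(match.group(1))
    # The kernel prints the most recent frame first, so reverse the list.
    frames.reverse()
    return frames


# Shortened sample taken from the log in the question.
SAMPLE = """\
Aug 11 18:33:53 server kernel:  [<ffffffff80063c4f>] __mutex_lock_slowpath+0x60/0x9b
Aug 11 18:33:53 server kernel:  [<ffffffff8001b21f>] open_namei+0xea/0x712
Aug 11 18:33:54 server kernel: Firewall: *UDP_IN Blocked* IN=eth1 OUT=
Aug 11 18:33:54 server kernel:  [<ffffffff8001a061>] do_sys_open+0x44/0xbe
"""

if __name__ == "__main__":
    print(" -> ".join(call_chain(SAMPLE)))
```

Unrelated lines that land in the middle of the trace, like the firewall entry, are skipped because they do not match the frame pattern.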

Sysstat, as suggested, is a good profiling tool in most cases. To get to the root of the issue, you will need a vmcore (a kernel memory dump). There are two relevant /proc files:

/proc/sys/kernel/hung_task_timeout_secs
/proc/sys/kernel/hung_task_panic

The value of the first file is 120, which is why the messages say the task has been blocked for more than 120 seconds. A trivial test is to increase it to 240 or 360 and see what happens.

The second file has a default value of 0. It needs to be 1 if you want to collect a vmcore: with 1, the kernel panics when it detects a hung task, and that panic is what triggers the dump.

Obviously, you also need to set up kdump and configure the dump target, which should be larger than the physical memory size. But even if you collect the vmcore, you will need some C, assembly, and general debugging knowledge to make sense of it. Professional support or an experienced sysadmin can help more here.
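The two /proc files mentioned above can be read and changed with a few lines of Python. This is only a sketch: the helper names `read_setting` and `write_setting` and the `base` parameter are hypothetical, added so the code can be tried against a scratch directory first. Writing to the real files needs root.

```python
from pathlib import Path

# Standard location of the hung-task settings on Linux.
KERNEL_SYSCTL_DIR = Path("/proc/sys/kernel")


def read_setting(name, base=KERNEL_SYSCTL_DIR):
    """Return the integer value stored in base/name."""
    return int((Path(base) / name).read_text().strip())


def write_setting(name, value, base=KERNEL_SYSCTL_DIR):
    """Write an integer value to base/name (needs root for the real /proc)."""
    (Path(base) / name).write_text(f"{int(value)}\n")


if __name__ == "__main__":
    for name in ("hung_task_timeout_secs", "hung_task_panic"):
        try:
            print(name, "=", read_setting(name))
        except OSError as exc:
            # The files are absent on kernels without hung-task detection.
            print(name, "not readable:", exc)
```

For example, running `write_setting("hung_task_timeout_secs", 240)` as root raises the timeout to 240 seconds. Values written this way are lost on reboot.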

But IMO, changing the elevator won't affect anything here.
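If you still want to see which elevator a disk is using, before or after editing grub.conf, the kernel exposes it in sysfs with the active one in square brackets. A minimal sketch, assuming the disk is called `sda` (substitute your own device); the helper name `active_scheduler` is made up for this example.

```python
import re
from pathlib import Path


def active_scheduler(scheduler_text):
    """Return the bracketed (active) name from a sysfs scheduler line."""
    match = re.search(r"\[([^\]]+)\]", scheduler_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # "sda" is an assumed device name; substitute your own disk.
    path = Path("/sys/block/sda/queue/scheduler")
    try:
        print(active_scheduler(path.read_text()))
    except OSError as exc:
        print("could not read", path, ":", exc)
```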