Ubuntu load average spikes but CPUs are idle

We have a server on a cloud network, provided by a third party. We are running Ubuntu 10.04 Server Edition.

The problem happens at seemingly random times, around one to three times a day. The load average in top usually sits around 2 and the server runs peachy, but at these random times, the load average spikes to around 30-35 and everything grinds to a halt. Can't access our websites, can't execute commands on the server, can't do anything. Can't even log in if you weren't already logged in.

The only way we have been able to see the high load averages is by constantly running top so that it is already running when the problem occurs. Seems that if it's already running, it will keep working properly, but if it isn't running, you can't start it. The inability to run any commands when it gets into this state has been making it difficult for us to diagnose the problem...plus we do not consider ourselves server experts by any means.

The thing that looks weird to me is, the load average spikes so high, but the processors stay idle, and there is plenty of free memory. Again, I'm not an expert at all, but my very basic understanding was that if memory is availble and the processors aren't maxed out, then there shouldn't be processes waiting (very possible I'm mistaken on that).

As I was typing this, I caught it as it was starting to spike and managed to run a few commands before everything locked up. Output is as follows:

uname -a

Linux <server name> 2.6.32-308-ec2 #16-Ubuntu SMP Thu Sep 16 14:28:38 UTC 2010 i686 GNU/Linux

top

top - 10:55:08 up 15:28,  4 users,  load average: 12.29, 7.01, 3.89
Tasks: 313 total,   3 running, 308 sleeping,   0 stopped,   2 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4210704k total,  2163024k used,  2047680k free,   162320k buffers
Swap:  2096440k total,        0k used,  2096440k free,  1690464k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    1 root      20   0  2804 1644 1204 S    0  0.0   0:00.18 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 R    0  0.0   0:00.08 migration/0
    4 root      20   0     0    0    0 S    0  0.0   0:00.01 ksoftirqd/0
    5 root      RT   0     0    0    0 R    0  0.0   0:00.01 watchdog/0
    6 root      20   0     0    0    0 S    0  0.0   0:00.06 events/0

ps axf

  PID TTY      STAT   TIME COMMAND
    2 ?        S      0:00 [kthreadd]
    3 ?        R      0:00  \_ [migration/0]
    4 ?        S      0:00  \_ [ksoftirqd/0]
    5 ?        R      0:00  \_ [watchdog/0]
    6 ?        S      0:00  \_ [events/0]
    7 ?        S      0:00  \_ [cpuset]
    8 ?        S      0:00  \_ [khelper]

<removed a bunch of processes to save space in the post, all had status S>

    1 ?        Ss     0:00 /sbin/init
  245 ?        S      0:00 upstart-udev-bridge --daemon
  251 ?        S /var/run/motd.new
25554 ?        S      0:00          \_ run-parts --lsbsysinit /etc/update-motd.d
25558 ?        S      0:00              \_ /bin/sh /etc/update-motd.d/10-help-text
25560 ?        D      0:00                  \_ /bin/sh /etc/update-motd.d/10-help-text
  852 ?        Ss     0:00 cron
 1374 ?        S      0:00  \_ CRON
 1377 ?        Ss     0:00  |   \_ /bin/sh -c /var/www/secure/caddy2_prod/scripts/main.pl
 1379 ?        S      0:02  |       \_ /usr/bin/perl /var/www/secure/caddy2_prod/scripts/main.pl
 1385 ?        Z      0:00  |           \_ [check.pl] 
 1375 ?        S      0:00  \_ CRON
 1376 ?        Ss     0:00      \_ /bin/sh -c /var/www/secure/caddy2_test/scripts/main.pl
 1378 ?        S      0:00          \_ /usr/bin/perl /var/www/secure/caddy2_test/scripts/main.pl
 1384 ?        Z      0:00              \_ [check.pl] 
  855 ?        Ss     0:00 atd
  868 ?        Ssl    6:36 /usr/sbin/mysqld
  890 ?        S      0:00 /bin/bash /usr/sbin/xe-daemon -p /var/run/xe-daemon.pid
25563 ?        S      0:00  \_ /bin/sh /usr/sbin/xe-update-guest-attrs --memory
25564 ?        D      0:00      \_ /bin/sh /usr/sbin/xe-update-guest-attrs --memory
 1161 ?        Ss     0:00 /usr/lib/postfix/master
 3102 ?        S      0:00  \_ qmgr -l -t fifo -u
22013 ?        S      0:00  \_ pickup -l -t fifo -u -c
 1181 ?        Ssl    3:17 /usr/sbin/asterisk -p -U asterisk
 1182 ?        S      0:00  \_ astcanary /var/run/asterisk/alt.asterisk.canary.tweet.tweet.tweet
 1222 ?        Ss     0:00 /usr/sbin/apache2 -k start
31682 ?        S      0:01  \_ /usr/sbin/apache2 -k start
31716 ?        S      0:01  \_ /usr/sbin/apache2 -k start
13548 ?        S      0:00  \_ /usr/sbin/apache2 -k start
25593 ?        S      0:00  |   \_ /usr/bin/perl -w /usr/lib/cgi-bin/caddy2/patch.pl
25594 ?        D      0:00  |       \_ /usr/bin/perl -w /usr/lib/cgi-bin/caddy2/patch.pl
13637 ?        S      0:00  \_ /usr/sbin/apache2 -k start
16061 ?        S      0:00  \_ /usr/sbin/apache2 -k start
23116 ?        S      0:00  \_ /usr/sbin/apache2 -k start
25565 ?        D      0:00  |   \_ /usr/sbin/apache2 -k start
23117 ?        S      0:00  \_ /usr/sbin/apache2 -k start
23118 ?        S      0:00  \_ /usr/sbin/apache2 -k start
23119 ?        S      0:00  \_ /usr/sbin/apache2 -k start
23121 ?        S      0:00  \_ /usr/sbin/apache2 -k start
 1268 tty1     Ss+    0:00 /sbin/getty -8 38400 tty1
 1396 ?        S      0:00 /usr/local/caddy2/servers/test/caddy2serverd localhost caddy2test 1981

<removed a bunch of processes like the one above to save space in the post, there were about 100, all with status S>

25590 ?        S      0:00  \_ /usr/local/caddy2/servers/prod/caddy2serverd localhost caddy2prod 1991
25538 ?        D      0:00 /bin/bash ./impsys-snap.sh nohup
25596 ?        Ss     0:00 /sbin/getty -L hvc0 9600 linux

I do notice that there are a few process in the D state, which I think indicates that it is a zombie process. I have no idea if those are the cause of the issue, or what the difference is between processes in D state vs. processes in Z state.

If we think that these are what's causing the problem, what can I do about it? I have no idea what causes a process to go into the D state, and therefore have no idea how to prevent it from happening.

Your help is greatly appreciated. Thanks!

UPDATE:

I looked at our kern.log and found that it is littered with messages like these:

<removed to clean up post, further detail added below>

The timestamp on some of these seems to coincide with the time that the server locks up on us, so I'm think that has something to do with it. We have passed this information to our server provider as well, but does this indicate anything useful? If so, does it indicate a problem on my end, or my server provider's end?

UPDATE 2:

Here is the kern.log for the entire time that seemed relevant. At 10:52:24, the load average started going up. I got the output of the ps axf at about 10:54:02. After a minute or two (possibly exactly two minutes, if 10:56:02 indicates it), the system became unresponsive and I could not execute commands. Here is the log:


Mar 25 08:08:57 cloud kernel: [45483.026983] INFO: task apache2:9642 blocked for more than 120 seconds.
Mar 25 08:08:57 cloud kernel: [45483.026986] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 08:08:57 cloud kernel: [45483.026989] apache2       D ea63de60     0  9642   9068 0x00000000
Mar 25 08:08:57 cloud kernel: [45483.026992]  ea63de78 00000282 ea63ded4 ea63de60 c01d9096 00000000 00000000 00000000
Mar 25 08:08:57 cloud kernel: [45483.026996]  c06f61c0 ea1a6fac c06f61c0 c06f61c0 c06f61c0 ea1a6fac c06f61c0 c06f61c0
Mar 25 08:08:57 cloud kernel: [45483.027000]  c1305740 0000296f ea1a6d00 ec844d00 9a352c09 0003c456 ea63dea0 c0106c51
Mar 25 08:08:57 cloud kernel: [45483.027003] Call Trace:
Mar 25 08:08:57 cloud kernel: [45483.027006]  [] ? __link_path_walk+0x626/0xc20
Mar 25 08:08:57 cloud kernel: [45483.027010]  [] ? sched_clock+0x21/0x80
Mar 25 08:08:57 cloud kernel: [45483.027013]  [] schedule_timeout+0x175/0x250
Mar 25 08:08:57 cloud kernel: [45483.027018]  [] ? sched_clock_cpu+0x14d/0x190
Mar 25 08:08:57 cloud kernel: [45483.027021]  [] ? find_idlest_group+0xa8/0x1b0
Mar 25 08:08:57 cloud kernel: [45483.027023]  [] wait_for_common+0xc6/0x180
Mar 25 08:08:57 cloud kernel: [45483.027026]  [] ? default_wake_function+0x0/0x10
Mar 25 08:08:57 cloud kernel: [45483.027028]  [] wait_for_completion+0x12/0x20
Mar 25 08:08:57 cloud kernel: [45483.027031]  [] sched_migrate_task+0xe4/0x100
Mar 25 08:08:57 cloud kernel: [45483.027033]  [] sched_exec+0x3b/0x50
Mar 25 08:08:57 cloud kernel: [45483.027036]  [] do_execve+0xc4/0x360
Mar 25 08:08:57 cloud kernel: [45483.027038]  [] sys_execve+0x28/0x60
Mar 25 08:08:57 cloud kernel: [45483.027041]  [] syscall_call+0x7/0xb
Mar 25 09:27:03 cloud kernel: [50344.466167] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Mar 25 09:27:03 cloud kernel: [50344.466452] CONFIG_NF_CT_ACCT is deprecated and will be removed soon. Please use
Mar 25 09:27:03 cloud kernel: [50344.466454] nf_conntrack.acct=1 kernel parameter, acct=1 nf_conntrack module option or
Mar 25 09:27:03 cloud kernel: [50344.466455] sysctl net.netfilter.nf_conntrack_acct=1 to enable it.
Mar 25 10:52:24 cloud kernel: [55167.785176] BUG: soft lockup - CPU#0 stuck for 61s! [swapper:0]
Mar 25 10:52:24 cloud kernel: [55167.785202] Modules linked in: nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 scsi_mod lp parport
Mar 25 10:52:24 cloud kernel: [55167.785217] 
Mar 25 10:52:24 cloud kernel: [55167.785221] Pid: 0, comm: swapper Not tainted (2.6.32-308-ec2 #16-Ubuntu) 
Mar 25 10:52:24 cloud kernel: [55167.785224] EIP: 0061:[] EFLAGS: 00000246 CPU: 0
Mar 25 10:52:24 cloud kernel: [55167.785228] EIP is at 0xc01013a7
Mar 25 10:52:24 cloud kernel: [55167.785230] EAX: 00000000 EBX: 00000001 ECX: 00000000 EDX: c0689f58
Mar 25 10:52:24 cloud kernel: [55167.785232] ESI: c06beb08 EDI: a8b0d3bc EBP: c0689f78 ESP: c0689f70
Mar 25 10:52:24 cloud kernel: [55167.785235]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
Mar 25 10:52:24 cloud kernel: [55167.785242] CR0: 8005003b CR2: b6620000 CR3: 2a65e000 CR4: 00002660
Mar 25 10:52:24 cloud kernel: [55167.785247] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Mar 25 10:52:24 cloud kernel: [55167.785250] DR6: ffff0ff0 DR7: 00000400
Mar 25 10:52:24 cloud kernel: [55167.785252] Call Trace:
Mar 25 10:52:24 cloud kernel: [55167.785259]  [] ? xen_safe_halt+0x15/0x40
Mar 25 10:52:24 cloud kernel: [55167.785264]  [] xen_idle+0x29/0x80
Mar 25 10:52:24 cloud kernel: [55167.785267]  [] cpu_idle+0x8f/0xc0
Mar 25 10:52:24 cloud kernel: [55167.785272]  [] rest_init+0x53/0x60
Mar 25 10:52:24 cloud kernel: [55167.785278]  [] start_kernel+0x379/0x37f
Mar 25 10:52:24 cloud kernel: [55167.785282]  [] ? unknown_bootoption+0x0/0x1a0
Mar 25 10:52:24 cloud kernel: [55167.785286]  [] i386_start_kernel+0x67/0x6e
Mar 25 10:53:30 cloud kernel: [55233.281412] BUG: soft lockup - CPU#0 stuck for 61s! [swapper:0]
Mar 25 10:53:30 cloud kernel: [55233.281421] Modules linked in: nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 scsi_mod lp parport
Mar 25 10:53:30 cloud kernel: [55233.281444] 
Mar 25 10:53:30 cloud kernel: [55233.281449] Pid: 0, comm: swapper Not tainted (2.6.32-308-ec2 #16-Ubuntu) 
Mar 25 10:53:30 cloud kernel: [55233.281453] EIP: 0061:[] EFLAGS: 00000246 CPU: 0
Mar 25 10:53:30 cloud kernel: [55233.281457] EIP is at 0xc01013a7
Mar 25 10:53:30 cloud kernel: [55233.281460] EAX: 00000000 EBX: 00000001 ECX: 00000000 EDX: c0689f58
Mar 25 10:53:30 cloud kernel: [55233.281463] ESI: c06beb08 EDI: a8b0d3bc EBP: c0689f78 ESP: c0689f70
Mar 25 10:53:30 cloud kernel: [55233.281466]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
Mar 25 10:53:30 cloud kernel: [55233.281474] CR0: 8005003b CR2: 09827024 CR3: 013c4000 CR4: 00002660
Mar 25 10:53:30 cloud kernel: [55233.281480] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Mar 25 10:53:30 cloud kernel: [55233.281484] DR6: ffff0ff0 DR7: 00000400
Mar 25 10:53:30 cloud kernel: [55233.281487] Call Trace:
Mar 25 10:53:30 cloud kernel: [55233.281498]  [] ? xen_safe_halt+0x15/0x40
Mar 25 10:53:30 cloud kernel: [55233.281504]  [] xen_idle+0x29/0x80
Mar 25 10:53:30 cloud kernel: [55233.281509]  [] cpu_idle+0x8f/0xc0
Mar 25 10:53:30 cloud kernel: [55233.281516]  [] rest_init+0x53/0x60
Mar 25 10:53:30 cloud kernel: [55233.281524]  [] start_kernel+0x379/0x37f
Mar 25 10:53:30 cloud kernel: [55233.281529]  [] ? unknown_bootoption+0x0/0x1a0
Mar 25 10:53:30 cloud kernel: [55233.281535]  [] i386_start_kernel+0x67/0x6e
Mar 25 10:54:36 cloud kernel: [55298.785478] BUG: soft lockup - CPU#0 stuck for 61s! [swapper:0]
Mar 25 10:54:36 cloud kernel: [55298.785538] Modules linked in: nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 scsi_mod lp parport
Mar 25 10:54:36 cloud kernel: [55298.785551] 
Mar 25 10:54:36 cloud kernel: [55298.785554] Pid: 0, comm: swapper Not tainted (2.6.32-308-ec2 #16-Ubuntu) 
Mar 25 10:54:36 cloud kernel: [55298.785556] EIP: 0061:[] EFLAGS: 00000246 CPU: 0
Mar 25 10:54:36 cloud kernel: [55298.785560] EIP is at 0xc01013a7
Mar 25 10:54:36 cloud kernel: [55298.785561] EAX: 00000000 EBX: 00000001 ECX: 00000000 EDX: c0689f58
Mar 25 10:54:36 cloud kernel: [55298.785563] ESI: c06beb08 EDI: a8b0d3bc EBP: c0689f78 ESP: c0689f70
Mar 25 10:54:36 cloud kernel: [55298.785565]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
Mar 25 10:54:36 cloud kernel: [55298.785570] CR0: 8005003b CR2: 08ddb00c CR3: 28e7a000 CR4: 00002660
Mar 25 10:54:36 cloud kernel: [55298.785573] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Mar 25 10:54:36 cloud kernel: [55298.785575] DR6: ffff0ff0 DR7: 00000400
Mar 25 10:54:36 cloud kernel: [55298.785576] Call Trace:
Mar 25 10:54:36 cloud kernel: [55298.785584]  [] ? xen_safe_halt+0x15/0x40
Mar 25 10:54:36 cloud kernel: [55298.785588]  [] xen_idle+0x29/0x80
Mar 25 10:54:36 cloud kernel: [55298.785591]  [] cpu_idle+0x8f/0xc0
Mar 25 10:54:36 cloud kernel: [55298.785596]  [] rest_init+0x53/0x60
Mar 25 10:54:36 cloud kernel: [55298.785602]  [] start_kernel+0x379/0x37f
Mar 25 10:54:36 cloud kernel: [55298.785605]  [] ? unknown_bootoption+0x0/0x1a0
Mar 25 10:54:36 cloud kernel: [55298.785608]  [] i386_start_kernel+0x67/0x6e
Mar 25 10:54:57 cloud kernel: [55318.911014] INFO: task impsys-snap.sh:25538 blocked for more than 120 seconds.
Mar 25 10:54:57 cloud kernel: [55318.911056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 10:54:57 cloud kernel: [55318.911061] impsys-snap.s D e8ac5e60     0 25538      1 0x00000004
Mar 25 10:54:57 cloud kernel: [55318.911065]  e8ac5e78 00000282 ec53bb64 e8ac5e60 c01d9498 00000000 00000000 00000000
Mar 25 10:54:57 cloud kernel: [55318.911072]  c06f61c0 c13088dc c06f61c0 c06f61c0 c06f61c0 c13088dc c06f61c0 c06f61c0
Mar 25 10:54:57 cloud kernel: [55318.911077]  ea1d5040 00003221 c1308630 c068d280 c012abc8 0003cd53 0004ae99 e8ac5eac
Mar 25 10:54:57 cloud kernel: [55318.911083] Call Trace:
Mar 25 10:54:57 cloud kernel: [55318.911092]  [] ? __link_path_walk+0xa28/0xc20
Mar 25 10:54:57 cloud kernel: [55318.911098]  [] ? inc_rt_group+0xf8/0x110
Mar 25 10:54:57 cloud kernel: [55318.911103]  [] ? update_curr+0x169/0x2c0
Mar 25 10:54:57 cloud kernel: [55318.911114]  [] schedule_timeout+0x175/0x250
Mar 25 10:54:57 cloud kernel: [55318.911117]  [] ? check_preempt_wakeup+0x152/0x370
Mar 25 10:54:57 cloud kernel: [55318.911120]  [] wait_for_common+0xc6/0x180
Mar 25 10:54:57 cloud kernel: [55318.911122]  [] ? default_wake_function+0x0/0x10
Mar 25 10:54:57 cloud kernel: [55318.911125]  [] wait_for_completion+0x12/0x20
Mar 25 10:54:57 cloud kernel: [55318.911128]  [] sched_migrate_task+0xe4/0x100
Mar 25 10:54:57 cloud kernel: [55318.911130]  [] sched_exec+0x3b/0x50
Mar 25 10:54:57 cloud kernel: [55318.911134]  [] do_execve+0xc4/0x360
Mar 25 10:54:57 cloud kernel: [55318.911137]  [] sys_execve+0x28/0x60
Mar 25 10:54:57 cloud kernel: [55318.911139]  [] syscall_call+0x7/0xb
Mar 25 10:54:57 cloud kernel: [55318.911142] INFO: task lesspipe:25544 blocked for more than 120 seconds.
Mar 25 10:54:57 cloud kernel: [55318.911145] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 10:54:57 cloud kernel: [55318.911148] lesspipe      D e9487e60     0 25544  25543 0x00000000
Mar 25 10:54:57 cloud kernel: [55318.911151]  e9487e78 00000286 e9487ed4 e9487e60 c01d9096 c01e1204 00000000 00000000
Mar 25 10:54:57 cloud kernel: [55318.911155]  c06f61c0 ec3bb4ec c06f61c0 c06f61c0 c06f61c0 ec3bb4ec c06f61c0 c06f61c0
Mar 25 10:54:57 cloud kernel: [55318.911159]  ec269900 00003221 ec3bb240 c068d280 6d9860b9 0003cd53 e9487ea0 c0106c51
Mar 25 10:54:57 cloud kernel: [55318.911163] Call Trace:
Mar 25 10:54:57 cloud kernel: [55318.911165]  [] ? __link_path_walk+0x626/0xc20
Mar 25 10:54:57 cloud kernel: [55318.911169]  [] ? dput+0x84/0x160
Mar 25 10:54:57 cloud kernel: [55318.911172]  [] ? sched_clock+0x21/0x80
Mar 25 10:54:57 cloud kernel: [55318.911175]  [] schedule_timeout+0x175/0x250
Mar 25 10:54:57 cloud kernel: [55318.911179]  [] ? sched_clock_cpu+0x14d/0x190
Mar 25 10:54:57 cloud kernel: [55318.911181]  [] ? find_idlest_group+0xa8/0x1b0
Mar 25 10:54:57 cloud kernel: [55318.911184]  [] wait_for_common+0xc6/0x180
Mar 25 10:54:57 cloud kernel: [55318.911187]  [] ? default_wake_function+0x0/0x10
Mar 25 10:54:57 cloud kernel: [55318.911189]  [] wait_for_completion+0x12/0x20
Mar 25 10:54:57 cloud kernel: [55318.911192]  [] sched_migrate_task+0xe4/0x100
Mar 25 10:54:57 cloud kernel: [55318.911194]  [] sched_exec+0x3b/0x50
Mar 25 10:54:57 cloud kernel: [55318.911197]  [] do_execve+0xc4/0x360
Mar 25 10:54:57 cloud kernel: [55318.911199]  [] sys_execve+0x28/0x60
Mar 25 10:54:57 cloud kernel: [55318.911201]  [] syscall_call+0x7/0xb
Mar 25 10:54:57 cloud kernel: [55318.911204] INFO: task 10-help-text:25560 blocked for more than 120 seconds.
Mar 25 10:54:57 cloud kernel: [55318.911206] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 10:54:57 cloud kernel: [55318.911209] 10-help-text  D e8995e60     0 25560  25558 0x00000000
Mar 25 10:54:57 cloud kernel: [55318.911211]  e8995e78 00000282 e8995ed4 e8995e60 c01d9096 00000000 00000000 00000000
Mar 25 10:54:57 cloud kernel: [55318.911215]  c06f61c0 e8af122c c06f61c0 c06f61c0 c06f61c0 e8af122c c06f61c0 c06f61c0
Mar 25 10:54:57 cloud kernel: [55318.911219]  ec295e40 00003229 e8af0f80 c068d280 3d76c004 0003cd5c e8995ea0 c0106c51
Mar 25 10:54:57 cloud kernel: [55318.911223] Call Trace:
Mar 25 10:54:57 cloud kernel: [55318.911225]  [] ? __link_path_walk+0x626/0xc20
Mar 25 10:54:57 cloud kernel: [55318.911228]  [] ? sched_clock+0x21/0x80
Mar 25 10:54:57 cloud kernel: [55318.911231]  [] schedule_timeout+0x175/0x250
Mar 25 10:54:57 cloud kernel: [55318.911233]  [] ? sched_clock_cpu+0x14d/0x190
Mar 25 10:54:57 cloud kernel: [55318.911236]  [] ? find_idlest_group+0xa8/0x1b0
Mar 25 10:54:57 cloud kernel: [55318.911238]  [] wait_for_common+0xc6/0x180
Mar 25 10:54:57 cloud kernel: [55318.911241]  [] ? default_wake_function+0x0/0x10
Mar 25 10:54:57 cloud kernel: [55318.911243]  [] wait_for_completion+0x12/0x20
Mar 25 10:54:57 cloud kernel: [55318.911246]  [] sched_migrate_task+0xe4/0x100
Mar 25 10:54:57 cloud kernel: [55318.911248]  [] sched_exec+0x3b/0x50
Mar 25 10:54:57 cloud kernel: [55318.911251]  [] do_execve+0xc4/0x360
Mar 25 10:54:57 cloud kernel: [55318.911253]  [] sys_execve+0x28/0x60
Mar 25 10:54:57 cloud kernel: [55318.911256]  [] syscall_call+0x7/0xb
Mar 25 10:54:57 cloud kernel: [55318.911258] INFO: task xe-update-guest:25564 blocked for more than 120 seconds.
Mar 25 10:54:57 cloud kernel: [55318.911260] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 10:54:57 cloud kernel: [55318.911263] xe-update-gue D e9535e60     0 25564  25563 0x00000000
Mar 25 10:54:57 cloud kernel: [55318.911265]  e9535e78 00000286 e9535ed4 e9535e60 c01d9096 c01e1204 00000000 00000000
Mar 25 10:54:57 cloud kernel: [55318.911269]  c06f61c0 ea09685c c06f61c0 c06f61c0 c06f61c0 ea09685c c06f61c0 c06f61c0
Mar 25 10:54:57 cloud kernel: [55318.911273]  ec269ac0 0000322b ea0965b0 c068d280 f58f849c 0003cd5d e9535ea0 c0106c51
Mar 25 10:54:57 cloud kernel: [55318.911277] Call Trace:
Mar 25 10:54:57 cloud kernel: [55318.911279]  [] ? __link_path_walk+0x626/0xc20
Mar 25 10:54:57 cloud kernel: [55318.911282]  [] ? dput+0x84/0x160
Mar 25 10:54:57 cloud kernel: [55318.911284]  [] ? sched_clock+0x21/0x80
Mar 25 10:54:57 cloud kernel: [55318.911287]  [] schedule_timeout+0x175/0x250
Mar 25 10:54:57 cloud kernel: [55318.911290]  [] ? sched_clock_cpu+0x14d/0x190
Mar 25 10:54:57 cloud kernel: [55318.911292]  [] ? find_idlest_group+0xa8/0x1b0
Mar 25 10:54:57 cloud kernel: [55318.911294]  [] wait_for_common+0xc6/0x180
Mar 25 10:54:57 cloud kernel: [55318.911297]  [] ? default_wake_function+0x0/0x10
Mar 25 10:54:57 cloud kernel: [55318.911299]  [] wait_for_completion+0x12/0x20
Mar 25 10:54:57 cloud kernel: [55318.911302]  [] sched_migrate_task+0xe4/0x100
Mar 25 10:54:57 cloud kernel: [55318.911305]  [] sched_exec+0x3b/0x50
Mar 25 10:54:57 cloud kernel: [55318.911307]  [] do_execve+0xc4/0x360
Mar 25 10:54:57 cloud kernel: [55318.911310]  [] sys_execve+0x28/0x60
Mar 25 10:54:57 cloud kernel: [55318.911312]  [] syscall_call+0x7/0xb
Mar 25 10:54:57 cloud kernel: [55318.911314] INFO: task apache2:25565 blocked for more than 120 seconds.
Mar 25 10:54:57 cloud kernel: [55318.911316] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 10:54:57 cloud kernel: [55318.911319] apache2       D e899be60     0 25565  23116 0x00000000
Mar 25 10:54:57 cloud kernel: [55318.911321]  e899be78 00000282 e899bed4 e899be60 c01d9096 c068afa0 00000000 00000000
Mar 25 10:54:57 cloud kernel: [55318.911325]  c06f61c0 e949522c c06f61c0 c06f61c0 c06f61c0 e949522c c06f61c0 c06f61c0
Mar 25 10:54:57 cloud kernel: [55318.911329]  ec31f040 0000322c e9494f80 c068d280 1fcfd913 0003cd5f e899bea0 c0106c51
Mar 25 10:54:57 cloud kernel: [55318.911333] Call Trace:
Mar 25 10:54:57 cloud kernel: [55318.911336]  [] ? __link_path_walk+0x626/0xc20
Mar 25 10:54:57 cloud kernel: [55318.911338]  [] ? sched_clock+0x21/0x80
Mar 25 10:54:57 cloud kernel: [55318.911341]  [] schedule_timeout+0x175/0x250
Mar 25 10:54:57 cloud kernel: [55318.911343]  [] ? sched_clock_cpu+0x14d/0x190
Mar 25 10:54:57 cloud kernel: [55318.911346]  [] ? find_idlest_group+0xa8/0x1b0
Mar 25 10:54:57 cloud kernel: [55318.911348]  [] wait_for_common+0xc6/0x180
Mar 25 10:54:57 cloud kernel: [55318.911351]  [] ? default_wake_function+0x0/0x10
Mar 25 10:54:57 cloud kernel: [55318.911353]  [] wait_for_completion+0x12/0x20
Mar 25 10:54:57 cloud kernel: [55318.911356]  [] sched_migrate_task+0xe4/0x100
Mar 25 10:54:57 cloud kernel: [55318.911359]  [] sched_exec+0x3b/0x50
Mar 25 10:54:57 cloud kernel: [55318.911361]  [] do_execve+0xc4/0x360
Mar 25 10:54:57 cloud kernel: [55318.911364]  [] sys_execve+0x28/0x60
Mar 25 10:54:57 cloud kernel: [55318.911366]  [] syscall_call+0x7/0xb
Mar 25 10:56:02 cloud kernel: [55383.610034] BUG: soft lockup - CPU#0 stuck for 61s! [swapper:0]
Mar 25 10:56:02 cloud kernel: [55383.610056] Modules linked in: nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 scsi_mod lp parport
Mar 25 10:56:02 cloud kernel: [55383.610073] 
Mar 25 10:56:02 cloud kernel: [55383.610077] Pid: 0, comm: swapper Not tainted (2.6.32-308-ec2 #16-Ubuntu) 
Mar 25 10:56:02 cloud kernel: [55383.610080] EIP: 0061:[] EFLAGS: 00000246 CPU: 0
Mar 25 10:56:02 cloud kernel: [55383.610084] EIP is at 0xc01013a7
Mar 25 10:56:02 cloud kernel: [55383.610086] EAX: 00000000 EBX: 00000001 ECX: 00000000 EDX: c0689f58
Mar 25 10:56:02 cloud kernel: [55383.610089] ESI: c06beb08 EDI: a8b0d3bc EBP: c0689f78 ESP: c0689f70
Mar 25 10:56:02 cloud kernel: [55383.610091]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
Mar 25 10:56:02 cloud kernel: [55383.610097] CR0: 8005003b CR2: 098c801c CR3: 2a6d2000 CR4: 00002660
Mar 25 10:56:02 cloud kernel: [55383.610104] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Mar 25 10:56:02 cloud kernel: [55383.610106] DR6: ffff0ff0 DR7: 00000400
Mar 25 10:56:02 cloud kernel: [55383.610107] Call Trace:
Mar 25 10:56:02 cloud kernel: [55383.610115]  [] ? xen_safe_halt+0x15/0x40
Mar 25 10:56:02 cloud kernel: [55383.610119]  [] xen_idle+0x29/0x80
Mar 25 10:56:02 cloud kernel: [55383.610122]  [] cpu_idle+0x8f/0xc0
Mar 25 10:56:02 cloud kernel: [55383.610127]  [] rest_init+0x53/0x60
Mar 25 10:56:02 cloud kernel: [55383.610133]  [] start_kernel+0x379/0x37f
Mar 25 10:56:02 cloud kernel: [55383.610136]  [] ? unknown_bootoption+0x0/0x1a0
Mar 25 10:56:02 cloud kernel: [55383.610139]  [] i386_start_kernel+0x67/0x6e

Solution 1:

D is uninterruptible wait, which once upon a time was Disk wait, but these days often has more to do with waiting for network filesystems. These processes count in the load average, which may be your problem.

The thing that jumps out at me from the ps output is the second-last line: a filesystem (NAS?) is being snapshotted, and probably all disk activity to that filesystem is being blocked during the snapshot. Ways to deal with this include but aren't limited to using a cachefs over it and snapshotting more often so they go faster; which to use depends on the NAS (some of them suck during snapshots no matter what), available disk space, and your needs. I would start by looking for s support community (not necessarily vendor support) for the NAS to see what tricks other users have come up with to minimize snapshot latency.

Solution 2:

In the end, the problem was actually hardware-related. We migrated to a different server and POOF, the problem is gone. At least it made us take a good hard look at possible performance issues in our own programs. We found a few things that would have probably gone unnoticed, but should be cleaned up.