how can I diagnose a "frozen" linux software raid device?

I have a server running Linux 3.2.12 32-bit i686 with 13 drives: 1 boot drive, and 3 raid5 devices of 4 drives each.

/proc/mdstat shows

Personalities : [raid1] [raid10] [raid6] [raid5] [raid4] 
md2 : active raid5 sdd1[3] sdc1[2] sdb1[1] sda1[0]
    5860535808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid5 sdk1[3] sdj1[2] sdi1[1] sdh1[0]
    4395407808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md3 : active raid5 sdl1[0] sdm1[1] sdf1[3] sde1[2]
    5860535808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>

My problem is that for the second time in three days, one of the raid drives is causing any process that tries to read from it to lock up. No signal is able to terminate these process, and I have to reboot to get it working again. However, the drives seem fine after the reboot and the raid status seems fine, and the kernel log doesn't have any useful error messages other than that processes are hung.

I've run smartctl on all the drives in question, and they seem fine.

What else can I check to try and diagnose this?

Here's excepts of the kernel log that look semi-interesting. But note the "can't send ioctl to partition" has been around forever, and searches yielded that it was a harmless warning.

Every 900 seconds:

...
Aug 20 18:34:01 [kernel] [  931.249505] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:01 [kernel] [ 1831.302297] scsi_verify_blk_ioctl: 2 callbacks suppressed
Aug 20 18:49:01 [kernel] [ 1831.302300] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:01 [kernel] [ 1831.302302] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:01 [kernel] [ 1831.302774] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:01 [kernel] [ 1831.302776] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:02 [kernel] [ 1831.333538] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:02 [kernel] [ 1831.333540] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:02 [kernel] [ 1831.358068] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:02 [kernel] [ 1831.358071] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:02 [kernel] [ 1831.414331] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:02 [kernel] [ 1831.414334] mdadm: sending ioctl 1261 to a partition!
Aug 20 19:04:01 [kernel] [ 2731.070794] scsi_verify_blk_ioctl: 2 callbacks suppressed
...

About the time the problem shows up:

Aug 21 13:38:32 [kernel] [69601.312055] INFO: task kjournald:26008 blocked for more than 600 seconds.
Aug 21 13:38:32 [kernel] [69601.312057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 21 13:38:32 [kernel] [69601.312059] kjournald       D 00000000     0 26008      2 0x00000000
Aug 21 13:38:32 [kernel] [69601.312063]  eb5ccc80 00000046 00000000 00000000 00000000 e81e0070 e81e020c f6205900
Aug 21 13:38:32 [kernel] [69601.312068]  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Aug 21 13:38:32 [kernel] [69601.312072]  00000000 00000000 00000000 00000000 00000000 00000001 c0b66230 e81e0280
Aug 21 13:38:32 [kernel] [69601.312077] Call Trace:
Aug 21 13:38:32 [kernel] [69601.312083]  [<c013cbe5>] ? prepare_to_wait+0x15/0x55
Aug 21 13:38:32 [kernel] [69601.312088]  [<c0217df5>] ? journal_commit_transaction+0xdb/0xca6
Aug 21 13:38:32 [kernel] [69601.312090]  [<c013ca68>] ? wake_up_bit+0x16/0x16
Aug 21 13:38:32 [kernel] [69601.312093]  [<c0132c3d>] ? lock_timer_base+0x19/0x35
Aug 21 13:38:32 [kernel] [69601.312095]  [<c0132cb8>] ? try_to_del_timer_sync+0x5f/0x65
Aug 21 13:38:32 [kernel] [69601.312098]  [<c021ade6>] ? kjournald+0xa6/0x1a2
Aug 21 13:38:32 [kernel] [69601.312101]  [<c013ca68>] ? wake_up_bit+0x16/0x16
Aug 21 13:38:32 [kernel] [69601.312103]  [<c021ad40>] ? journal_grab_journal_head+0x31/0x31
Aug 21 13:38:32 [kernel] [69601.312106]  [<c013c778>] ? kthread+0x65/0x6a
Aug 21 13:38:32 [kernel] [69601.312108]  [<c013c713>] ? kthread_stop+0x47/0x47
Aug 21 13:38:32 [kernel] [69601.312111]  [<c0830b36>] ? kernel_thread_helper+0x6/0xd

First upgrade your kernel. That particular kernel contained a bug which caused various ioctls to print those warnings (and maybe fail) in certain mdraid and LVM configurations.

If a fixed kernel doesn't resolve the problem, run an extended self-test on all your drives. Note that the self-test may take several hours for each drive and will degrade performance slightly while running, so should be run at a time of low system activity. For example, to schedule the self-tests to begin at 11 pm:

at 11 pm <<JOB
for drive in /dev/sd?
do
    smartctl -t long $drive || :
done
JOB

Later the next day, check the test results:

for drive in /dev/sd?
do
    echo Test results for drive $drive
    smartctl -l selftest $drive || :
done

If the kernel update didn't fix the problem, then you may find a drive that failed the self-test.

If you don't find a drive that failed the self-test, check the drive attributes anyway.

for drive in /dev/sd?
do
    echo Attributes for drive $drive
    smartctl -A $drive || :
done

Note that some of these attributes may indicate problems even if they are not marked as failed; so find an expert to examine them, or attach them to your question.

how can I diagnose a "frozen" linux software raid device?

Related

Recent Posts