Why is my mdadm RAID-1 recovery so slow?

I'm running Ubuntu 10.04. My RAID-1 recovery started out fast but quickly became ridiculously slow (at this rate the rebuild will take 150 days!):

dimmer@paimon:~$ cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sdc1[2] sdb1[1]
      1953513408 blocks [2/1] [_U]
      [====>................]  recovery = 24.4% (477497344/1953513408) finish=217368.0min speed=113K/sec

unused devices: <none>
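
For anyone following along, the rebuild can also be monitored from the command line:

watch -n 5 cat /proc/mdstat    # refresh the status above every 5 seconds
sudo mdadm --detail /dev/md0   # per-array detail, including rebuild progress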

The rebuild is this slow even though I have set the kernel variables to reasonably quick values:

dimmer@paimon:~$ cat /proc/sys/dev/raid/speed_limit_min
1000000

dimmer@paimon:~$ cat /proc/sys/dev/raid/speed_limit_max
100000000
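
For reference, these can be raised at runtime with sysctl; values are in KB/s (the numbers below are illustrative, not a recommendation):

sudo sysctl -w dev.raid.speed_limit_min=50000    # floor for rebuild throughput, in KB/s
sudo sysctl -w dev.raid.speed_limit_max=200000   # ceiling for rebuild throughput, in KB/s

# equivalent direct writes to procfs
echo 50000  | sudo tee /proc/sys/dev/raid/speed_limit_min
echo 200000 | sudo tee /proc/sys/dev/raid/speed_limit_max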

I am using two 2.0 TB Western Digital hard disks, a WDC WD20EARS-00M and a WDC WD20EARS-00J. I believe they have been partitioned so that their sectors are aligned.

dimmer@paimon:/sys$ sudo parted /dev/sdb
GNU Parted 2.2
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.

(parted) p                                                                
Model: ATA WDC WD20EARS-00M (scsi)
Disk /dev/sdb: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name  Flags
 1      1049kB  2000GB  2000GB  ext4

(parted) unit s

(parted) p                                                                

Number  Start  End          Size         File system  Name  Flags
 1      2048s  3907028991s  3907026944s  ext4

(parted) q                                                                
dimmer@paimon:/sys$ sudo parted /dev/sdc
GNU Parted 2.2
Using /dev/sdc
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p                                                                
Model: ATA WDC WD20EARS-00J (scsi)
Disk /dev/sdc: 2000GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start   End     Size    File system  Name  Flags
 1      1049kB  2000GB  2000GB  ext4
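
For the drive that reports 4096B physical sectors, the alignment can be double-checked rather than taken on faith; recent parted versions have an align-check command, and the start sector can also be verified by hand:

sudo parted /dev/sdc align-check opt 1   # prints "1 aligned" when partition 1 is optimally aligned

# by hand: with 512B logical sectors, a start sector divisible by 8 lies on a 4 KiB boundary;
# 2048 % 8 == 0, so partition 1 starts 4 KiB-aligned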

I am beginning to think that I have a hardware problem; otherwise I can't imagine why the mdadm rebuild should be so slow.

I have benchmarked /dev/sdc using Ubuntu's Disk Utility GUI app, and the results looked normal, so I know sdc is capable of writing faster than this. I also had the same problem on a similar WD drive that I RMA'd because of bad sectors. I suppose it's possible they sent me a replacement with bad sectors too, although no SMART values are showing them yet.
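
For anyone wanting to reproduce that comparison from the command line, a rough equivalent of the GUI benchmark would be something like the following (the test-file path is only an example):

sudo hdparm -t /dev/sdc   # sequential read timing on the raw device, non-destructive

# sequential write through the filesystem; delete the test file afterwards
dd if=/dev/zero of=/mnt/test.img bs=1M count=1024 conv=fdatasync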

Any ideas? Thanks.

As requested, here is the output of top sorted by CPU usage (notice there is ~0% CPU usage). iowait is also zero, which seems strange:

top - 11:35:13 up 2 days,  9:40,  3 users,  load average: 2.87, 2.58, 2.30
Tasks: 142 total,   1 running, 141 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.2%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3096304k total,  1482164k used,  1614140k free,   617672k buffers
Swap:  1526132k total,        0k used,  1526132k free,   535416k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                  
   45 root      20   0     0    0    0 S    0  0.0   2:17.02 scsi_eh_0                                                                                 
    1 root      20   0  2808 1752 1204 S    0  0.1   0:00.46 init                                                                                      
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd                                                                                  
    3 root      RT   0     0    0    0 S    0  0.0   0:00.02 migration/0                                                                               
    4 root      20   0     0    0    0 S    0  0.0   0:00.17 ksoftirqd/0                                                                               
    5 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0                                                                                
    6 root      RT   0     0    0    0 S    0  0.0   0:00.02 migration/1
  ...                                                               
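
Since top only shows CPU, per-device statistics are more telling here; iostat from the sysstat package (assuming it is installed) shows await and %util per disk, and a stalling array member typically shows high await at near-zero throughput:

iostat -xd 5 sdb sdc   # extended per-device stats, refreshed every 5 seconds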

dmesg shows errors that definitely look like hardware:

[202884.000157] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[202884.007015] ata5.00: failed command: FLUSH CACHE EXT
[202884.013728] ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[202884.013730]          res 40/00:00:ff:59:2e/00:00:35:00:00/e0 Emask 0x4 (timeout)
[202884.033667] ata5.00: status: { DRDY }
[202884.040329] ata5: hard resetting link
[202889.400050] ata5: link is slow to respond, please be patient (ready=0)
[202894.048087] ata5: COMRESET failed (errno=-16)
[202894.054663] ata5: hard resetting link
[202899.412049] ata5: link is slow to respond, please be patient (ready=0)
[202904.060107] ata5: COMRESET failed (errno=-16)
[202904.066646] ata5: hard resetting link
[202905.840056] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[202905.849178] ata5.00: configured for UDMA/133
[202905.849188] ata5: EH complete
[203899.000292] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[203899.007096] ata5.00: failed command: IDENTIFY DEVICE
[203899.013841] ata5.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
[203899.013843]          res 40/00:00:ff:f9:f6/00:00:38:00:00/e0 Emask 0x4 (timeout)
[203899.041232] ata5.00: status: { DRDY }
[203899.048133] ata5: hard resetting link
[203899.816134] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[203899.826062] ata5.00: configured for UDMA/133
[203899.826079] ata5: EH complete
[204375.000200] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[204375.007421] ata5.00: failed command: IDENTIFY DEVICE
[204375.014799] ata5.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
[204375.014800]          res 40/00:00:ff:0c:0f/00:00:39:00:00/e0 Emask 0x4 (timeout)
[204375.044374] ata5.00: status: { DRDY }
[204375.051842] ata5: hard resetting link
[204380.408049] ata5: link is slow to respond, please be patient (ready=0)
[204384.440076] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[204384.449938] ata5.00: configured for UDMA/133
[204384.449955] ata5: EH complete
[204395.988135] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[204395.988140] ata5.00: failed command: IDENTIFY DEVICE
[204395.988147] ata5.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
[204395.988149]          res 40/00:00:ff:0c:0f/00:00:39:00:00/e0 Emask 0x4 (timeout)
[204395.988151] ata5.00: status: { DRDY }
[204395.988156] ata5: hard resetting link
[204399.320075] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[204399.330487] ata5.00: configured for UDMA/133
[204399.330503] ata5: EH complete
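
For completeness, the SMART state mentioned earlier can be checked from the command line with smartmontools (substitute whichever device node maps to ata5):

sudo smartctl -H /dev/sdc         # overall health verdict
sudo smartctl -A /dev/sdc         # attributes; watch IDs 5, 197, 198 (reallocated/pending sectors)
sudo smartctl -l error /dev/sdc   # the drive's ATA error log, which may correlate with the resets above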

Solution 1:

Disable the write cache on the HDDs in the RAID with hdparm -W 0 /dev/sdX. Also, try not to put load on disks that are rebuilding.

You can also try limiting your SATA link speed to 150 MB/s, i.e. SATA I (it looks like the drives are connected to the motherboard's SATA ports, which may be lagging).
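
As a sketch of how that speed limit can be applied, assuming ata5 is the affected port: the libata.force kernel boot parameter forces a link down to SATA I (1.5 Gbps, roughly 150 MB/s). It goes on the kernel line in GRUB:

# force only port ata5 down to SATA I; omit the "5:" to apply it to all ports
libata.force=5:1.5Gbps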

Solution 2:

The problem ended up being caused by hardware. Eventually I unscrewed all the HDDs from their chassis, moved them to new locations in the tower, and unplugged peripherals that I knew I would not be needing. This fixed the problem: the RAID array quickly rebuilt and has been stable for months.

Unfortunately I do not know what specifically caused the problem. I am guessing that something put noise into a data link, but I really have no clue. My motherboard is an Asus P5WDG2 WS Pro, in case anyone else has a similar problem with this motherboard.

Thanks to everyone who tried to help - this ended up being an edge case.