DegradedArray event on /dev/md1

This morning I got this message:

This is an automatically generated mail message from mdadm
running on 

A DegradedArray event had been detected on md device /dev/md1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md1 : active raid1 sdb3[2](F) sda3[1]
      1860516800 blocks [2/1] [_U]

md0 : active raid1 sdb1[0] sda1[1]
      499904 blocks [2/2] [UU]

unused devices: <none>

Does this mean that one of the hard drives is no longer working? How can I fix it? Should I ask the data center to replace the drive, or can I try to re-add the missing device? If so, what command should I run, and is it safe to re-add? I just don't want my server to go offline.

serv397:/var/log# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb3[2](F) sda3[1]
      1860516800 blocks [2/1] [_U]

md0 : active raid1 sdb1[0] sda1[1]
      499904 blocks [2/2] [UU]

unused devices: <none>

serv397:/var/log# mdadm -D /dev/md1
/dev/md1:
        Version : 0.90
  Creation Time : Sun Apr 29 22:51:51 2012
     Raid Level : raid1
     Array Size : 1860516800 (1774.33 GiB 1905.17 GB)
  Used Dev Size : 1860516800 (1774.33 GiB 1905.17 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Sat Feb 23 09:26:39 2013
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : ec02d5ce:8554d4ad:7792c71e:7dc17aa4
         Events : 0.11225668

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8        3        1      active sync   /dev/sda3

       2       8       19        -      faulty spare   /dev/sdb3


kern.log

Feb 23 09:00:58 triton1017 kernel: [24015352.812156] __ratelimit: 134 callbacks suppressed
Feb 23 09:00:58 triton1017 kernel: [24015352.812165] mdadm: sending ioctl 1261 to a partition!
Feb 23 09:00:58 triton1017 kernel: [24015352.812172] mdadm: sending ioctl 1261 to a partition!


dmesg:

[    1.929981] mdadm: sending ioctl 1261 to a partition!
[    1.930211] mdadm: sending ioctl 800c0910 to a partition!
[    1.930241] mdadm: sending ioctl 800c0910 to a partition!
[    1.944515] md: md0 stopped.
[    1.945700] md: bind<sda1>
[    1.945944] md: bind<sdb1>
[    1.947709] raid1: raid set md0 active with 2 out of 2 mirrors
[    1.947784] md0: detected capacity change from 0 to 511901696
[    1.948516]  md0: unknown partition table
[    1.984932] md: md1 stopped.
[    1.986131] md: bind<sda3>
[    1.986332] md: bind<sdb3>
[    1.987377] raid1: raid set md1 active with 2 out of 2 mirrors
[    1.987421] md1: detected capacity change from 0 to 1905169203200
[    1.988287]  md1: unknown partition table
[    2.164118] kjournald starting.  Commit interval 5 seconds
[    2.164130] EXT3-fs: mounted filesystem with ordered data mode.
[    3.181350] udev[346]: starting version 164
[    3.644863] input: PC Speaker as /devices/platform/pcspkr/input/input3
[    3.654062] Error: Driver 'pcspkr' is already registered, aborting...
[    3.663045] piix4_smbus 0000:00:14.0: SMBus Host Controller at 0xb00, revision 0
[    3.810284] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
[    3.812865] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[    3.860102] [drm] Initialized drm 1.1.0 20060810
[    3.884550] hda-intel: no codecs found!
[    3.884672] HDA Intel 0000:01:05.1: setting latency timer to 64
[    3.925197] [drm] radeon defaulting to userspace modesetting.
[    3.925973] pci 0000:01:05.0: setting latency timer to 64
[    3.926082] [drm] Initialized radeon 1.32.0 20080528 for 0000:01:05.0 on minor 0
[    4.123784] Adding 1998840k swap on /dev/sda2.  Priority:-1 extents:1 across:1998840k
[    4.126482] Adding 1998840k swap on /dev/sdb2.  Priority:-2 extents:1 across:1998840k
[    4.332550] EXT3 FS on md1, internal journal
[    5.247285]   alloc irq_desc for 25 on node -1
[    5.247287]   alloc kstat_irqs on node -1
[    5.247299] tg3 0000:02:00.0: irq 25 for MSI/MSI-X
[    5.275326] ADDRCONF(NETDEV_UP): eth0: link is not ready

I tried to re-add the device:

sudo mdadm --re-add /dev/md1 /dev/sdb3
mdadm: Cannot open /dev/sdb3: Device or resource busy
sudo mdadm --remove /dev/md1 /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md1
sudo mdadm --add /dev/md1 /dev/sdb3
mdadm: re-added /dev/sdb3

/var/log# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb3[2] sda3[1]
      1860516800 blocks [2/1] [_U]
      [>....................]  recovery =  0.1% (2849024/1860516800) finish=455.9min speed=67898K/sec

md0 : active raid1 sdb1[0] sda1[1]
      499904 blocks [2/2] [UU]

unused devices: <none>

Re-syncing didn't solve the problem. The array is still degraded and sdb3 is now listed as a spare (S):

triton1017:/var/log# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb3[2](S) sda3[1]
      1860516800 blocks [2/1] [_U]

md0 : active raid1 sdb1[0] sda1[1]
      499904 blocks [2/2] [UU]

unused devices: <none>

triton1017:/var/log# mdadm -D /dev/md1
/dev/md1:
        Version : 0.90
  Creation Time : Sun Apr 29 22:51:51 2012
     Raid Level : raid1
     Array Size : 1860516800 (1774.33 GiB 1905.17 GB)
  Used Dev Size : 1860516800 (1774.33 GiB 1905.17 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Sat Feb 23 18:14:08 2013
          State : clean, degraded
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

           UUID : ec02d5ce:8554d4ad:7792c71e:7dc17aa4
         Events : 0.11245156

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8        3        1      active sync   /dev/sda3

       2       8       19        -      spare   /dev/sdb3


The kern.log file shows the following:

Feb 23 14:55:19 triton1017 kernel: [24036613.378608] ata1.00: error: { UNC }
Feb 23 14:55:19 triton1017 kernel: [24036613.398590] ata1.00: configured for UDMA/133
Feb 23 14:55:19 triton1017 kernel: [24036613.398627] ata1: EH complete
Feb 23 14:55:21 triton1017 kernel: [24036616.262518] ata1.00: exception Emask 0x0 SAct 0x1dfbe SErr 0x0 action 0x0
Feb 23 14:55:21 triton1017 kernel: [24036616.262525] ata1.00: irq_stat 0x40000008
Feb 23 14:55:21 triton1017 kernel: [24036616.262531] ata1.00: failed command: READ FPDMA QUEUED
Feb 23 14:55:21 triton1017 kernel: [24036616.262539] ata1.00: cmd 60/80:28:00:5a:b4/00:00:75:00:00/40 tag 5 ncq 65536 in
Feb 23 14:55:21 triton1017 kernel: [24036616.262540]          res 41/40:80:38:5a:b4/00:00:75:00:00/00 Emask 0x409 (media error) <F>
Feb 23 14:57:16 triton1017 kernel: [24036730.503323] ata1.00: status: { DRDY ERR }
Feb 23 14:57:16 triton1017 kernel: [24036730.503328] ata1.00: error: { UNC }
Feb 23 14:57:16 triton1017 kernel: [24036730.523346] ata1.00: configured for UDMA/133
Feb 23 14:57:16 triton1017 kernel: [24036730.523356] ata1: EH complete
Feb 23 14:57:17 triton1017 kernel: [24036732.116026] INFO: task mysqld:6067 blocked for more than 120 seconds.
Feb 23 14:57:17 triton1017 kernel: [24036732.116032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 23 14:57:17 triton1017 kernel: [24036732.116040] mysqld        D 0000000000000002     0  6067    938 0x00000000
Feb 23 14:57:17 triton1017 kernel: [24036732.116049]  ffffffff814891f0 0000000000000086 0000000000000000 00000000ffffffff
Feb 23 14:57:17 triton1017 kernel: [24036732.117353]  ffff880016dcfc00 0000000000015780 000000000000f9e0 ffff8805c4c65fd8
Feb 23 14:57:17 triton1017 kernel: [24036732.117367]  0000000000015780 0000000000015780 ffff880618825bd0 ffff880618825ec8
Feb 23 14:57:17 triton1017 kernel: [24036732.117380] Call Trace:
Feb 23 14:57:17 triton1017 kernel: [24036732.117391]  [<ffffffff810168f3>] ? read_tsc+0xa/0x20
Feb 23 14:57:17 triton1017 kernel: [24036732.117400]  [<ffffffff8110e656>] ? sync_buffer+0x0/0x40
Feb 23 14:57:17 triton1017 kernel: [24036732.117408]  [<ffffffff812fbb4a>] ? io_schedule+0x73/0xb7
Feb 23 14:57:17 triton1017 kernel: [24036732.117419]  [<ffffffff8110e691>] ? sync_buffer+0x3b/0x40
Feb 23 14:57:17 triton1017 kernel: [24036732.117426]  [<ffffffff812fbf5a>] ? __wait_on_bit_lock+0x3f/0x84
Feb 23 14:57:17 triton1017 kernel: [24036732.117433]  [<ffffffff8110e656>] ? sync_buffer+0x0/0x40
Feb 23 14:57:17 triton1017 kernel: [24036732.117441]  [<ffffffff812fc00a>] ? out_of_line_wait_on_bit_lock+0x6b/0x77
Feb 23 14:57:17 triton1017 kernel: [24036732.117451]  [<ffffffff81065070>] ? wake_bit_function+0x0/0x23
Feb 23 14:57:17 triton1017 kernel: [24036732.117459]  [<ffffffff8110ea83>] ? sync_dirty_buffer+0x29/0x93
Feb 23 14:57:17 triton1017 kernel: [24036732.117474]  [<ffffffffa018ce04>] ? journal_dirty_data+0xd1/0x1b0 [jbd]
Feb 23 14:57:17 triton1017 kernel: [24036732.117486]  [<ffffffffa01a3f1f>] ? ext3_journal_dirty_data+0xf/0x34 [ext3]
Feb 23 14:57:17 triton1017 kernel: [24036732.117499]  [<ffffffffa01a23f9>] ? walk_page_buffers+0x65/0x8b [ext3]
Feb 23 14:57:17 triton1017 kernel: [24036732.117510]  [<ffffffffa01a3f44>] ? journal_dirty_data_fn+0x0/0x13 [ext3]
Feb 23 14:57:17 triton1017 kernel: [24036732.117521]  [<ffffffffa01a5a66>] ? ext3_ordered_write_end+0x73/0x10f [ext3]
Feb 23 14:57:17 triton1017 kernel: [24036732.117532]  [<ffffffffa01b0bbb>] ? ext3_xattr_get+0x1ef/0x271 [ext3]
Feb 23 14:57:17 triton1017 kernel: [24036732.117542]  [<ffffffff810b517e>] ? generic_file_buffered_write+0x18d/0x278
Feb 23 14:57:17 triton1017 kernel: [24036732.117552]  [<ffffffff810b561a>] ? __generic_file_aio_write+0x25f/0x293
Feb 23 14:57:17 triton1017 kernel: [24036732.117560]  [<ffffffff810b56a7>] ? generic_file_aio_write+0x59/0x9f
Feb 23 14:57:17 triton1017 kernel: [24036732.117569]  [<ffffffff810eef1a>] ? do_sync_write+0xce/0x113
Feb 23 14:57:17 triton1017 kernel: [24036732.117577]  [<ffffffff81103a85>] ? mntput_no_expire+0x23/0xee
Feb 23 14:57:17 triton1017 kernel: [24036732.117584]  [<ffffffff81065042>] ? autoremove_wake_function+0x0/0x2e
Feb 23 14:57:17 triton1017 kernel: [24036732.117593]  [<ffffffff812fce69>] ? _spin_lock_bh+0x9/0x25
Feb 23 14:57:17 triton1017 kernel: [24036732.117600]  [<ffffffff810ef86c>] ? vfs_write+0xa9/0x102
Feb 23 14:57:17 triton1017 kernel: [24036732.117607]  [<ffffffff810ef91c>] ? sys_pwrite64+0x57/0x77
Feb 23 14:57:17 triton1017 kernel: [24036732.117615]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Feb 23 14:57:17 triton1017 kernel: [24036732.117622] INFO: task flush-9:1:1456 blocked for more than 120 seconds.
Feb 23 14:57:17 triton1017 kernel: [24036732.117628] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 23 14:57:17 triton1017 kernel: [24036732.117636] flush-9:1     D 0000000000000000     0  1456      2 0x00000000
Feb 23 14:57:17 triton1017 kernel: [24036732.117645]  ffffffff814891f0 0000000000000046 0000000000000000 0000000000000001
Feb 23 14:57:17 triton1017 kernel: [24036732.117659]  0000000000000086 ffffffff8104a45a 000000000000f9e0 ffff88061905dfd8
Feb 23 14:57:17 triton1017 kernel: [24036732.117672]  0000000000015780 0000000000015780 ffff8806190646a0 ffff880619064998
Feb 23 14:57:17 triton1017 kernel: [24036732.117683] Call Trace:
Feb 23 14:57:17 triton1017 kernel: [24036732.117691]  [<ffffffff8104a45a>] ? try_to_wake_up+0x289/0x29b
Feb 23 14:57:17 triton1017 kernel: [24036732.117701]  [<ffffffff8119255f>] ? radix_tree_tag_clear+0x93/0xf1
Feb 23 14:57:17 triton1017 kernel: [24036732.117709]  [<ffffffff8110e656>] ? sync_buffer+0x0/0x40
Feb 23 14:57:17 triton1017 kernel: [24036732.117716]  [<ffffffff812fbb4a>] ? io_schedule+0x73/0xb7
Feb 23 14:57:17 triton1017 kernel: [24036732.117724]  [<ffffffff8110e691>] ? sync_buffer+0x3b/0x40
Feb 23 14:57:17 triton1017 kernel: [24036732.117731]  [<ffffffff812fbf5a>] ? __wait_on_bit_lock+0x3f/0x84
Feb 23 14:57:17 triton1017 kernel: [24036732.117738]  [<ffffffff8110e656>] ? sync_buffer+0x0/0x40
Feb 23 14:57:17 triton1017 kernel: [24036732.117745]  [<ffffffff812fc00a>] ? out_of_line_wait_on_bit_lock+0x6b/0x77
Feb 23 14:57:17 triton1017 kernel: [24036732.117753]  [<ffffffff81065070>] ? wake_bit_function+0x0/0x23
Feb 23 14:57:17 triton1017 kernel: [24036732.117762]  [<ffffffff8110fa23>] ? __block_write_full_page+0x159/0x2ac

Solution 1:

You can try to re-add the failed member to the array with the following command:

sudo mdadm --re-add /dev/md1 /dev/sdb3

If you get a "Device or resource busy" error, remove the member first and then add it again:

sudo mdadm --remove /dev/md1 /dev/sdb3
sudo mdadm --add /dev/md1 /dev/sdb3

If these commands also fail, please post the exact error message so we can help further.
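
Before and after a re-add it helps to see which arrays are still degraded. The following is only a sketch: degraded_arrays is a hypothetical helper name (it is not part of mdadm), and it assumes the /proc/mdstat layout shown in the question, where a missing member appears as an underscore in a status field such as [_U].

```shell
#!/bin/sh
# Hypothetical helper: print the name of every md array whose status field
# (e.g. [_U]) contains an underscore, meaning a member is missing.
# Reads /proc/mdstat by default; pass a file name to check saved output.
degraded_arrays() {
    awk '
        # Array header line, e.g. "md1 : active raid1 sdb3[2](F) sda3[1]"
        /^md[0-9]+ :/ { name = $1; next }
        # Status line of that array, e.g. "... blocks [2/1] [_U]"
        name != "" && /\[[U_]*_[U_]*\]/ { print name; name = "" }
    ' "${1:-/proc/mdstat}"
}
```

Called with no argument on the server, degraded_arrays reads /proc/mdstat; for the output in the question it prints md1. It keeps printing md1 while a recovery is running, because the status field stays [_U] until the rebuild finishes.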

Solution 2:

That disk is actually defective: the "UNC" and "media error" entries in kern.log mean the drive could not read some sectors. Have it replaced, then re-sync the array onto the new disk.
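
A rough outline of the replacement, written as a function that only prints the commands so they can be reviewed before anything is run. This is a sketch under several assumptions that are not confirmed in the thread: the first argument is the disk being replaced and the second is the healthy one, both disks use MBR partition tables (hence sfdisk), and the layout matches the logs above (partition 1 in md0, partition 2 as swap, partition 3 in md1). replacement_plan is a made-up name. It is worth checking which physical disk ata1.00 corresponds to before asking the data center to pull anything.

```shell
#!/bin/sh
# Print (do not execute) the commands for swapping one half of the mirrors.
# Assumptions: $1 = disk being replaced, $2 = healthy disk, MBR partition
# tables, partitions 1/2/3 used as md0 member / swap / md1 member.
replacement_plan() {
    bad=$1
    good=$2
    cat <<EOF
# 0. Optional: look at the SMART data first (needs smartmontools)
smartctl -H -A /dev/${bad}
# 1. Before the disk is pulled: take it out of both arrays and out of swap
mdadm /dev/md0 --fail /dev/${bad}1 --remove /dev/${bad}1
mdadm /dev/md1 --fail /dev/${bad}3 --remove /dev/${bad}3
swapoff /dev/${bad}2
# 2. After the new disk is installed: copy the partition table from the good disk
sfdisk -d /dev/${good} | sfdisk /dev/${bad}
mkswap /dev/${bad}2 && swapon /dev/${bad}2
# 3. Add the new partitions; md rebuilds both mirrors
mdadm /dev/md0 --add /dev/${bad}1
mdadm /dev/md1 --add /dev/${bad}3
# 4. Watch the rebuild progress
cat /proc/mdstat
EOF
}
```

For the setup in the question this would be called as replacement_plan sdb sda. If the machine boots from either disk, the boot loader probably has to be reinstalled on the new disk as well; that step depends on the distribution and is left out here.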