DegradedArray event on /dev/md1
This morning I got this message:
This is an automatically generated mail message from mdadm
running on
A DegradedArray event had been detected on md device /dev/md1.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1]
md1 : active raid1 sdb3[2](F) sda3[1]
1860516800 blocks [2/1] [_U]
md0 : active raid1 sdb1[0] sda1[1]
499904 blocks [2/2] [UU]
unused devices: <none>
Does this mean that one of the hard drives has stopped working? How can I fix this? Should I ask the data center to replace the drive, or can I try to re-add the missing device? If so, which command should I run, and is it safe to re-add? I don't want my server to go offline.
serv397:/var/log# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb3[2](F) sda3[1]
1860516800 blocks [2/1] [_U]
md0 : active raid1 sdb1[0] sda1[1]
499904 blocks [2/2] [UU]
unused devices: <none>
serv397:/var/log# mdadm -D /dev/md1
/dev/md1:
Version : 0.90
Creation Time : Sun Apr 29 22:51:51 2012
Raid Level : raid1
Array Size : 1860516800 (1774.33 GiB 1905.17 GB)
Used Dev Size : 1860516800 (1774.33 GiB 1905.17 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Sat Feb 23 09:26:39 2013
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
UUID : ec02d5ce:8554d4ad:7792c71e:7dc17aa4
Events : 0.11225668
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 3 1 active sync /dev/sda3
2 8 19 - faulty spare /dev/sdb3
kern.log:
Feb 23 09:00:58 triton1017 kernel: [24015352.812156] __ratelimit: 134 callbacks suppressed
Feb 23 09:00:58 triton1017 kernel: [24015352.812165] mdadm: sending ioctl 1261 to a partition!
Feb 23 09:00:58 triton1017 kernel: [24015352.812172] mdadm: sending ioctl 1261 to a partition!
mdadm:
[ 1.929981] mdadm: sending ioctl 1261 to a partition!
[ 1.930211] mdadm: sending ioctl 800c0910 to a partition!
[ 1.930241] mdadm: sending ioctl 800c0910 to a partition!
[ 1.944515] md: md0 stopped.
[ 1.945700] md: bind<sda1>
[ 1.945944] md: bind<sdb1>
[ 1.947709] raid1: raid set md0 active with 2 out of 2 mirrors
[ 1.947784] md0: detected capacity change from 0 to 511901696
[ 1.948516] md0: unknown partition table
[ 1.984932] md: md1 stopped.
[ 1.986131] md: bind<sda3>
[ 1.986332] md: bind<sdb3>
[ 1.987377] raid1: raid set md1 active with 2 out of 2 mirrors
[ 1.987421] md1: detected capacity change from 0 to 1905169203200
[ 1.988287] md1: unknown partition table
[ 2.164118] kjournald starting. Commit interval 5 seconds
[ 2.164130] EXT3-fs: mounted filesystem with ordered data mode.
[ 3.181350] udev[346]: starting version 164
[ 3.644863] input: PC Speaker as /devices/platform/pcspkr/input/input3
[ 3.654062] Error: Driver 'pcspkr' is already registered, aborting...
[ 3.663045] piix4_smbus 0000:00:14.0: SMBus Host Controller at 0xb00, revision 0
[ 3.810284] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
[ 3.812865] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[ 3.860102] [drm] Initialized drm 1.1.0 20060810
[ 3.884550] hda-intel: no codecs found!
[ 3.884672] HDA Intel 0000:01:05.1: setting latency timer to 64
[ 3.925197] [drm] radeon defaulting to userspace modesetting.
[ 3.925973] pci 0000:01:05.0: setting latency timer to 64
[ 3.926082] [drm] Initialized radeon 1.32.0 20080528 for 0000:01:05.0 on minor 0
[ 4.123784] Adding 1998840k swap on /dev/sda2. Priority:-1 extents:1 across:1998840k
[ 4.126482] Adding 1998840k swap on /dev/sdb2. Priority:-2 extents:1 across:1998840k
[ 4.332550] EXT3 FS on md1, internal journal
[ 5.247285] alloc irq_desc for 25 on node -1
[ 5.247287] alloc kstat_irqs on node -1
[ 5.247299] tg3 0000:02:00.0: irq 25 for MSI/MSI-X
[ 5.275326] ADDRCONF(NETDEV_UP): eth0: link is not ready
I tried to re-add the device:
sudo mdadm --re-add /dev/md1 /dev/sdb3
mdadm: Cannot open /dev/sdb3: Device or resource busy
sudo mdadm --remove /dev/md1 /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md1
sudo mdadm --add /dev/md1 /dev/sdb3
mdadm: re-added /dev/sdb3
/var/log# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb3[2] sda3[1]
1860516800 blocks [2/1] [_U]
[>....................] recovery = 0.1% (2849024/1860516800) finish=455.9min speed=67898K/sec
md0 : active raid1 sdb1[0] sda1[1]
499904 blocks [2/2] [UU]
unused devices: <none>
Re-syncing didn't solve the problem:
triton1017:/var/log# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb3[2](S) sda3[1]
1860516800 blocks [2/1] [_U]
md0 : active raid1 sdb1[0] sda1[1]
499904 blocks [2/2] [UU]
unused devices: <none>
triton1017:/var/log# mdadm -D /dev/md1
/dev/md1:
Version : 0.90
Creation Time : Sun Apr 29 22:51:51 2012
Raid Level : raid1
Array Size : 1860516800 (1774.33 GiB 1905.17 GB)
Used Dev Size : 1860516800 (1774.33 GiB 1905.17 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Sat Feb 23 18:14:08 2013
State : clean, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
UUID : ec02d5ce:8554d4ad:7792c71e:7dc17aa4
Events : 0.11245156
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 3 1 active sync /dev/sda3
2 8 19 - spare /dev/sdb3
The kern.log file shows the following:
Feb 23 14:55:19 triton1017 kernel: [24036613.378608] ata1.00: error: { UNC }
Feb 23 14:55:19 triton1017 kernel: [24036613.398590] ata1.00: configured for UDMA/133
Feb 23 14:55:19 triton1017 kernel: [24036613.398627] ata1: EH complete
Feb 23 14:55:21 triton1017 kernel: [24036616.262518] ata1.00: exception Emask 0x0 SAct 0x1dfbe SErr 0x0 action 0x0
Feb 23 14:55:21 triton1017 kernel: [24036616.262525] ata1.00: irq_stat 0x40000008
Feb 23 14:55:21 triton1017 kernel: [24036616.262531] ata1.00: failed command: READ FPDMA QUEUED
Feb 23 14:55:21 triton1017 kernel: [24036616.262539] ata1.00: cmd 60/80:28:00:5a:b4/00:00:75:00:00/40 tag 5 ncq 65536 in
Feb 23 14:55:21 triton1017 kernel: [24036616.262540] res 41/40:80:38:5a:b4/00:00:75:00:00/00 Emask 0x409 (media error) <F>
Feb 23 14:57:16 triton1017 kernel: [24036730.503323] ata1.00: status: { DRDY ERR }
Feb 23 14:57:16 triton1017 kernel: [24036730.503328] ata1.00: error: { UNC }
Feb 23 14:57:16 triton1017 kernel: [24036730.523346] ata1.00: configured for UDMA/133
Feb 23 14:57:16 triton1017 kernel: [24036730.523356] ata1: EH complete
Feb 23 14:57:17 triton1017 kernel: [24036732.116026] INFO: task mysqld:6067 blocked for more than 120 seconds.
Feb 23 14:57:17 triton1017 kernel: [24036732.116032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 23 14:57:17 triton1017 kernel: [24036732.116040] mysqld D 0000000000000002 0 6067 938 0x00000000
Feb 23 14:57:17 triton1017 kernel: [24036732.116049] ffffffff814891f0 0000000000000086 0000000000000000 00000000ffffffff
Feb 23 14:57:17 triton1017 kernel: [24036732.117353] ffff880016dcfc00 0000000000015780 000000000000f9e0 ffff8805c4c65fd8
Feb 23 14:57:17 triton1017 kernel: [24036732.117367] 0000000000015780 0000000000015780 ffff880618825bd0 ffff880618825ec8
Feb 23 14:57:17 triton1017 kernel: [24036732.117380] Call Trace:
Feb 23 14:57:17 triton1017 kernel: [24036732.117391] [<ffffffff810168f3>] ? read_tsc+0xa/0x20
Feb 23 14:57:17 triton1017 kernel: [24036732.117400] [<ffffffff8110e656>] ? sync_buffer+0x0/0x40
Feb 23 14:57:17 triton1017 kernel: [24036732.117408] [<ffffffff812fbb4a>] ? io_schedule+0x73/0xb7
Feb 23 14:57:17 triton1017 kernel: [24036732.117419] [<ffffffff8110e691>] ? sync_buffer+0x3b/0x40
Feb 23 14:57:17 triton1017 kernel: [24036732.117426] [<ffffffff812fbf5a>] ? __wait_on_bit_lock+0x3f/0x84
Feb 23 14:57:17 triton1017 kernel: [24036732.117433] [<ffffffff8110e656>] ? sync_buffer+0x0/0x40
Feb 23 14:57:17 triton1017 kernel: [24036732.117441] [<ffffffff812fc00a>] ? out_of_line_wait_on_bit_lock+0x6b/0x77
Feb 23 14:57:17 triton1017 kernel: [24036732.117451] [<ffffffff81065070>] ? wake_bit_function+0x0/0x23
Feb 23 14:57:17 triton1017 kernel: [24036732.117459] [<ffffffff8110ea83>] ? sync_dirty_buffer+0x29/0x93
Feb 23 14:57:17 triton1017 kernel: [24036732.117474] [<ffffffffa018ce04>] ? journal_dirty_data+0xd1/0x1b0 [jbd]
Feb 23 14:57:17 triton1017 kernel: [24036732.117486] [<ffffffffa01a3f1f>] ? ext3_journal_dirty_data+0xf/0x34 [ext3]
Feb 23 14:57:17 triton1017 kernel: [24036732.117499] [<ffffffffa01a23f9>] ? walk_page_buffers+0x65/0x8b [ext3]
Feb 23 14:57:17 triton1017 kernel: [24036732.117510] [<ffffffffa01a3f44>] ? journal_dirty_data_fn+0x0/0x13 [ext3]
Feb 23 14:57:17 triton1017 kernel: [24036732.117521] [<ffffffffa01a5a66>] ? ext3_ordered_write_end+0x73/0x10f [ext3]
Feb 23 14:57:17 triton1017 kernel: [24036732.117532] [<ffffffffa01b0bbb>] ? ext3_xattr_get+0x1ef/0x271 [ext3]
Feb 23 14:57:17 triton1017 kernel: [24036732.117542] [<ffffffff810b517e>] ? generic_file_buffered_write+0x18d/0x278
Feb 23 14:57:17 triton1017 kernel: [24036732.117552] [<ffffffff810b561a>] ? __generic_file_aio_write+0x25f/0x293
Feb 23 14:57:17 triton1017 kernel: [24036732.117560] [<ffffffff810b56a7>] ? generic_file_aio_write+0x59/0x9f
Feb 23 14:57:17 triton1017 kernel: [24036732.117569] [<ffffffff810eef1a>] ? do_sync_write+0xce/0x113
Feb 23 14:57:17 triton1017 kernel: [24036732.117577] [<ffffffff81103a85>] ? mntput_no_expire+0x23/0xee
Feb 23 14:57:17 triton1017 kernel: [24036732.117584] [<ffffffff81065042>] ? autoremove_wake_function+0x0/0x2e
Feb 23 14:57:17 triton1017 kernel: [24036732.117593] [<ffffffff812fce69>] ? _spin_lock_bh+0x9/0x25
Feb 23 14:57:17 triton1017 kernel: [24036732.117600] [<ffffffff810ef86c>] ? vfs_write+0xa9/0x102
Feb 23 14:57:17 triton1017 kernel: [24036732.117607] [<ffffffff810ef91c>] ? sys_pwrite64+0x57/0x77
Feb 23 14:57:17 triton1017 kernel: [24036732.117615] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Feb 23 14:57:17 triton1017 kernel: [24036732.117622] INFO: task flush-9:1:1456 blocked for more than 120 seconds.
Feb 23 14:57:17 triton1017 kernel: [24036732.117628] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 23 14:57:17 triton1017 kernel: [24036732.117636] flush-9:1 D 0000000000000000 0 1456 2 0x00000000
Feb 23 14:57:17 triton1017 kernel: [24036732.117645] ffffffff814891f0 0000000000000046 0000000000000000 0000000000000001
Feb 23 14:57:17 triton1017 kernel: [24036732.117659] 0000000000000086 ffffffff8104a45a 000000000000f9e0 ffff88061905dfd8
Feb 23 14:57:17 triton1017 kernel: [24036732.117672] 0000000000015780 0000000000015780 ffff8806190646a0 ffff880619064998
Feb 23 14:57:17 triton1017 kernel: [24036732.117683] Call Trace:
Feb 23 14:57:17 triton1017 kernel: [24036732.117691] [<ffffffff8104a45a>] ? try_to_wake_up+0x289/0x29b
Feb 23 14:57:17 triton1017 kernel: [24036732.117701] [<ffffffff8119255f>] ? radix_tree_tag_clear+0x93/0xf1
Feb 23 14:57:17 triton1017 kernel: [24036732.117709] [<ffffffff8110e656>] ? sync_buffer+0x0/0x40
Feb 23 14:57:17 triton1017 kernel: [24036732.117716] [<ffffffff812fbb4a>] ? io_schedule+0x73/0xb7
Feb 23 14:57:17 triton1017 kernel: [24036732.117724] [<ffffffff8110e691>] ? sync_buffer+0x3b/0x40
Feb 23 14:57:17 triton1017 kernel: [24036732.117731] [<ffffffff812fbf5a>] ? __wait_on_bit_lock+0x3f/0x84
Feb 23 14:57:17 triton1017 kernel: [24036732.117738] [<ffffffff8110e656>] ? sync_buffer+0x0/0x40
Feb 23 14:57:17 triton1017 kernel: [24036732.117745] [<ffffffff812fc00a>] ? out_of_line_wait_on_bit_lock+0x6b/0x77
Feb 23 14:57:17 triton1017 kernel: [24036732.117753] [<ffffffff81065070>] ? wake_bit_function+0x0/0x23
Feb 23 14:57:17 triton1017 kernel: [24036732.117762] [<ffffffff8110fa23>] ? __block_write_full_page+0x159/0x2ac
Solution 1:
You can try to re-add the failed member to the array with the following command:
sudo mdadm --re-add /dev/md1 /dev/sdb3
If you get a "Device or resource busy" error, remove the device first and then add it again:
sudo mdadm --remove /dev/md1 /dev/sdb3
sudo mdadm --add /dev/md1 /dev/sdb3
If these commands fail, please post the error message so that others can help.
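To see whether the re-add actually brought the mirror back, it can help to check which arrays are still degraded. The following is a minimal sketch, assuming the /proc/mdstat layout shown in the question; the function name degraded_arrays is hypothetical and not part of mdadm.

```shell
#!/bin/sh
# degraded_arrays: read /proc/mdstat-style text on stdin and print the name
# of each array whose status map (e.g. [_U]) shows a missing member.
# Hypothetical helper; assumes the layout shown in the question.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                   # remember the current array
        name != "" && /\[[U_]+\]/ {                   # status line of that array
            if ($0 ~ /\[[U_]*_[U_]*\]/) print name    # "_" marks a missing member
            name = ""
        }
    '
}

# Usage: degraded_arrays < /proc/mdstat
```

If md1 is still listed after the recovery has finished, the re-sync did not complete and the kernel log is the next place to look.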
Solution 2:
That disk is actually defective, so re-adding it will not help. Have it replaced, then re-sync the array onto the new disk.
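The replacement could be sketched roughly as below. This is only a dry run that prints commands for review instead of executing them. It assumes the layout from the question (md0 on partition 1, swap on partition 2, md1 on partition 3), an MBR partition table that sfdisk can copy, and that the new disk appears under the same device name as the old one. The function name replacement_plan is hypothetical. Before anything is pulled, confirm which physical drive is reporting the ata1.00 errors in kern.log, because removing the healthy disk by mistake would take the array down.

```shell
#!/bin/sh
# replacement_plan: print (do not run) the commands for swapping a failed
# mirror disk. Hypothetical helper; arguments are the healthy disk and the
# disk being replaced, e.g. replacement_plan /dev/sda /dev/sdb
# Partition numbers and array names are taken from the question.
replacement_plan() {
    good=$1
    bad=$2
    cat <<EOF
# 1. Before the swap: release every use of ${bad}
mdadm /dev/md0 --fail ${bad}1 --remove ${bad}1
mdadm /dev/md1 --remove ${bad}3
swapoff ${bad}2
# 2. After the new disk is installed: copy the partition table (MBR assumed)
sfdisk -d ${good} | sfdisk ${bad}
# 3. Add the new partitions; the kernel then re-syncs each mirror
mdadm /dev/md0 --add ${bad}1
mdadm /dev/md1 --add ${bad}3
mkswap ${bad}2
swapon ${bad}2
# 4. Watch the recovery progress
cat /proc/mdstat
EOF
}

# Usage: replacement_plan /dev/sda /dev/sdb
```

Printing the plan first leaves room to check every device name against the current output of mdadm -D before running anything as root.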