Buffer I/O Error on md-device - can't identify failed drive

Syncing my postgres master to the slave server resulted in write I/O errors on the slave (journalctl):

Aug 18 03:09:23 db01a kernel: EXT4-fs warning (device dm-3): ext4_end_bio:330: I/O error -5 writing to inode 86772956 (offset 905969664 size 8388608 starting block 368694016)
Aug 18 03:09:23 db01a kernel: buffer_io_error: 326 callbacks suppressed

....

Reading the affected file of course also doesn't work:

cat base/96628250/96737718  >> /dev/null
cat: 96737718: Input/output error

Shouldn't the Linux kernel (Ubuntu 16.04, 4.4.0-87-generic) kick the affected drive from the array automatically?

As it is a RAID 6 (with LVM and ext4 on top), I already tried overwriting every SSD a few times with badblocks to provoke the error (removing one disk after another from the array to do so), unfortunately without success.

smartctl says one disk had errors before (the others are clean):

 smartctl -a /dev/sda
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       2
 179 Used_Rsvd_Blk_Cnt_Tot   0x0013   099   099   010    Pre-fail  Always       -       2
 183 Runtime_Bad_Block       0x0013   099   099   010    Pre-fail  Always       -       2
 187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always       -       3
 195 ECC_Error_Rate          0x001a   199   199   000    Old_age   Always       -       3

But rewriting the whole disk with badblocks -wsv worked without error.

As it is a pretty important server for me, I replaced the whole server with a different model, but the error persisted. Am I correct in thinking that it's probably a disk issue?

Is there any way to know which disk is affected, maybe by calculating?

EDIT: For clarification: What I'm not getting is how the initial sync of 1.5 TB of data from the master to the slave can result in two unrecoverable I/O errors, while manually running destructive read-write tests on every involved SSD completes without any error.

EDIT2: Output of lsblk (identical for sda-sdf); pvs; vgs; lvs;

lsblk:
sda1                        8:16   0 953.9G  0 disk                                                
├─sda1                     8:17   0   4.7G  0 part                                                
│ └─md0                    9:0    0   4.7G  0 raid1                                               
└─sda5                     8:21   0 949.2G  0 part                                                
  └─md1                    9:1    0   2.8T  0 raid6                                               
    ├─vgdb01a-lvroot     252:0    0  18.6G  0 lvm   /                                             
    ├─vgdb01a-lvvar      252:1    0    28G  0 lvm   /var                                          
    ├─vgdb01a-lvtmp      252:2    0   4.7G  0 lvm   /tmp                                          
    └─vgdb01a-lvpostgres 252:3    0   2.6T  0 lvm   /postgres 

pvs: 
PV         VG      Fmt  Attr PSize PFree  
/dev/md1   vgdb01a lvm2 a--  2.78t 133.64g

vgs:
VG      #PV #LV #SN Attr   VSize VFree  
vgdb01a   1   4   0 wz--n- 2.78t 133.64g

lvs:
lvs
LV         VG      Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
lvpostgres vgdb01a -wi-ao----  2.60t                                                    
lvroot     vgdb01a -wi-ao---- 18.62g                                                    
lvtmp      vgdb01a -wi-ao----  4.66g                                                    
lvvar      vgdb01a -wi-ao---- 27.94g

Update 2017-8-22

echo check > /sys/block/md1/md/sync_action
[Mon Aug 21 16:10:22 2017] md: data-check of RAID array md1
[Mon Aug 21 16:10:22 2017] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[Mon Aug 21 16:10:22 2017] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[Mon Aug 21 16:10:22 2017] md: using 128k window, over a total of 995189760k.
[Mon Aug 21 18:58:18 2017] md: md1: data-check done.

echo repair > /sys/block/md1/md/sync_action
[Tue Aug 22 12:54:11 2017] md: requested-resync of RAID array md1
[Tue Aug 22 12:54:11 2017] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[Tue Aug 22 12:54:11 2017] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for requested-resync.
[Tue Aug 22 12:54:11 2017] md: using 128k window, over a total of 995189760k.
[2160302.241701] md: md1: requested-resync done.

e2fsck -y -f /dev/mapper/vgdb01a-lvpostgres
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/mapper/vgdb01a-lvpostgres: 693517/174489600 files (1.6% non-contiguous), 608333768/697932800 blocks

Update 2017-8-22 2

Output of lsscsi and smartctl for all disks is on pastebin: https://pastebin.com/VUxKEKiF

UPDATE-8/22

If you want to solve this problem quickly, just replace the two drives that have the worst smartctl stats and reassess. Once you're out of reserved blocks, your drive is EOL. Since we tend to buy these drives all at once, they tend to fail around the same time, so it doesn't matter which one the bad block is pinned to. Once you replace the worst two offenders (that means replace one, resync, and repeat), you'll have increased the overall health of the array, probably replaced the complaining disk, and dramatically reduced the risk of a double fault where you lose everything.

At the end of the day, your data is worth more than a few hundred bucks.

ENDUPDATE-8/22

UPDATE-8/21

Toni: Yes, your original post has room for improvement. Given those facts, this is the conclusion I arrived at. It also wasn't clear until now that you had already suffered data corruption.

It would be helpful if you included the headers with the smartctl output. This is easier on SCSI: sg_reassign will tell you how many free blocks you have left to reassign, and once that goes to zero, you're done. Seeing that smartctl is reporting "prefail" in several categories, it sounds like you'll be done soon too.

Soon you'll experience hard media errors and MD will kick the drive. Running fsck would be a good idea in the meantime. When drives fail a write, they reassign the destination from the free block pool; when you run out, it becomes an unrecoverable media error.

Also enable the "disk scrubber" on MD and run it weekly from cron. It will read and rewrite every sector and head this off before it becomes a real problem. See Documentation/md.txt in the kernel.

[disk scrubber example] https://www.ogre.com/node/384
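As a rough illustration of the weekly scrub, here is a minimal sketch built on the same sync_action file used in the question. The function names and the optional sysfs_root argument are made up for this example; only sync_action and mismatch_cnt are real md sysfs files.

```shell
#!/usr/bin/env bash
# Sketch of a weekly MD scrub job. The function names and the optional
# sysfs_root argument are hypothetical; sync_action and mismatch_cnt are
# the files md exposes under /sys/block/<array>/md/.

# Start a consistency check unless the array is already busy
# (resyncing, recovering, or checking).
md_start_check() {
    local md=$1
    local sysfs_root=${2:-/sys/block}
    local action_file="$sysfs_root/$md/md/sync_action"
    if [ "$(cat "$action_file")" != "idle" ]; then
        echo "$md: sync_action is not idle, skipping" >&2
        return 1
    fi
    echo check > "$action_file"
}

# Print the mismatch counter left behind by the most recent check.
md_mismatch_count() {
    local md=$1
    local sysfs_root=${2:-/sys/block}
    cat "$sysfs_root/$md/md/mismatch_cnt"
}
```

A cron entry could source this file and call md_start_check md1 once a week during off hours, then alert if md_mismatch_count md1 is non-zero after the check finishes. Where the script lives and how it alerts is up to you.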

You still have to run smartmon on all the drives (once a day, off hours), parse the output, and create alarms to head off this very problem.
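The "parse the output" step might look something like the sketch below. It reads the attribute table from stdin and prints a line for each watched attribute whose raw value is non-zero. The function name and the watched attribute IDs (5, 187, 197, 198) are my own assumptions; which attributes matter varies by drive vendor.

```shell
#!/usr/bin/env bash
# Sketch: flag SMART attributes with a non-zero raw value.
# smart_alarm_lines is a hypothetical helper; the watched IDs are an
# assumption and should be adapted to the drive model.
# Expects the attribute table as printed by `smartctl -A` on stdin.
smart_alarm_lines() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # column 2 is the name and column 10 is the raw value.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            id = $1; name = $2; raw = $10
            if ((id == 5 || id == 187 || id == 197 || id == 198) && raw + 0 > 0)
                printf "ALERT %s (id %s) raw=%s\n", name, id, raw
        }'
}
```

Usage would be along the lines of smartctl -A /dev/sda | smart_alarm_lines, run per disk from a daily cron job, with any output mailed or fed into whatever alerting you already have.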

Folks, this is what hardware RAIDs do for you. The irony is, we have all the tools to provide a better MD experience, but no one puts them together into an integrated solution.

You're pretty much at the tail end of silent data corruption. An fsck might help you, but really the best thing to do is refer to your backups (you kept backups, right? RAIDs are not backups) and prepare for this RAID to start sinking.

Then you'll find the bad disk.

Sorry.

ENDUPDATE-8/21

For starters, did you read the man page for badblocks for the options you used?

   -w     Use write-mode test. With this option, badblocks scans for bad  blocks  by  writing
          some  patterns (0xaa, 0x55, 0xff, 0x00) on every block of the device, reading every
          block and comparing the contents.  This option may not  be  combined  with  the  -n
          option, as they are mutually exclusive.

So your data is gone; -n was the nondestructive version. Maybe what you really did was pull a disk from the array, run badblocks on it, and then reinsert it? Please clarify.

That you don't know which disk has failed to begin with tells me that it is not an MD RAID array. So whatever non-existent LVM "raid" tools exist to help you recover from this simple failure, that's what you need to figure out.

I would say that the majority of users go with an MD RAID solution. The remainder get distracted by "what's this thing?" or "oh, this is LVM, it's what I'm supposed to do, right?" and later end up where you are now: a RAID implementation with terrible management tools, which actually created more risk than you attempted to mitigate by building a RAID 6 to begin with.

It's not your fault, you didn't know. Frankly, they should disable the thing for exactly this reason.

Concerning repairing bad blocks: you can do this by taking the machine offline, booting from a live USB drive, and performing one of the following repair procedures.

https://sites.google.com/site/itmyshare/storage/storage-disk/bad-blocks-how-to

http://linuxtroops.blogspot.com/2013/07/how-to-find-bad-block-on-linux-harddisk.html

As to where this sector is in your array: well, you would have to account for the parity rotation, which is a PITA. I would suggest that you simply verify each drive until you find the problem.
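For the curious, the parity-rotation arithmetic can be sketched as follows. This assumes the array uses md's default left-symmetric RAID 6 layout (check with mdadm --detail), and the function name and argument order are invented for illustration. The input is a sector offset within the md array, so the ext4 block from the kernel log would first have to be translated through the LVM layer; that step is not shown.

```shell
#!/usr/bin/env bash
# Sketch: map a 512-byte sector offset inside an md RAID 6 array to a
# member slot and a sector on that member.
# ASSUMPTIONS: left-symmetric layout; md_raid6_locate and its arguments
# are hypothetical names. Arguments:
#   $1 sector within the md array
#   $2 number of member devices (including the two parity devices)
#   $3 chunk size in sectors
#   $4 data offset on each member in sectors (optional, default 0)
md_raid6_locate() {
    local array_sector=$1
    local raid_disks=$2
    local chunk_sectors=$3
    local data_offset=${4:-0}
    local data_disks=$(( raid_disks - 2 ))

    # Which chunk of the array the sector falls in, and where inside it.
    local chunk_number=$(( array_sector / chunk_sectors ))
    local chunk_offset=$(( array_sector % chunk_sectors ))

    # Stripe number and index of the data chunk within that stripe.
    local stripe=$(( chunk_number / data_disks ))
    local dd_idx=$(( chunk_number % data_disks ))

    # Left-symmetric rotation: P moves back one slot per stripe, Q follows
    # P, and data chunks start right after Q.
    local p_slot=$(( raid_disks - 1 - stripe % raid_disks ))
    local q_slot=$(( (p_slot + 1) % raid_disks ))
    local data_slot=$(( (p_slot + 2 + dd_idx) % raid_disks ))

    local member_sector=$(( stripe * chunk_sectors + chunk_offset + data_offset ))
    echo "data_slot=$data_slot p_slot=$p_slot q_slot=$q_slot member_sector=$member_sector"
}
```

The slot numbers are positions in the array (the RaidDevice column of mdadm --detail), not sdX names, so they still need to be matched to physical drives. Treat the result as a hint about where to look rather than proof of which disk is bad.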

You can help prevent this in the future by enabling "disk scrubbing" in MD, which reads and rewrites each sector in a maintenance window to discover exactly this sort of problem and potentially repair it.

I hope this helps.
