e2fsck cleans a filesystem and then a few minutes later (after a lot of reads) there are errors
The filesystem is on an LVM RAID5. It appears to be working correctly:
$ sudo pvs
[sudo] password for jrwren:
PV         VG     Fmt  Attr PSize    PFree
/dev/sda2  datavg lvm2 a--  <7.28t   2.80t
/dev/sdb2  datavg lvm2 a--  <3.64t   0
/dev/sdc2  datavg lvm2 a--  <7.28t   <7.28t
/dev/sdd2  datavg lvm2 a--  <7.28t   0
/dev/sde2  datavg lvm2 a--  <7.28t   73.82g
/dev/sdf1  datavg lvm2 a--  <3.64t   0
/dev/sdg2  datavg lvm2 a--  <7.28t   3.99t
/dev/sdh2  datavg lvm2 a--  <447.11g 8.00m
/dev/sdi2  datavg lvm2 a--  <9.10t   2.21t
$ sudo lvs
LV       VG     Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
lxd2     datavg -wi-ao---- 147.10g
mirrored datavg -wi-ao---- 300.00g
m        datavg Rwi-aor--- 3.52t                                      100.00
m3       datavg Rwi-aor--- 4.00t                                      100.00
mu       datavg Rwi-aor--- 1.00t                                      100.00
nomirror datavg -wi-ao---- 2.20t
photos   datavg Rwi-aor--- 200.00g                                    100.00
stor     datavg Rwi-aor--- 300.00g                                    100.00
storj    datavg -wi-ao---- 1.00t
t        datavg Rwi-aor--- 6.00t                                      100.00
t2       datavg Rwi-aor--- 3.90t                                      100.00
I have a process doing many reads on the logical volume named m, which is device dm-12. Eventually, the process dies with the following kernel messages.
Jun 30 16:02:33 delays kernel: [393661.035286] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365192: comm rtorrent main: pblk 765519712 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jun 30 16:02:33 delays kernel: [393661.039726] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365192: comm rtorrent main: pblk 765519712 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jun 30 16:02:33 delays kernel: [393661.044175] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365192: comm rtorrent main: pblk 765519712 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jun 30 16:02:33 delays kernel: [393661.048584] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365192: comm rtorrent main: pblk 765519712 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jun 30 16:02:33 delays kernel: [393661.054717] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365192: comm rtorrent main: pblk 765519712 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jun 30 16:02:33 delays kernel: [393661.060977] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365192: comm rtorrent main: pblk 765519712 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jun 30 16:02:33 delays kernel: [393661.063736] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365192: comm rtorrent main: pblk 765519712 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jun 30 16:02:33 delays kernel: [393661.066283] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365192: comm rtorrent main: pblk 765519712 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jun 30 16:02:33 delays kernel: [393661.068773] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365192: comm rtorrent main: pblk 765519712 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jun 30 16:02:33 delays kernel: [393661.071232] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365192: comm rtorrent main: pblk 765519712 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
I unmount the filesystem and run e2fsck:
$ sudo e2fsck -p /dev/datavg/m
movies contains a file system with errors, check forced.
movies: Inode 118751237 has an invalid extent node (blk 475078659, lblk 0)
movies: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
$ sudo e2fsck -y /dev/datavg/movies
e2fsck 1.45.7 (28-Jan-2021)
movies contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 177471496 has an invalid extent node (blk 709943175, lblk 0)
Clear? yes
...
Pass 1E: Optimizing extent trees
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(709943175--709943176) -(868210688--868212735) -(868214784--868216831) -(868253696--868255743) -(868257792--868259839) -(868886528--868888575) -(868892672--868894719) -(868896768--868898815) -(868900864--868902911) -(868904960--868907007) -(868909056--868911103) -(868913152--868917247) -(868921344--868923391) -(868925440--868927487) -(868929536--868931583) -(868933632--868935679) -(868937728--868939775) -(868941824--868943871) -(868945920--868947967) -(868950016--868954111) -(868958208--868960013) -(869894144--869922573)
Fix? yes
Free blocks count wrong for group #21665 (24561, counted=24563).
Fix? yes
Free blocks count wrong for group #26495 (28672, counted=32768).
Fix? yes
Free blocks count wrong for group #26497 (18432, counted=22528).
Fix? yes
Free blocks count wrong for group #26516 (22528, counted=32768).
Fix? yes
Free blocks count wrong for group #26517 (16384, counted=32768).
Fix? yes
Free blocks count wrong for group #26518 (16626, counted=26624).
Fix? yes
Free blocks count wrong for group #26547 (2290, counted=30720).
Fix? yes
Free blocks count wrong (366951912, counted=367025158).
Fix? yes
movies: ***** FILE SYSTEM WAS MODIFIED *****
movies: 6896/236224512 files (20.8% non-contiguous), 577868794/944893952 blocks
$ sudo e2fsck -p /dev/datavg/movies
movies: clean, 6896/236224512 files, 577868794/944893952 blocks
It says it is clean, so I remount it and rerun the reading software.
And a few minutes later:
Jun 30 16:34:49 delays kernel: [395595.309814] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365190: comm rtorrent main: pblk 765517692 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jun 30 16:34:49 delays kernel: [395595.317838] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365190: comm rtorrent main: pblk 765517692 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jun 30 16:34:49 delays kernel: [395595.320836] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365190: comm rtorrent main: pblk 765517692 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jun 30 16:34:49 delays kernel: [395595.323418] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365190: comm rtorrent main: pblk 765517692 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jun 30 16:35:14 delays kernel: [395619.785771] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365190: comm rtorrent main: pblk 765517692 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jun 30 16:35:14 delays kernel: [395619.793135] EXT4-fs error (device dm-12): ext4_find_extent:885: inode #191365190: comm rtorrent main: pblk 765517692 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
What is going on here? Is the LVM corrupt and lying to me? Is there a command I can run to check? Should I run a badblocks (e2fsck -c) or something?
There are no corresponding LVM messages from the kernel. I'd expect LVM errors if the underlying disks had problems. What is going on?
Update: someone asked for dmesg output. The EXT4-fs messages above are exactly that output. Apart from the standard boot messages, the only other message in dmesg is this one, repeated:
[527724.593062] rptaddrs[3948921]: segfault at 7ffc7a7a50b5 ip 00007fd9f0f86820 sp 00007ffc7a7a3fc8 error 4 in libc-2.28.so[7fd9f0e4c000+148000]
[527724.593075] Code: 7f 07 c5 fe 7f 4f 20 c5 fe 7f 54 17 e0 c5 fe 7f 5c 17 c0 c5 f8 77 c3 48 39 f7 0f 87 ab 00 00 00 0f 84 e5 fe ff ff c5 fe 6f 26 <c5> fe 6f 6c 16 e0 c5 fe 6f 74 16 c0 c5 fe 6f 7c 16 a0 c5 7e 6f 44
Solution 1:
The two times this has happened to me, the cause has been a hardware fault. Possible underlying causes:
- badly-connected cables
- bad disk cable (happened to me once)
- buggy SATA interface (I had an interface which wrote a block of zero bytes into my disk device, just once, but then I discarded the card)
- bad RAM (corrupting buffered data)
- overheating or errors introduced by overclocking
- probably less likely, other hardware faults
Both times this happened to me I experienced data loss. These days, that's much less likely, because I use ZFS with replicated snapshots and also have offline tape backups.
The fact that fsck reports the filesystem as clean and it then fails again within minutes convinces me it's a hardware problem. My guess is that the blocks fsck writes to the disk when "fixing" the problem are not (always) making it to the disk surface uncorrupted.
First of all, make sure your existing cables are correctly seated and re-test. If that doesn't fix the problem, read on:
You might be able to prove this is the problem with a test disk:
- Obtain a live bootable system image e.g. on a USB drive. Do not prepare this on your faulty machine, because presumably it will get corrupted. Use some other machine, or buy a ready-made live-Linux-system USB stick.
- Power down the system.
- Label every hard disk with how it is connected to the SATA interfaces (e.g. which port etc.)
- Disconnect the drives and properly store them (i.e. in robust anti-static containers). Do not plug them back into the system until you have isolated and fixed the problem, because your efforts to fix the problem with fsck are just making it worse.
- Plug in a sacrificial disk containing no valuable data at all that you can safely overwrite
- Double-check that your sacrificial disk and the live bootable image (see next item) are the only storage devices connected to the machine. You need to avoid accidentally partitioning a disk with your valuable data on it, or running badblocks on such a disk.
- Boot from a live system image (e.g. bootable USB live system)
- Partition the drive into a small number of partitions, the first one being a few tens of GB
- Run badblocks -w -B (the -B makes sure we exercise the RAM too) on a small partition (selecting a small one so that the test doesn't take days)
- If this fails, you have a hardware problem; try changing components to see if the problem goes away
- for example, take out all RAM modules except one, rotate through them to identify which one is bad
- for example, change which SATA port you connect to, to identify a bad SATA interface or adapter
- for example, keep the same SATA port but change the cable, to identify a bad cable
- It's possible that flaws in other system components (even a faulty motherboard or an under-powered PSU) may cause the problem
- If you suspect bad RAM, you can use memtest86 to test it. You can also omit the -B flag from badblocks to use direct I/O instead, which will reduce but not eliminate the use of RAM.
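The badblocks step above could be wrapped in a small guard like the following. Treat it as a minimal sketch rather than a definitive procedure: the function names check_target and run_write_test and the device path /dev/sdX1 are placeholders made up for illustration, and it assumes a Linux live system with /proc/mounts and the badblocks shipped with e2fsprogs. The -s and -v flags (progress and verbose output) are my addition; -w and -B are the flags described above.

```shell
#!/bin/sh
# Sketch: destructive write-mode test of a sacrificial partition.
# WARNING: badblocks -w overwrites every byte on the target device.

# check_target DEV: succeed only if DEV is an unmounted block device.
# (Hypothetical helper, not part of e2fsprogs.)
check_target() {
    dev=$1
    if [ ! -b "$dev" ]; then
        echo "refusing: $dev is not a block device" >&2
        return 1
    fi
    # Exact match on the first field of /proc/mounts.
    if awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' /proc/mounts; then
        echo "refusing: $dev is mounted" >&2
        return 1
    fi
    return 0
}

# run_write_test DEV: run the write-mode test after the safety checks.
run_write_test() {
    dev=$1
    check_target "$dev" || return 1
    # -w  destructive write-mode test (writes patterns, reads them back)
    # -B  buffered I/O, so the data also passes through RAM
    # -s  show progress, -v verbose reporting
    badblocks -w -B -s -v "$dev"
}

# Usage, as root from the live system (placeholder device name):
#   run_write_test /dev/sdX1
```

The guard only catches the obvious mistakes (a typo in the path, a mounted partition). It cannot tell a sacrificial disk from one holding your data, so physically disconnecting the real drives still matters.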
Once you have identified the faulty hardware, replace it. Ideally, restore your most recent backup onto fresh disks (noting that if you didn't actually isolate and fix the problem, the data on your fresh disks will also get corrupted).
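One way to notice whether corruption recurs after the restore is to record checksums of the restored files and re-verify them later. A sketch, assuming GNU coreutils and findutils; the function names and the paths in the usage lines are hypothetical:

```shell
#!/bin/sh
# make_manifest DIR MANIFEST: record a SHA-256 checksum for every file under DIR.
# MANIFEST must be an absolute path outside DIR.
make_manifest() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$2"
}

# verify_manifest DIR MANIFEST: re-read every file and compare it with the manifest.
# Returns non-zero if any file is missing or its contents have changed.
verify_manifest() {
    ( cd "$1" && sha256sum --quiet -c "$2" )
}

# Usage (placeholder paths):
#   make_manifest /mnt/restored /root/restored.sha256
#   verify_manifest /mnt/restored /root/restored.sha256
```

Because recently written files are usually served from the page cache, it makes sense to re-verify after a reboot or remount, so that the data is actually read back from the disks.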