If a RAID5 system experiences a URE during rebuild, is all the data lost?

I understand the argument that larger drives are more likely to experience a URE during a rebuild, but I'm not sure what the actual implications are. This answer says that the entire rebuild fails, but does this mean that all the data is inaccessible? Why would that be? Surely a single URE from a single sector on the drive would only impact the data related to a few files, at most. Wouldn't the array still be rebuilt, just with some minor corruption to a few files?

(I'm specifically interested in ZFS's implementation of RAID5 here, but the logic seems the same for any RAID5 implementation.)

It really depends on the specific RAID implementation:

Most hardware RAID controllers will abort the reconstruction, and some will also mark the array as failed, bringing it down. The rationale is that if a URE happens during a RAID5 rebuild, some data is lost, so it is better to stop the array completely rather than risk silent data corruption. Note: some hardware RAID controllers (mainly LSI-based) will instead puncture the array, allowing the rebuild to proceed while marking the affected sector as unreadable (similar to how Linux software RAID behaves).
Linux software RAID can be instructed to a) stop the array rebuild (the only behavior of "ancient" MDRAID/kernel builds) or b) continue with the rebuild process, marking some LBAs as bad/inaccessible. The rationale is that it is better to let the user decide: after all, a single URE can be in free space, not affecting data at all (or affecting only unimportant files).
ZFS (RAIDZ) will show some files as corrupted, but it will continue with the rebuild process (see here for an example). Again, the rationale is that it is better to continue and report back to the user, enabling them to make an informed choice.
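To make the difference between the "abort" and "continue" behaviors concrete, here is a minimal toy sketch in Python. It is not how any real controller, MDRAID, or ZFS is implemented: the `rebuild` function, its `policy` argument, and the tiny in-memory "disks" are all hypothetical, and parity is kept on a dedicated disk for simplicity (a RAID4-style layout; the XOR arithmetic is the same as in RAID5). It only illustrates that a URE makes one stripe unrecoverable while every other stripe can still be reconstructed.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together (the parity arithmetic)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving_disks, unreadable, policy="continue"):
    """Reconstruct the missing disk stripe by stripe (toy model).

    surviving_disks: list of disks, each a list of equal-length byte blocks
    unreadable: set of (disk_index, stripe_index) pairs that return a URE
    policy: "abort" stops at the first URE; "continue" records the stripe
            as bad and carries on (hypothetical names for this sketch)
    """
    rebuilt, bad_stripes = [], []
    for s in range(len(surviving_disks[0])):
        if any((d, s) in unreadable for d in range(len(surviving_disks))):
            if policy == "abort":
                raise IOError(f"URE in stripe {s}: rebuild aborted")
            bad_stripes.append(s)   # this stripe cannot be reconstructed
            rebuilt.append(None)
            continue
        rebuilt.append(xor_blocks([disk[s] for disk in surviving_disks]))
    return rebuilt, bad_stripes


# Two data disks plus parity, four stripes each; disk d1 has died.
d0 = [bytes([i]) * 4 for i in (1, 2, 3, 4)]
d1 = [bytes([i]) * 4 for i in (10, 20, 30, 40)]
parity = [xor_blocks([a, b]) for a, b in zip(d0, d1)]

# A URE hits stripe 2 on the surviving data disk during the rebuild.
rebuilt, bad = rebuild([d0, parity], unreadable={(0, 2)})
print("unrecoverable stripes:", bad)
print("stripes recovered intact:", sum(r is not None for r in rebuilt))
```

With `policy="continue"`, only the stripe that hit the URE is reported as bad and the other three are recovered; with `policy="abort"`, the same URE stops the whole rebuild even though the remaining stripes were readable.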

If a URE happens, you will see some data corruption in the affected block, which is typically 256KB-1MB in size, but that does not mean ALL the data on your volume is lost. What is not so great about RAID5 is a totally different thing: the rebuild itself is stressful for the remaining disks, and there is a high chance of a second disk failing during it. In that case all the data would be lost.
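For a sense of scale, the following back-of-the-envelope Python sketch compares one damaged block of that size with the whole volume. The 4 TiB volume size is a made-up figure chosen purely for illustration, and `affected_fraction` is a hypothetical helper, not part of any RAID tool.

```python
KIB = 1024
MIB = 1024 ** 2
TIB = 1024 ** 4


def affected_fraction(damaged_bytes, volume_bytes):
    """Fraction of the volume covered by the damaged region."""
    return damaged_bytes / volume_bytes


volume = 4 * TIB  # assumed volume size, for illustration only

for block in (256 * KIB, 1 * MIB):
    fraction = affected_fraction(block, volume)
    print(f"{block // KIB} KiB damaged out of 4 TiB: {fraction:.2e} of the volume")
```

Under these assumptions a single damaged block is well under one millionth of the volume, which is why one URE on its own does not imply losing everything.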

I would explain it the other way around:

What could happen if the RAID controller doesn't stop on a URE?

I have lived through this on a server: the RAID controller never noticed the URE, and after the rebuild, corruption started to build up across the entire RAID volume.

The disk developed more bad sectors after the rebuild, and more of the data became corrupted.

The disk was never kicked out of the RAID volume; the controller failed at its job of protecting data integrity.

The point of this example is that a controller can't trust a volume with a URE at all. Stopping is for the sake of data integrity, since the volume is not meant to be a backup, only resilience against a disk failure.

I'd suggest reading this question and answers for a bit more background. Then go and re-read the question you linked to again.

When someone says about this situation that "the RAID failed," it means you lost the benefit of the RAID - you lost the continuous access to data that was the reason you set up the RAID array in the first place.

You haven't lost all the data, but the most common way to recover from one dead drive plus (some) UREs on (some of) the remaining drives would be to completely rebuild the array from scratch, which will mean restoring all your data from backup.
