write hole: which RAID levels are affected?
In my journey to understanding the advantages of RAIDZ, I came across the concept of the write hole.
As this page explains, a write hole is the inconsistency you get among the disks of an array when power is lost during a write. That page also explains that it affects both RAID-5/6 (if power is lost after the data has been written, but before the parity has been calculated) and RAID-1 (data is written to one disk but not the others), and that it is an insidious problem that can only be detected during a resync/scrub or (disastrously) during the reconstruction of a failed disk. However, most other sources talk about it as if it affected only parity-based RAID levels.
From what I understand, this could be a problem for RAID-1 too, since reads from the disk containing the hole would return garbage. So: is it a problem for every RAID level, or not? Is it implementation-dependent? Does it affect software RAID only, or hardware controllers as well? (Extra: how does mdadm fare in this regard?)
The term write hole is used to describe two similar, but distinct, problems that arise with non-battery-protected RAID arrays:
sometimes it is improperly defined as any corruption of a RAID array due to sudden power loss. With this (erroneous) definition, RAID1 is vulnerable to the write hole because you cannot atomically write to two different disks;
the proper definition of the write hole, the loss of redundancy for an entire stripe due to a sudden power loss during a stripe update, applies only to parity-based RAID.
The second, and correct, definition of the write hole needs some more explanation. Let's assume a 3-disk RAID5 array with a 64K chunk size and a 128K stripe size (plus 64K of parity for each stripe). If power is lost after writing 4K to disk #1 but during the parity update on disk #3, we can end up with a bogus (i.e., corrupted) parity chunk and an undetected data consistency issue. If, later, disk #2 dies and the parity is used to recover the original data by XORing disk #1 and disk #3, the reconstructed 64K chunk, originally residing on disk #2 and not recently written, will nonetheless be corrupted.
This is a contrived example, but it exposes the main problem with the write hole: the loss of untouched, at-rest, unrelated data that shares a stripe with the latest, interrupted writes. In other words, if fileA was written years ago but shares its stripe with the just-written fileB, and the system loses power during the fileB update, fileA will be at risk.
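To make the failure sequence concrete, here is a minimal Python sketch of the scenario above, using 8-byte toy chunks in place of 64K ones; the in-memory "disks" and all names are of course hypothetical, not any real RAID implementation:

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

# Consistent 3-disk stripe: parity = data1 XOR data2.
data1 = b"fileB_v1"          # chunk on disk #1, about to be rewritten
data2 = b"fileA_ol"          # chunk on disk #2, at-rest data
parity = xor(data1, data2)   # parity chunk on disk #3

# Power is lost mid-update: the new data reaches disk #1, but the
# matching parity write to disk #3 never happens, so parity goes stale.
data1 = b"fileB_v2"
# parity = xor(data1, data2)   # <-- this write was lost

# Later, disk #2 dies and is reconstructed by XORing the survivors:
reconstructed = xor(data1, parity)
assert reconstructed != data2   # fileA's at-rest data comes back corrupted
```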
Another thing to consider is the write policy of the array: read/reconstruct/write (i.e., the entire stripe is rewritten when a partial write happens) versus read/modify/write (i.e., only the affected chunk and the parity are updated) exposes the array to different kinds of write hole.
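A rough sketch of the two policies, again on a toy in-memory stripe (the function names are illustrative, not taken from any real implementation):

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def read_modify_write(stripe, parity, idx, new_chunk):
    # Only the affected chunk and the parity are touched:
    # new_parity = old_parity XOR old_chunk XOR new_chunk
    new_parity = xor(xor(parity, stripe[idx]), new_chunk)
    stripe[idx] = new_chunk   # a crash between this write and the parity
    return new_parity         # write leaves the whole stripe inconsistent

def read_reconstruct_write(stripe, parity, idx, new_chunk):
    # The rest of the stripe is read back and parity is recomputed from
    # scratch; the crash window still exists, but the I/O pattern (and
    # which on-disk state can be torn) differs.
    stripe[idx] = new_chunk
    new_parity = stripe[0]
    for chunk in stripe[1:]:
        new_parity = xor(new_parity, chunk)
    return new_parity

stripe = [b"AAAAAAAA", b"BBBBBBBB"]
parity = xor(stripe[0], stripe[1])
parity = read_modify_write(stripe, parity, 0, b"CCCCCCCC")
assert parity == xor(stripe[0], stripe[1])   # consistent once both writes land
```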
From the above, it should be clear why RAID0 and RAID1 do not suffer from a proper write hole: they have no parity which can go "out-of-sync" and invalidate an entire stripe. Please note that RAID1 mirror legs can be out-of-sync after an unclean shutdown, but the only corruption will affect the latest written data. Previously written data (i.e., data at rest) will not face any trouble.
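A tiny sketch of why RAID1 divergence only endangers in-flight data, with two byte arrays standing in for the mirror legs (purely illustrative):

```python
leg_a = bytearray(b"fileA_at_rest...fileB_v1")   # mirror leg on disk #1
leg_b = bytearray(b"fileA_at_rest...fileB_v1")   # mirror leg on disk #2

# Power loss mid-write: the update to fileB reaches leg A but not leg B.
leg_a[16:24] = b"fileB_v2"

# The legs disagree, but only inside the just-written region: a read of
# fileB may return either version, while fileA is intact on both legs.
assert leg_a[:16] == leg_b[:16]       # data at rest is unaffected
assert leg_a[16:24] != leg_b[16:24]   # only the interrupted write diverges
```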
Having defined and scoped the proper write hole, how can it be avoided?
HW RAID uses a non-volatile write cache (i.e., BBU-protected DRAM, or a capacitor-backed flash module) to persistently store to-be-written updates. If power is lost, the HW RAID card re-issues any pending operation, flushing its cache to the disk platters, when power is restored and the system boots up. This protects not only from the proper write hole, but also from last-written data corruption (the journal sketch after this list loosely mimics this behavior);
Linux MD RAID uses a write-intent bitmap which records the to-be-written stripes before updating them. If power is lost, the dirty bitmap is used to recalculate the parity of any affected stripe. This protects from the real write hole only; the latest written data can still be corrupted (unless backed by fsync() + write barriers). The same method is used to re-sync out-of-sync portions of a RAID1 array (to be sure the two mirror legs are in sync, although no write hole exists for mirrors). A minimal sketch of this approach follows the list;
newer Linux MD RAID5/6 can optionally use a logging/journal device, partly simulating the non-volatile write-back cache of a proper HW RAID card (and, depending on the specific patch/implementation, protecting either from both the write hole and last-written data corruption, or from the write hole only); this too is sketched below;
finally, RAIDZ avoids both the write hole and last-written data corruption with the most "elegant", but performance-impacting, method: writing full-sized stripes only (and journaling any synchronous write in the ZIL/SLOG); see the copy-on-write sketch closing this list.
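As a rough illustration of the write-intent bitmap, here is a toy model; the real md bitmap granularity, persistence and on-disk format differ, and all names here are made up:

```python
def xor_chunks(chunks):
    out = chunks[0]
    for c in chunks[1:]:
        out = bytes(x ^ y for x, y in zip(out, c))
    return out

dirty = set()   # stand-in for the persisted write-intent bitmap

def write_chunk(stripes, parities, s, idx, new_chunk):
    dirty.add(s)                          # 1. persist the dirty bit first
    stripes[s][idx] = new_chunk           # 2. then write the data...
    parities[s] = xor_chunks(stripes[s])  # 3. ...and the parity
    dirty.discard(s)                      # 4. clear the bit once both landed

def recover(stripes, parities):
    # After an unclean shutdown, only stripes marked dirty get their
    # parity recomputed; at-rest stripes are trusted as-is. This restores
    # parity consistency but cannot repair the interrupted write itself.
    for s in dirty:
        parities[s] = xor_chunks(stripes[s])
    dirty.clear()
```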
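A similar sketch of the journal/log approach, which is conceptually also what a HW controller's battery-backed cache achieves: make data and parity durable in a log before touching the array, then replay the log at boot (again a toy model, not md's actual raid5-cache format):

```python
journal = []   # stand-in for a persistent log device

def journaled_write(stripes, parities, s, idx, new_chunk, new_parity):
    journal.append((s, idx, new_chunk, new_parity))  # make durable first
    stripes[s][idx] = new_chunk    # then update the array in place;
    parities[s] = new_parity       # a crash anywhere here is recoverable
    journal.pop()                  # checkpoint: the entry is now redundant

def replay(stripes, parities):
    # On boot after a crash, re-issue every pending logged update so that
    # data and its matching parity both land, closing the write hole and
    # preserving the last-written data as well.
    for s, idx, chunk, parity in journal:
        stripes[s][idx] = chunk
        parities[s] = parity
    journal.clear()
```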
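And finally a sketch of the full-stripe, copy-on-write idea behind RAIDZ: stripes are never modified in place, and a single atomic pointer update commits each write, loosely mimicking ZFS block pointers (all names hypothetical):

```python
def xor_chunks(chunks):
    out = chunks[0]
    for c in chunks[1:]:
        out = bytes(x ^ y for x, y in zip(out, c))
    return out

stripes = {}     # stripe_id -> (data_chunks, parity), never updated in place
block_ptr = {}   # logical block -> stripe_id (loosely, a ZFS block pointer)
next_id = 0

def cow_write(block, data_chunks):
    global next_id
    sid, next_id = next_id, next_id + 1
    # Parity is always computed over a complete, freshly written stripe,
    # so it can never be stale with respect to at-rest data.
    stripes[sid] = (data_chunks, xor_chunks(data_chunks))
    # A single atomic pointer update commits the write; a crash before
    # this line simply leaves the old, fully consistent stripe in place.
    block_ptr[block] = sid
```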
Useful links:
https://neil.brown.name/blog/20110614101708
https://www.kernel.org/doc/Documentation/md/raid5-ppl.txt
https://www.kernel.org/doc/Documentation/md/raid5-cache.txt
https://lwn.net/Articles/665299/
This is why a cache battery, or some other method of guaranteeing cache consistency, is required for RAID. All RAID cards should have battery-backed cache, and all storage controllers should have mirrored cache. For software RAID, I don't think there is a good answer. I think even RAID-Z can fail on a power loss.