Is RAIDZ1 worse than no fault tolerance for an array of 4TB drives?
On this question, Michael Kjörling and user121391 seem to make a case that RAIDZ1 (ZFS's equivalent of RAID5) is not reliable and that I should use RAIDZ2 (ZFS's equivalent of RAID6). user121391 comments there:
While rebuilding a failed drive, all data from all drives must be read. This increases the stress on the disks (especially if they are mostly idle normally) and therefore the chance of another drive failing. Additionally, while reading all data you may get an URE from one of the disks with no second disk to compensate, which means files can be damaged/lost. Third, the bigger your disks are, the longer your window of vulnerability becomes, not just for those problems, but any problems that may occur on the disks or system (power outage etc.).
For my specific use case (a home media server), I am looking to gain some fault tolerance with minimal expense in terms of redundant storage. Anything not otherwise recoverable will be backed up, but a drive failure would still be very inconvenient: I would have to rip a large number of DVDs and books again and re-download large amounts of music from various services to rebuild the media server.
My question is: is RAIDZ1 an incremental improvement over no fault tolerance, given that I am not willing to sacrifice more than 25-33% of the total pool size for fault tolerance? Or will it dramatically increase the chance that, if one disk fails, the entire pool fails, causing complete data loss?
If it helps at all, most of this data will not be changing (it is media files), and everything not theoretically recoverable will be backed up.
I think there was a misunderstanding in the old thread. I was comparing the chance of two disks failing in a row when using either Z1 parity RAID or no RAID (as you stated in the comments in the other thread). In my eyes it was never about Z1 vs. a striped pool of basic vdevs, because that game is essentially over after the first fault anyway, so Z1 is of course better.
But if you just compare multiple independent pools against a single pool with a single Z1 vdev, then the problem of increased load while recalculating the parity information persists.
On the comparison of Z1 vs. Z2, which Michael's answer was mainly about, the other two points apply. I should have been clearer in the comments, but they are unfortunately limited in space. I hope this answer clears some of this up.
I thought the same thing, but I didn't realize that a URE isn't just a bit flip; it can spoil the entire pool.
If we simplify the whole thing, you have your disk with its controller chip on the bottom, and your hardware (RAID controller) or software (e.g. ZFS) on top.
If an error happens in the hardware and a sector cannot be read, the chip first tries to correct it on its own if possible (for example by reading the problem sector multiple times). If it still can't make anything out of it, it gives up. On normal disks, this can take minutes and stalls the complete system, which is waiting for a "success" or "failure" message for the pending I/O operation.
Some disks have a feature called TLER (time-limited error recovery), a hard timeout that limits this error-correction time to about 6-9 seconds. Traditionally, most hardware RAID controllers dropped the whole disk after 9 seconds, so with TLER a single bad sector does not make the whole disk unavailable; instead it is corrected from the "good" sectors on the other disks. A single disk in a desktop system cannot rely on such redundancy, so there a long timeout is preferable.
Now let's look at the software side: if you configure your RAID controller or ZFS with redundancy, for example by using mirrored disks or a mirror vdev as the basis for your pool, your URE can be corrected. If you do not use redundancy, the data in that sector will be gone; depending on your luck, it may be data you care about, some random old temp data, or nothing at all. The same applies to bit flips, although the chance of those happening seems to depend more on outside effects (like cosmic radiation).
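To make that interplay concrete, here is a minimal Python sketch of the read path just described. This is not real ZFS or controller code; the timeouts, retry durations, and function names are all illustrative assumptions.

```python
import random

# Illustrative assumptions, not vendor specifications:
TLER_TIMEOUT_S = 7.0      # typical TLER cap, somewhere in the 6-9 s range
CONTROLLER_DROP_S = 9.0   # classic hardware-RAID timeout before dropping a disk

def drive_read(sector_ok: bool, tler: bool) -> tuple[str, float]:
    """One sector read; returns (result, seconds the drive spent on it)."""
    if sector_ok:
        return "ok", 0.01
    # Bad sector: the drive retries internally, possibly for a very long time.
    recovery_time = random.uniform(1, 120)   # desktop drives may grind for minutes
    if tler and recovery_time > TLER_TIMEOUT_S:
        return "error", TLER_TIMEOUT_S       # TLER: give up early, report the URE
    return "error", recovery_time            # no TLER: stall until giving up

def raid_read(sector_ok: bool, tler: bool, redundant: bool) -> str:
    result, elapsed = drive_read(sector_ok, tler)
    if elapsed > CONTROLLER_DROP_S:
        return "disk dropped from the array"  # controller assumes the disk died
    if result == "ok":
        return "data returned"
    # URE reported in time: with redundancy the sector is rewritten from
    # the other disks; without it, that sector's data is simply gone.
    return "sector repaired from redundancy" if redundant else "data lost"

print(raid_read(sector_ok=False, tler=True, redundant=True))   # repaired
print(raid_read(sector_ok=False, tler=False, redundant=True))  # often: disk dropped
```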
RAID0 has no rebuild during which a URE can prove fatal (a URE there costs only the affected sector), so the question is: what is more likely, a URE during a RAIDZ rebuild or a disk failure in RAID0?
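Some back-of-the-envelope numbers for that question, using the commonly quoted consumer URE rate of one error per 10^14 bits read and an assumed 3% annual failure rate per drive. Both figures are assumptions and real drives vary widely:

```python
import math

URE_RATE = 1e-14            # unrecoverable read errors per bit read (assumed)
DRIVE_BITS = 4e12 * 8       # one 4 TB drive, in bits
AFR = 0.03                  # assumed annualized failure rate per drive

# RAIDZ1 of four 4 TB drives: after one drive fails, the three survivors
# must be read in full to resilver.
bits_read = 3 * DRIVE_BITS
p_ure_during_rebuild = 1 - math.exp(-URE_RATE * bits_read)
print(f"P(at least one URE during the rebuild) ~ {p_ure_during_rebuild:.0%}")  # ~62%

# Striped/basic vdevs: there is no rebuild; any single drive failure
# immediately costs you that drive's data.
p_some_drive_fails = 1 - (1 - AFR) ** 4
print(f"P(some drive fails within a year)      ~ {p_some_drive_fails:.0%}")    # ~11%
```

One caveat worth keeping in mind: on ZFS, a URE hit during a resilver typically damages only the affected files (which the checksums will identify), whereas many hardware RAID5 controllers abort the entire rebuild. So the ~62% is the chance of at least one such error, not necessarily of total pool loss.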
I haven't accepted this answer because I don't think it adequately explains the relevant points, but I was planning on creating my own answer once I understand why UREs destroy the whole pool, if no one else gets to it first.
I suggest you read a basic explanation of ZFS pool layout. To summarize the most important bits (a toy model of these rules follows the list):
- You can create virtual devices (vdevs) from disks, partitions or files. Each vdev can be created with a different level of redundancy: basic (no redundancy), mirror (an N-way mirror survives the loss of N-1 disks), or parity RAID Z1/Z2/Z3 (1/2/3 disks can fail). All redundancy works at the vdev level.
- You create storage pools from one or more vdevs. Vdevs in a pool are always striped; therefore the loss of a single vdev means the loss of the whole pool.
- You can have any number of pools, which are independent. If one pool is lost, the other pools continue to function.
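Those three rules are mechanical enough to capture in a few lines. The following toy model is mine, not a ZFS API, but it encodes the same failure semantics:

```python
# Toy failure model of the rules above: a vdev survives while the number of
# failed disks stays within its redundancy; a pool is a stripe of vdevs and
# dies as soon as any single vdev dies. Pools are independent of each other.

PARITY = {"basic": 0, "mirror2": 1, "mirror3": 2, "raidz1": 1, "raidz2": 2, "raidz3": 3}

def vdev_ok(kind: str, failed_disks: int) -> bool:
    return failed_disks <= PARITY[kind]

def pool_ok(vdevs: list[tuple[str, int]]) -> bool:
    """vdevs: list of (vdev kind, number of failed disks in that vdev)."""
    return all(vdev_ok(kind, failed) for kind, failed in vdevs)

print(pool_ok([("raidz1", 1)]))                   # True: one failure absorbed
print(pool_ok([("raidz1", 2)]))                   # False: the second failure is fatal
print(pool_ok([("raidz2", 2), ("mirror2", 1)]))   # True: each vdev within its limit
print(pool_ok([("raidz2", 0), ("basic", 1)]))     # False: one dead vdev kills the pool
```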
Therefore you can reason the following:
- If possible, prefer Z2 over Z1 because of the increased load and the big window of (negative) opportunity when rebuilding large drives (large being roughly anything over 1 TB)
- If you have to choose between Z1 and multiple basic vdevs, prefer Z1, because Z1 can correct bit errors, which is not possible with basic vdevs
- If you can accept partial pool loss, segment your storage into multiple smaller pools backed by a single vdev each, so that you still get checksum information and faster rebuild times after fatal faults
In any of the above cases, you need to have a backup. If you cannot or don't want to afford any backup, it comes down to what you are more comfortable losing: some parts of the pool with higher probability, or everything with lower probability. I personally would choose the first option, but you may decide otherwise.
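To put rough numbers on that trade-off, here is a sketch under stated assumptions (per-disk AFR, resilver window, and the URE figure computed earlier; plug in your own):

```python
import math

AFR = 0.03            # assumed annual failure rate per disk
REBUILD_DAYS = 2      # assumed resilver window for a 4 TB drive
N = 4                 # disks, 4 TB each

# Option A: four independent basic pools, each holding a quarter of the data.
p_lose_a_quarter = 1 - (1 - AFR) ** N              # ~11%/year, but only 25% of data
expected_loss_a = AFR                              # each quarter is lost at rate AFR

# Option B: one RAIDZ1 pool; total loss needs a second fault during the rebuild.
p_first = 1 - (1 - AFR) ** N
p_second = 1 - (1 - AFR * REBUILD_DAYS / 365) ** (N - 1)
p_ure = 1 - math.exp(-1e-14 * (N - 1) * 4e12 * 8)  # ~62%, as computed earlier,
                                                   # pessimistically treated as fatal
p_rebuild_fails = p_second + p_ure - p_second * p_ure
p_lose_everything = p_first * p_rebuild_fails
expected_loss_b = p_lose_everything                # all or nothing

print(f"A: P(lose 25% of data)/yr ~ {p_lose_a_quarter:.1%}, expected loss {expected_loss_a:.1%}")
print(f"B: P(lose everything)/yr  ~ {p_lose_everything:.1%}, expected loss {expected_loss_b:.1%}")
```

Under these pessimistic assumptions (every rebuild URE treated as fatal) the segmented pools lose less data in expectation, which matches the preference above; soften the URE assumption and RAIDZ1 wins again. The conclusion is sensitive to how fatal you believe a rebuild URE to be.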
What is implied in the answer you quoted is that with increasing storage capacity the chance of failure increases accordingly, not only during a rebuild operation but during normal activity as well. So, statistically speaking, RAIDZ1 is no more fault tolerant than RAID0 when talking about modern 4 TB drives, even though a prima facie case is made that it is.
So some argue that RAIDZ1 is, in fact, not an increase in protection against data loss for large-capacity hard disk drives. This has less to do with mechanical failure of the drive(s), or at least not with critical failure. A URE is, to put it simply (and very simplistically), a failure to read. Whether it is caused by a prolonged read from a bad sector, the disk running out of spare sectors, or anything else is not really the issue: it will happen, like it or not.

Take the bad-sector example. Normally this is handled by the drive internally, but if there are enough bad sectors, or the drive takes its sweet time trying to fix one, the RAID controller (or ZFS) might interpret the delay as a drive failure and eject the drive. Now imagine it's the SECOND drive in the pool, and it happens while rebuilding... The only viable mitigation is to scrub the array regularly for those errors: if caught early, the error is just a hiccup and the pool recovers the data easily. But scrubbing puts quite a big load on the drives, which in turn drastically increases the statistical chance of a URE (remember: age, writes, and the volume of data already increase it a lot, even without multiplying the reads far beyond normal operation; and this holds for each drive separately).
Thus the answer to your question ("is a RAIDZ1 an incremental improvement on no fault tolerance?") is: not really. If we follow the logic of the quote, you face roughly a 50% chance (I think) of enough disk failures for the data to be unrecoverable within the first two years of the disks' operation.
That is why, when our company was faced with the dilemma of server availability versus storage capacity, we bit the bullet and went for RAID6 on SSDs. That should be enough for a couple of years, with an upgrade later if needed.
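For what it's worth, the 50% figure above is easy to sanity-check, and it depends entirely on the rates you assume. A quick sketch with one plausible set of inputs:

```python
import math

AFR = 0.05          # assumed per-disk annual failure rate; published figures vary ~1-8%
URE_RATE = 1e-14    # assumed consumer-class URE rate per bit
N, YEARS = 4, 2     # four 4 TB drives, two years

# P(some disk fails within two years), approximating independence per disk-year:
p_first = 1 - (1 - AFR) ** (N * YEARS)
# P(a URE strikes while the remaining three drives are read in full):
p_ure = 1 - math.exp(-URE_RATE * (N - 1) * 4e12 * 8)
print(f"P(failure followed by a fatal rebuild) ~ {p_first * p_ure:.0%}")  # ~21% here
```

With these inputs the two-year risk lands closer to 20% than 50%; you only reach 50% with a notably higher AFR or URE rate, so treat any single headline number with suspicion.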