Is a large RAID-Z array just as bad as a large RAID-5 array?
For a long time I've heard that a large (>5 TB?) RAID-5 array is a bad idea, simply because there's a high risk of a second drive failing during the rebuild.
Has RAID-Z1 managed to remedy this for an array of any size (if you absolutely need a number, consider 4x2TB or 5x2TB)? Perhaps it has a safer way of re-replicating the data that isn't as intense on all the drives?
Solution 1:
Even given what one of the other answers here laid out, namely that ZFS only resilvers actual used blocks and not empty space, yes, it is still dangerous to make a large RAIDZ1 vdev. Most pools end up at least 30-50% utilized, and many go right up to the recommended maximum of 80% (some go past it; I strongly recommend against that, for performance reasons), so the fact that ZFS deals only with used blocks is not a huge win.

Also, some of the other answers make it sound like a bad read is what causes the problem. It is not. Bit rot inside a block is usually not what's going to screw you here; it's another disk flat-out failing while the resilver from the first failed disk is still in progress that will kill you. On 3 TB disks in a large raidz1, resilvering onto a new disk can take days, even weeks, so your chance of that happening is not insignificant.
My personal recommendation to customers is to never use RAIDZ1 (the RAID5 equivalent) at all with disks larger than 750 GB, just to avoid a lot of potential unpleasantness. I've been OK with them breaking this rule for other reasons (the system has a backup somewhere else, the data isn't that important, etc.), but usually I do my best to push for RAIDZ2 as the minimum option with large disks.
Also, for a number of reasons, I usually recommend not going beyond 8-12 disks in a raidz2 stripe or 11-15 disks in a raidz3 stripe. You should be at the low end of those ranges with 3 TB disks, and you might be OK at the high end with 1 TB disks. Reducing the odds that more disks fail while a resilver is in progress is only one of those reasons, but it is a big one.
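For a sense of that danger window in practice, this is roughly what a disk replacement and resilver looks like; a minimal sketch, assuming a pool named tank (the by-id device names are hypothetical placeholders, not from the original answer):

    # Tell ZFS to rebuild onto the replacement disk
    # (old/new device names are placeholders)
    zpool replace tank /dev/disk/by-id/old-disk /dev/disk/by-id/new-disk

    # Watch the resilver; on a large raidz1 vdev the estimated
    # completion time shown here can run to days, and that whole
    # window is when a second failure loses the pool
    zpool status tank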
If you're looking for some sane rules of thumb (edit 04/10/15: I wrote these rules with only spinning disks in mind; because they're also logical [why would you use fewer than 3 disks in a raidz1?] they make some sense even for SSD pools, but all-SSD pools were not a thing in my head when I wrote these down):
- Do not use raidz1 at all on > 750 GB disks.
- Do not use less than 3 or more than 7 disks on a raidz1.
- If thinking of using 3-disk raidz1 vdevs, seriously consider 3-way mirror vdevs instead.
- Do not use less than 6 or more than 12 disks on a raidz2.
- Do not use less than 7 or more than 15 disks on a raidz3.
- Always remember that unlike traditional RAID arrays, where the number of disks increases IOPS, in ZFS it is the number of vdevs that does, so going with shorter-stripe vdevs improves a pool's IOPS potential (see the sketch after this list).
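To make that last point concrete, here is a minimal sketch of the trade-off; the pool name tank and the short device names (d1, d2, ...) are placeholders, not from the original answer:

    # One wide 12-disk raidz2 vdev: capacity-efficient, but the
    # whole pool has roughly the random IOPS of a single vdev
    zpool create tank raidz2 d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12

    # Two 6-disk raidz2 vdevs: less usable space, but ZFS stripes
    # across both vdevs, roughly doubling the pool's IOPS potential
    zpool create tank raidz2 d1 d2 d3 d4 d5 d6 raidz2 d7 d8 d9 d10 d11 d12

    # The 3-way mirror alternative to a 3-disk raidz1
    zpool create tank mirror d1 d2 d3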
Solution 2:
Is RAID-Z as bad as RAID-5? No. Is it as good as RAID-1 or RAID-10? Usually not.
RAID-Z is aware of blank spots on the drives, where RAID-5 is not, so RAID-Z only has to read the areas holding data to recover the missing disk. Also, data isn't necessarily striped across all the disks; a very small file might reside on just a single disk, with its parity on another disk. Because of this, a RAID-Z rebuild only has to read as much data as is actually used on the array (if 1 MB is used on a 5 TB array, then a rebuild only needs to read 1 MB).
Going the other way, if most of a large array is full, then most of the data will need to be read off all the disks. Compare that to RAID-1 or RAID-10, where the data only needs to be pulled off exactly one disk per failed disk (and multiple failed disks matter only in situations where the array is still recoverable anyway).
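If you want a feel for how much data a resilver would have to touch on your own pool, the allocated space is the number that matters; a quick check, assuming a pool named tank:

    # ALLOC is roughly what a RAID-Z resilver has to read and
    # reconstruct; FREE space is skipped entirely
    zpool list tank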
What you're worrying about is the fact that with every sector read there's a chance you'll hit a sector that wasn't written correctly or is no longer readable. For a typical drive these days that rate is around 1x10^-16 per bit read (not all drives are equal, so look up the specs on your drives to find their rating). This is incredibly infrequent, but it comes out to about one error per petabyte read; for a 10 TB array there's roughly a 1% chance your array is toast and you don't know it until you try to recover it.
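To sanity-check those numbers yourself, the arithmetic is just bits read times the error rate; a quick sketch, assuming the 1x10^-16 per-bit rate quoted above:

    # Expected unrecoverable read errors when reading a full 10 TB array
    # at a rate of 1 error per 10^16 bits:
    # 10 TB = 1e13 bytes = 8e13 bits
    awk 'BEGIN { printf "%.4f\n", 8.0e13 * 1.0e-16 }'   # prints 0.0080, i.e. ~0.8%, the ballpark 1% above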
ZFS also helps mitigate this risk, since most unreadable sectors are noticed before you start rebuilding your array. If you scrub your ZFS pool on a regular basis, the scrub operation will pick up these errors and work around them (or alert you so you can replace the disk, if that's how you roll). The usual recommendation is to scrub enterprise-grade disks about one to four times a month, and consumer-grade drives at least once a week, or more.
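Kicking off and scheduling scrubs is a one-liner; a minimal sketch, assuming a pool named tank (the cron schedule is just an example, pick whatever interval matches your drive grade):

    # Run a scrub now and check on its progress
    zpool scrub tank
    zpool status tank

    # Example cron entry: scrub every Sunday at 02:00
    0 2 * * 0 /sbin/zpool scrub tank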
Solution 3:
Some of RAID-Z's advantages over traditional RAID-5 are that it doesn't require specialized hardware and that it is more reliable because it avoids the RAID-5 write hole.
However, neither RAID-Z1 nor RAID-5 can survive more than one disk failure.
If you want to survive two disk failures with ZFS, you can use RAIDZ2; for three disk failures, RAIDZ3.
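For completeness, creating those looks like this; a minimal sketch with placeholder pool and device names:

    # Survives any 2 disk failures within the vdev
    zpool create tank raidz2 d1 d2 d3 d4 d5 d6

    # Survives any 3 disk failures within the vdev
    zpool create tank raidz3 d1 d2 d3 d4 d5 d6 d7 d8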