ZFS stripe on top of hardware RAID 6. What could possibly go wrong?

I have a SAN rack with 36 × 4 TB HDDs. The RAID controller does not support RAID60, and no more than 16 HDDs can go into one RAID group. So I decided to make either 2 RAID6 groups of 16 HDDs or 4 groups of 8 HDDs. I want to present all of the storage as one partition.

So, what could possibly go wrong if I use a ZFS pool on top of hardware RAID6? Yes, I know it is strongly recommended to use raw HDDs or pass-through mode, but I don't have that option.

Or should I stay away from ZFS and software RAID in this situation? (I'm mostly interested in compression and snapshots.)


Solution 1:

So I decided to make 2 RAID6 groups of 16HDD or 4 of 8 HDDs.

That's not the best way to do things. It may work well enough, but depending on your performance requirements, it may not.

The ideal geometry for a RAID5/6 array is one where the block size of the file system built on top of it is an exact multiple of the amount of data that "spans" the array.

RAID5/6 arrays work as block devices - a single block of data spans the disks in the array, and that block also contains parity data. Most RAID controllers will write a power-of-two sized chunk of data to each disk in the array - the exact value of which is configurable in better RAID systems - and your Dot Hill unit is one of those "better RAID systems". That's important.

So it takes N × (per-disk chunk size) of data to span the array, where N is the number of data disks: a 5-disk RAID5 array has 4 "data" disks, and a 10-drive RAID6 array has 8 data disks.

Here's why that matters: when data is written to a RAID5/6 array and the block being written is big enough to span the entire array, the parity is computed for that data - usually in the controller's memory - and then the entire stripe is written to disk. Simple, and fast.

But if the chunk of data being written isn't big enough to span the entire array, what does the RAID controller have to do in order to compute the new parity data? Think about it - it needs all the data in the entire stripe to recompute the new parity data.

So if you make a 16-drive RAID6 array with the default per-disk chunk of 512 kB, that gives you 14 data disks, which means it takes 7 MB to "span" the array.
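A quick back-of-the-envelope check of that figure (plain shell arithmetic, nothing controller-specific):

    # 16-drive RAID6 = 14 data disks + 2 parity disks
    # stripe width = data disks x per-disk chunk size
    echo $(( (16 - 2) * 512 ))   # 7168 kB, i.e. 7 MB to span the array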

ZFS works in 128 kB blocks, generally - that's the default dataset recordsize.
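If you want to verify or tune that on the ZFS side, the knob is the recordsize property. A minimal sketch, with "tank" and "tank/somedataset" as hypothetical pool and dataset names:

    # Show the current recordsize (128K is the ZFS default)
    zfs get recordsize tank
    # It can be lowered per dataset if a workload calls for it, e.g.:
    zfs set recordsize=64K tank/somedataset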

So ZFS writes a 128 kB block to a 16-drive RAID6 array. In the configuration you're proposing, that means the RAID controller needs to read almost 7 MB from the array, recompute the parity across those 7 MB, and then rewrite that entire 7 MB back to disk.

If you're lucky, it's all in cache and you don't take a huge performance hit. (This is one major reason why the "don't use RAID5/6" position has such a following - RAID1[0] doesn't suffer from this.)

If you're unlucky and you didn't properly align your filesystem partitions, that 128kB block spans two RAID stripes that aren't in cache, and the controller needs to read 14 MB, recompute parity, then write 14 MB. All to write one 128kB block.

Now, that's what needs to happen logically. There are a lot of optimizations that good RAID controllers can take to reduce the IO and computational load of such IO patterns, so it might not be that bad.

But under heavy load of writing 128kB blocks to random locations, there's a really good chance that the performance of a 16-drive RAID6 array with a 7 MB stripe size will be absolutely terrible.

For ZFS, the "ideal" underlying RAID5/6 LUNs for a general purpose file system where most accesses are effectively random would have a stripe size that's an even divisor of 128kB, such as 32kB, 64kB, or 128kB. In this case, that limits the number of data disks in a RAID5/6 array to 1 (which is nonsensical - even if possible to configure, it's better to just use RAID1[0]), 2, 4, or 8. Best performance in the best-case scenario would be to use a 128kB stripe size for the RAID5/6 arrays, but best-case doesn't happen often in general-purpose file systems - often because file systems don't store metadata the same as they store file data.

I'd recommend setting up either 5-disk RAID5 arrays or 10-disk RAID6 arrays, with the per-disk chunk size set small enough that the amount of data to span an entire array stripe is 64kB (yeah, I've done this before for ZFS - many times). That means for a RAID array with 4 data disks, the per-disk chunk size should be 16kB, while for an 8-data-disk RAID array, the per-disk chunk size should be 8kB.
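The per-disk chunk numbers fall straight out of that 64 kB target (again just shell arithmetic):

    # target stripe width / number of data disks = per-disk chunk size
    echo $(( 64 / 4 ))   # 5-disk RAID5, 4 data disks  -> 16 kB chunk
    echo $(( 64 / 8 ))   # 10-disk RAID6, 8 data disks ->  8 kB chunk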

Then allow ZFS to use the entire array - do not partition it. ZFS will align itself properly to an entire drive, whether the drive is a simple single disk or a RAID array presented by a RAID controller.

In this case, and without knowing your exact space and performance requirements, I'd recommend setting up three 10-drive RAID6 arrays or six 5-drive RAID5 arrays with a 64 kB stripe size, configuring a couple of hot spares, and saving four of your disks for whatever comes up in the future. Because something will.
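If you go that route, the pool itself stays simple: ZFS just stripes across the RAID6 LUNs the controller exports, and the redundancy lives entirely in the hardware RAID. A minimal sketch, assuming a Linux host where the three LUNs show up as sdb, sdc and sdd (those device names and the pool name "tank" are hypothetical):

    # One pool striped across the three hardware-RAID6 LUNs (whole devices, no partitions)
    zpool create tank sdb sdc sdd

    # The features the question is actually after: compression and snapshots
    zfs set compression=lz4 tank
    zfs create tank/data
    zfs snapshot tank/data@first

    # Note: ZFS can detect corruption in this layout but cannot self-heal it,
    # since the redundancy sits below ZFS in the hardware RAID.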

I would most certainly not use that disk system in JBOD mode - it's a fully NEBS Level 3-compliant device that provides significant reliability and availability protections built right into the hardware. Don't throw that away just because "ZFS!!!!". If it's a cheap piece of commodity hardware you put together from parts? Yeah, JBOD mode with ZFS handling the RAID is best - but that's NOT the hardware you have. USE the features that hardware provides.

Solution 2:

Okay, I'll bite...

This is the wrong hardware for the application. The DotHill setup has the same limitations as an HP StorageWorks MSA2000/P2000 in that only 16 drives can be used in a single array grouping.

ZFS atop hardware RAID or an exported SAN LUN is not necessarily a problem.

However, striping ZFS across LUNs over unknown interconnects and across expansion chassis can introduce some risk.

  • For instance, are you running multipath SAS in a ring topology with dual controllers? (A quick host-side check is sketched after this list.)
  • Do you have redundant cabling back to the server?
  • Have you distributed drives vertically across enclosures in a manner that would mitigate the failure of a single chassis/cable/controller and prevent it from destroying part of your RAID0 stripe?
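The first two questions can at least be sanity-checked from the host. A rough sketch, assuming a Linux head node with device-mapper-multipath installed (adjust for your OS):

    # Each SAN LUN should show up as one multipath map with two or more active paths
    multipath -ll
    # The by-path names also reveal which controller/port each path arrives on
    ls -l /dev/disk/by-path/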

Seriously, it may be worth evaluating whether you need all of this storage in a single namespace...

If you DO require that type of capacity in a single mount, you should be using a dedicated HBA-attached JBOD enclosure and possibly multiple head units with resilient cabling and a smarter layout.

Solution 3:

You should DIRECTLY attach all drives to a box running ZFS. Get a SAS HBA and connect the drives to the ZFS-capable box (e.g. running OmniOS or SmartOS). You can then share the space via NFS, SMB, iSCSI...
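A rough sketch of what that ends up looking like once the drives are visible to the ZFS box - the pool name, disk names and layout (six 6-disk raidz2 vdevs for 36 drives) are all hypothetical, just one reasonable way to carve it up:

    # Pool built from raidz2 vdevs (illumos-style disk names; add the remaining vdevs the same way)
    zpool create tank \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
        raidz2 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0

    # NFS and SMB sharing are one property away; iSCSI on illumos goes through COMSTAR
    zfs create tank/export
    zfs set sharenfs=on tank/export
    zfs set sharesmb=on tank/export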