Recommended approach to build a 24-disk pooled SSD hot-set cache: RAID, LVM JBOD, etc?

Solution 1:

You appear to be contradicting your own requirements - "My ideal solution would have each subdir (I have one self contained dataset per subdir) of / completely contained on a single disk" tells you that you don't want RAID, LVM or any other abstraction technology - surely the solution here is simply to mount each disk individually. The disadvantage is that you are likely to waste disk space, and if a data set grows you will need to spend more time juggling it. (I expect you know Unix can mount drives at arbitrary places in the filesystem tree, so with a bit of thought it should be easy enough to make the drives visible as a logical tree structure.)
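
For illustration, a minimal sketch of that per-disk layout (device names, mount points and the choice of XFS are placeholders, not anything from the question):

    # One filesystem per disk, mounted under a common tree so each dataset
    # lives entirely on a single disk (placeholder names throughout).
    mkfs.xfs /dev/sdb
    mkdir -p /srv/data/set01
    mount /dev/sdb /srv/data/set01

    # /etc/fstab entry (use UUIDs so device renumbering doesn't break the mount):
    # UUID=<uuid-of-sdb>  /srv/data/set01  xfs  defaults,noatime  0 0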

You talk about JBOD or RAID0. If you do decide on a combined-disk solution, RAID0 will give you better read performance in most cases, as data is spread evenly over the disks. RAID10 would buy you redundancy you said you don't need. JBOD is only useful to you if you have disks of different sizes, and you would be better off using LVM instead, as it can behave the same way but gives you the flexibility to move data around.
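
As a hedged sketch of that LVM flexibility (device names are assumptions), a linear volume group can later have data migrated off a member disk with pvmove:

    # Build one volume group from all 24 disks and carve out a linear (JBOD-like) LV.
    pvcreate /dev/sd[b-y]
    vgcreate datavg /dev/sd[b-y]
    lvcreate -n data -l 100%FREE datavg

    # Later, evacuate one disk before removing it from the group:
    pvmove /dev/sdq
    vgreduce datavg /dev/sdq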

I can see edge cases where LVM would help over individual disks, but in general any such scenario is likely to add more complexity than it gives useful flexibility here - particularly bearing in mind the initial statement about data sets being bound to disks.

Where you might want to spend some effort is looking at the most appropriate file system and tuning parameters.
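
For example, one hypothetical way to compare candidate filesystems and tuning options is to format a spare disk with each candidate and run the same read-heavy fio job against it (all paths and parameters below are assumptions):

    # Quick, repeatable random-read comparison on one candidate filesystem:
    fio --name=randread --filename=/srv/data/set01/testfile --size=4G \
        --rw=randread --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
        --runtime=60 --time_based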

Solution 2:

"I care about performance, complexity of maintenance, and downtime more than lost data."

Maximizing performance points you to some form of RAID-0, RAID10, or LVM striping. Complexity of maintenance rules out something like segmenting the disks by subdirectory (as another answer mentions, you end up juggling volumes). Minimizing downtime means you need some form of redundancy, since the loss of one drive takes a non-redundant array down, and you would then have to rebuild it - I read that as "downtime". Degraded-mode performance likely also rules out RAID-5.

So I'd say your options are RAID10, or RAID-1 + LVM. LVM offers some extra ability to manage the size of the volume, but a lot of that disappears if you're going to mirror it with RAID-1 anyway. According to this article https://www.linuxtoday.com/blog/pick-your-pleasure-raid-0-mdadm-striping-or-lvm-striping.html mdadm RAID-0 striping offers better performance than LVM striping.
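
If you go the RAID10 route, a minimal mdadm sketch (placeholder device names) would be:

    # 24 disks in the near-2 layout: usable capacity of 12 disks,
    # reads can be served from either copy of each mirror.
    mdadm --create /dev/md0 --level=10 --raid-devices=24 --layout=n2 /dev/sd[b-y]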

Solution 3:

If you genuinely don't care about the data - only about its performance and the speed with which you can rebuild service WHEN it fails, rather than avoiding failure - then, against all my normal better judgement, R0 will be fine.

It doesn't let you choose what data goes where, obviously, but it will be about as fast as anything I can think of. Yes, it will definitely fail eventually, but you can just have a script that removes the R0 array, rebuilds it and mounts it - that shouldn't take more than a minute or so at most, and you could even run it automatically when you lose access to the drive.
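
A rough sketch of such a script (untested; device names, mount point and filesystem are assumptions):

    #!/bin/bash
    # Throw away the failed RAID0 array, recreate it from scratch and remount it.
    # Everything on it is lost - acceptable here, since the data is rebuildable.
    set -e
    umount -l /srv/cache 2>/dev/null || true
    mdadm --stop /dev/md0 2>/dev/null || true
    mdadm --create /dev/md0 --level=0 --raid-devices=24 --run /dev/sd[b-y]
    mkfs.xfs -f /dev/md0
    mount -o noatime /dev/md0 /srv/cache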

One small question: you want a 32-vCPU VM using Skylake cores. They don't make a single socket that big, so your VM will be split across sockets, and this might not be as fast as you'd expect. Maybe test performance with 32/24/16 cores to see what the impact would be - it's worth a quick try at least.

Solution 4:

The simpler, hassle-free setup is to use a software RAID array + XFS. If, and only if, you do not care about data and availability, you can use a RAID0 array; otherwise, I strongly suggest using some other RAID layout. I generally suggest RAID10, but it commands a 50% capacity penalty; for a 24x 375GB array you can think about RAID6 or - gasp - even RAID5.
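
As a sketch of the non-RAID0 option (placeholder device names; the chunk size is an assumption worth benchmarking):

    # RAID6: survives any two failed disks at the cost of two disks of capacity.
    mdadm --create /dev/md0 --level=6 --raid-devices=24 --chunk=256 /dev/sd[b-y]
    mkfs.xfs /dev/md0     # mkfs.xfs picks up the md stripe geometry automatically
    mount -o noatime /dev/md0 /srv/data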

The above solution comes with some strings attached, most importantly that it presents you with a single block device, and skipping any LVM-based storage partitioning means no snapshot capability. On the other hand, the XFS allocator does a very good job of balancing allocations between the individual disks in a RAID0 setup.

Other possible solutions:

  • use XFS over classical LVM over RAID0/5/6: a classical LVM volume has basically no impact on performance and enables you both to dynamically partition the single block device and to take short-lived snapshots (albeit at a high performance penalty)

  • use XFS over thin LVM over RAID0/5/6: thin LVM enables modern snapshots, with a reduced performance penalty, and other goodies; if used with a big enough chunk size, performance is good (see the sketch after this list)

  • consider using ZFS (in its ZoL incarnation): especially if your data is compressible, it can provide significant space and performance advantages. Moreover, as your workload seems read-heavy, the ZFS ARC can be more efficient than the traditional Linux page cache (again, see the sketch after this list)
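
Rough sketches of the thin-LVM and ZFS options above (pool/volume names, sizes and properties are placeholders, not recommendations):

    # Thin LVM on top of the md array:
    vgcreate datavg /dev/md0
    lvcreate --type thin-pool -l 95%FREE --chunksize 1m -n pool datavg
    lvcreate --type thin -V 8T --thinpool pool -n data datavg
    mkfs.xfs /dev/datavg/data
    lvcreate -s -n data_snap datavg/data   # modern (thin) snapshot, no size needed

    # ZFS-on-Linux alternative: striped pool, no redundancy, lz4 compression:
    zpool create -o ashift=12 tank /dev/sd[b-y]
    zfs set compression=lz4 tank
    zfs set atime=off tank
    zfs create tank/data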

If your data does not compress well but is deduplication-friendly, you can consider inserting VDO between the RAID block device and the filesystem.
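
If you explore that route, the classic vdo manager CLI looks roughly like this (a sketch; names and the logical size are placeholders, and newer distributions integrate VDO into LVM instead):

    # Deduplicating/compressing layer between the md array and the filesystem:
    vdo create --name=vdo0 --device=/dev/md0 --vdoLogicalSize=20T
    mkfs.xfs -K /dev/mapper/vdo0    # -K skips discards at mkfs time, much faster on VDO
    mount -o discard /dev/mapper/vdo0 /srv/data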

Finally, please consider that with any sort of LVM, JBOD or ZFS pooling, losing a disk does not just take offline the directories located on that disk; rather, the entire virtual block device becomes unavailable. To have that sort of isolation, you need to lay down a filesystem on each block device: this means you must manage the various mount points and, more importantly, that your storage is not pooled (i.e. you can run out of space on one disk while the others have plenty of free space).