150 TB and growing, but how to grow?

Solution 1:

I hope this is gonna help a little. I tried not to let it turn into a full wall of text. :)

3Par/Isilon

If you can and will dedicate a fixed amount of man-hours to someone who takes on the SAN admin role, and you want a painless life with sleep at night instead of work at night, then this is the way I'd go.

A SAN lets you do all the stuff where a single storage array would limit you (e.g. connect a PureStorage flash array and a big 3Par SATA monster to the same server), but you also have to pay for it and keep it well maintained the whole time if you want to make use of that flexibility.

Alternatives

Amplidata

Pros: Scales out, cheap, designed around a nice concept with dedicated read/write cache layers. This might actually be the best fit for you.

RisingTideOS

Their target software (LIO) is used in almost all Linux storage appliances now, and it allows for somewhat better management than plain Linux/Gluster setups do, imho. The commercial version might be worth a look.
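To give a flavor of that management, here's a minimal sketch of exporting a block device as an iSCSI LUN through their targetcli shell. The device path, backstore name, and IQN are made up for illustration:

    # back an iSCSI LUN with an existing block device (e.g. an LVM volume)
    targetcli /backstores/block create name=store0 dev=/dev/vg0/lun0
    # create a target (the IQN here is made up)
    targetcli /iscsi create iqn.2013-01.com.example:target1
    # attach the backstore as LUN 0 on the default portal group
    targetcli /iscsi/iqn.2013-01.com.example:target1/tpg1/luns create /backstores/block/store0
    targetcli saveconfig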

Gluster/btrfs

Pro: Scales out, and "bricks" give you an abstraction layer that is very good for management.

Con: Gluster has been a total PITA for me. It was not robust, and failures could be either local to one brick or take out everything. Now that Red Hat is in control it might actually turn into something that works, and I've even met people who can tame it so that it runs for years. Btrfs is still half-experimental: normally a filesystem needs 3-4 years after it's "done" until it's proven and robust. If you care about the data, why would you ever consider it? Speaking of experimental, commercial Ceph support is almost here, but you'd need to stick to the RBD layer; the POSIX filesystem on top is just not well-tested enough yet. I want to make it clear, though, that Ceph is much more attractive in the long run. :)
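For what it's worth, sticking to the RBD layer looks roughly like this. It assumes a running Ceph cluster, and the image name and size are placeholders:

    # create an image in the default 'rbd' pool; size is in MB
    rbd create bigdisk --size 102400
    # map it as a local block device (needs the rbd kernel module)
    rbd map bigdisk
    # put a proven filesystem on top -- no CephFS involved
    mkfs.xfs /dev/rbd0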

ZFS

Pro: Features that definitely put a nail in other systems' coffins. They are well designed (think L2ARC), and compression/dedup is fun. You can also run several smaller "storage clusters", which means small, contained failures instead of one large consolidated boom. (See the sketch after the cons.)

Con: You're maintaining many small software boxes instead of one real storage array. You need to integrate them and spend $$$-hours to get a robust setup.
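A minimal sketch of those features on one box; the device names are placeholders, and dedup assumes you have the RAM for the dedup table:

    # six-disk raidz2 pool
    zpool create tank raidz2 sda sdb sdc sdd sde sdf
    # add an SSD as a dedicated L2ARC read cache
    zpool add tank cache sdg
    # compression is close to free on modern CPUs
    zfs set compression=on tank
    # dedup eats RAM for the dedup table -- measure before enabling
    zfs set dedup=on tank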

Solution 2:

The XFS + LVM route has indeed been one of the best options for a scaled-out, pure-Linux storage solution over the past few years. I'm encouraged you're there already. Now that you need to grow further, you have a few more options available to you.

As you know, the big hardware vendors have NAS heads for their storage. A NAS head would give you a single vendor to work with to make it all happen, and it would work pretty well. They're easy solutions to get in (compared to DIY), and their maintenance burden is lower. But they cost quite a lot. On the one hand, you'd free up engineering resources for your main problems rather than infrastructure problems; on the other hand, in most university departments I've known, man-power is really cheap relative to paying cash for things.

If you go the DIY route, you already have a good appreciation of the options available to you. ZFS/BTRFS are the obvious upgrade paths from XFS + LVM for scaled-out storage. I'd steer clear of BTRFS until it's declared 'stable' in the mainline Linux kernel, which should be pretty soon now that several of the major free distros are using it as the default filesystem. For ZFS, I'd recommend a BSD base rather than OpenIndiana, simply because it's been around longer and has the kinks (more) worked out.

Gluster was designed for the use-case you describe here. It can do replication as well as present a single virtual server with lots of storage attached. Their Distributed Volumes sound like exactly what you're looking for: they spread files over all the storage servers in the declared volume, and you can keep adding discrete storage servers to keep expanding the visible volume. Single namespace!
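A sketch of what that looks like; the hostnames, brick paths, and volume name are placeholders:

    # distributed volume: files are spread across both bricks
    gluster volume create bigvol server1:/export/brick1 server2:/export/brick1
    gluster volume start bigvol
    # later, grow the same namespace with another server
    gluster volume add-brick bigvol server3:/export/brick1
    # spread existing files onto the new brick
    gluster volume rebalance bigvol start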

The gotcha with Gluster is that it works best when your clients use the Gluster client to access the system, rather than the CIFS or NFS options. Since you're running a small compute cluster, you may just be able to use the GlusterFS client.
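Mounting with the native client is a one-liner, assuming the FUSE client packages are installed (names again are placeholders):

    # native FUSE client -- talks to all bricks directly
    mount -t glusterfs server1:/bigvol /mnt/bigvol
    # NFSv3 fallback via Gluster's built-in NFS server
    mount -t nfs -o vers=3 server1:/bigvol /mnt/bigvol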

You're on the right track here.

Solution 3:

As far as I understand it, you could use a SAN solution based on Linux SCST plus FibreChannel or InfiniBand, which is something I'm building right now. As a base for the LUNs you could use LVM on top of hardware RAIDs, and take care of snapshots/replication (take DRBD as an example) below the filesystem level. As a filesystem, I'm not aware of any good solution for concurrency; I'm putting ESXi on top of the nodes, so the datastores are managed by ESXi's concurrent filesystem (VMFS). I think GFS2 might work in that environment, but I'm not 100% sure; you should check your precise requirements. Anyway, once you have a robust SAN underneath your nodes, it's pretty easy to get things done.
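To illustrate the LVM layer of that stack (device and volume names are placeholders, and the SCST export itself is configured separately):

    # the hardware RAID shows up as one big device, e.g. /dev/sdb
    pvcreate /dev/sdb
    vgcreate san_vg /dev/sdb
    # carve out a LUN, to be exported by SCST over FC/IB
    lvcreate -L 2T -n lun0 san_vg
    # snapshot below the filesystem level, e.g. before maintenance
    lvcreate -s -L 100G -n lun0_snap /dev/san_vg/lun0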