Distributed, Parallel, Fault-tolerant File System

There are so many choices that it's hard to know where to start. My requirements are these:

  • Runs on Linux
  • Most of the files will be between 5 and 9 MB in size. There will also be a significant number of smallish JPEGs (around 100 x 100 px).
  • All of the files need to be available over http.
  • Redundancy -- ideally it would provide space efficiency similar to RAID 5, i.e. around 75% (with 4 identical disks, one disk's worth of capacity goes to parity, so 25% is overhead => 75% efficient; see the sketch just below this list)
  • Must support several petabytes of data
  • Scalable
  • Runs on commodity hardware
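
To make the efficiency target concrete, here's a quick sketch of the arithmetic I have in mind (plain Python; the disk counts are purely illustrative):

    def parity_efficiency(total_disks: int, parity_disks: int = 1) -> float:
        """Usable fraction of raw capacity in a RAID-5-style parity layout."""
        return (total_disks - parity_disks) / total_disks

    print(parity_efficiency(4))  # 0.75  -> the 75% figure above
    print(parity_efficiency(8))  # 0.875 -> wider stripes waste even less space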

In addition, I look for these qualities, though they are not "requirements":

  • Stable, mature file system
  • Lots of momentum and support
  • etc

I would like some input as to which file system works best for the given requirements. Some people at my organization are leaning towards MogileFS, but I'm not convinced of the stability and momentum of that project. GlusterFS and Lustre, based on my limited research, appear to be better supported...

Thoughts?


Solution 1:

If it were me, I would be using GlusterFS. The current release is pretty solid, and I know people at some very large installations in both the HPC and Internet space who rely on it in production. You can tailor it to your needs by laying out the components however you need them. Unlike Lustre, it has no dedicated metadata servers, so central points of failure are minimized and the setup is easier to scale.

Unfortunately, I don't think there's an easy way to meet your 75% efficiency target without throwing performance down the drain.

It does run on commodity hardware, but performance really shines when using an InfiniBand interconnect. Fortunately, the price of IB is quite low these days.

You might want to check out the guys at Scalable Informatics and their Jackrabbit products as a solution. They support GlusterFS on their hardware, and the price of their solution certainly rivals the cost of putting something together from scratch.

Solution 2:

Actually, I don't think there are that many realistic options. In order of preference my picks would be:

  1. Amazon S3. Meets all your requirements, and your optional qualities too. It has a very good track record of uptime and support. It is not in-house, but is that really a hard requirement? You could work around it, e.g. with VPN access or just good old HTTPS... S3 would really be my first choice, if the WAN latency and Amazon's pricing work for you. And if the pricing doesn't work for you, well, I doubt a DIY solution will really end up significantly less expensive... (See the quick sketch right after this list for how the HTTP requirement would be covered.)
  2. MogileFS seems to fit your requirements perfectly. There is not that much activity around MogileFS, but that's mostly because it's working as intended for its (relatively few) users.
  3. Lustre has really great technology behind it, is a regular local POSIX filesystem (if that is beneficial for you), and has been continuously updated over the years. The big question is whether the whole Sun - Oracle merger will impact Lustre. Long-term, if Sun plays its cards right, then having ZFS and Lustre under one roof could lead to very nice things... Right now, I think Lustre is mostly used in academic and commercial HPC initiatives and not in Internet applications -- this may be untrue, but if Lustre is doing well in Internet applications then they are sure not marketing that fact well...

Hadoop Distributed File System (HDFS) would not match your requirements, IMHO. HDFS is awesome, but it is built for large batch workloads rather than general-purpose file serving, which makes it less accessible than the options above. Of course, if you're really looking for massive scalability and a long-term perspective, then HDFS may be just right -- Yahoo, Facebook and others are invested in Hadoop's growth.

One comment: most of the above systems copy the whole file to 2-3 nodes to achieve redundancy. This takes up much more space than parity encoding / RAID schemes (with 3 full copies, only about a third of the raw space is usable), but it is manageable at scale, and it seems to be the solution everyone has taken. So you will not get the 75% efficiency that you mention...
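
To put numbers on that, a quick back-of-the-envelope comparison (plain Python; the 2 PB of user data is just an example figure):

    def raw_capacity_needed(usable_pb: float, efficiency: float) -> float:
        """Raw disk capacity (in PB) needed to hold usable_pb of data,
        given the usable fraction of raw space."""
        return usable_pb / efficiency

    usable = 2.0  # PB of actual files, purely illustrative

    print(raw_capacity_needed(usable, 1 / 3))  # 6.0 PB raw with 3 full copies
    print(raw_capacity_needed(usable, 0.75))   # ~2.7 PB raw at the RAID-5-like 75%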

Solution 3:

It may not meet all your requirements, particularly space efficiency: by default it splits each file into 10 shares, any 3 of which are enough to recover it, so only about 30% of the raw space is usable (though these parameters can be tuned). Still, you might want to take a look at Tahoe-LAFS.
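
To see how the encoding parameters drive the space efficiency, here's a small sketch (plain Python; if I remember correctly the matching tahoe.cfg options are shares.needed and shares.total, but check the docs):

    def erasure_efficiency(shares_needed: int, shares_total: int) -> float:
        """Usable fraction of raw space with k-of-n erasure coding:
        every file is expanded by a factor of shares_total / shares_needed."""
        return shares_needed / shares_total

    print(erasure_efficiency(3, 10))  # 0.3 -> the default 3-of-10 encoding
    print(erasure_efficiency(8, 10))  # 0.8 -> less redundancy, more usable space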

It's primarily designed for backup, but people have built (and are still building) a lot of very interesting non-backup apps on top of it. One of the developers hosts his blog on it, for instance.

GPL, written in Python. It's already included in Ubuntu, IIRC.

Solution 4:

Moose File System seems to fit your requirements. It runs on Linux (for example Ubuntu or Debian) and on commodity hardware. A single installation can scale up to 16 exabytes (~16,000 petabytes). Moreover, it's a mature (first released in 2008), fault-tolerant distributed file system with great support.