Distributed Storage Filesystem - Which one / Is there a ready-to-use product?
With Hadoop and CouchDB all over blogs and related news, what's a distributed, fault-tolerant storage (engine) that actually works?
- CouchDB doesn't actually have any distribution features built in; to my knowledge, the glue to automagically distribute entries or even whole databases is simply missing.
- Hadoop seems to be very widely used - at least it gets good press - but it still has a single point of failure: the NameNode. Plus, it's only mountable via FUSE, and as I understand it, HDFS isn't actually the main goal of Hadoop.
- GlusterFS does have a shared-nothing concept, but lately I've read several posts that have led me to believe it's not quite stable.
- Lustre also has a single point of failure, as it uses a dedicated metadata server.
- Ceph seems to be the player of choice, but the homepage states it is still in its alpha stage.
So the question is: which distributed filesystem has the following feature set (in no particular order)?
- POSIX-compatible
- easy addition/removal of nodes
- shared-nothing concept
- runs on cheap hardware (AMD Geode or VIA Eden class processors)
- authentication/authorization built-in
- a network file system (I'd like to be able to mount it simultaneously on different hosts)
Nice to have:
- locally accessible files: I can take a node down, mount its partition with a standard local filesystem (ext3/XFS/whatever...), and still access the files
I'm not looking for hosted applications, but rather something that will allow me to take, say, 10 GB from each of our hardware boxes and have that storage available on our network, easily mountable on a multitude of hosts.
Solution 1:
I think you'll have to abandon the POSIX requirement; very few systems implement it - in fact, even NFS doesn't really (think locks, etc.), and NFS has no redundancy.
Any system which uses synchronous replication is going to be glacially slow; any system which has asynchronous replication (or "eventual consistency") is going to violate POSIX rules and not behave like a "conventional" filesystem.
Solution 2:
I can't speak to the rest, but you seem to be confusing a 'distributed storage engine' and a 'distributed file system'. They are not the same thing, they shouldn't be mistaken for the same thing, and they will never be the same thing. A filesystem is a way to keep track of where things are located on a hard drive. A storage engine like Hadoop is a way to keep track of a chunk of data identified by a key. Conceptually, there's not much difference. The problem is that a filesystem is a dependency of a storage engine... after all, it needs a way to write to a block device, doesn't it?
All that aside, I can speak to the use of ocfs2 as a distributed filesystem in a production environment. If you don't want the gritty details, stop reading after this line: It's kinda cool, but it may mean more downtime than you think it does.
We've been running ocfs2 in a production environment for the past couple of years. It's OK, but it's not great for a lot of applications. You should really look at your requirements and figure out what they are -- you might find that you have a lot more latitude for faults than you thought you did.
As an example, ocfs2 has a journal for each machine in the cluster that's going to mount the partition. So let's say you've got four web machines, and when you make that partition using mkfs.ocfs2, you specify that there will be six machines total to give yourself some room to grow. Each of those journals takes up space, which reduces the amount of data you can store on the disks. Now, let's say you need to scale to seven machines. In that situation, you need to take down the entire cluster (i.e. unmount all of the ocfs2 partitions) and use the tunefs.ocfs2 utility to create an additional journal, provided that there's space available. Then and only then can you add the seventh machine to the cluster (which requires you to distribute a text file to the rest of the cluster unless you're using a utility), bring everything back up, and then mount the partition on all seven machines.
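To make the journal/slot mechanics concrete, here's a rough sketch of the commands involved. The device path, label, and slot counts are illustrative placeholders, not anything from our setup:

```
# Format the shared volume with 6 node slots (one journal per slot);
# the device and label here are just placeholders.
mkfs.ocfs2 -N 6 -L shared_data /dev/sdb1

# Later, growing to a 7th machine: unmount the volume on every node first,
# then add node slots (only works if there's free space for the new journal).
tunefs.ocfs2 -N 7 /dev/sdb1
```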
See what I mean? It's supposed to be high availability, which is supposed to mean 'always online', but right there you've got a bunch of downtime... and god forbid you're crowded for disk space. You DON'T want to see what happens when you crowd ocfs2.
Keep in mind that evms, which used to be the 'preferred' way to manage ocfs2 clusters, has gone the way of the dodo in favor of clvmd and lvm2. (And good riddance to evms.) Also, heartbeat is quickly turning into a zombie project in favor of the openais/pacemaker stack. (Aside: when doing the initial cluster configuration for ocfs2, you can specify 'pcmk' as the cluster engine instead of heartbeat. No, this isn't documented.)
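Purely as an illustration (and something I haven't verified across every ocfs2-tools release): newer versions of mkfs.ocfs2 expose the stack choice directly via --cluster-stack/--cluster-name; on older tools you end up setting this during the o2cb/cluster configuration instead. The cluster name and device below are made up:

```
# Hypothetical invocation -- assumes an ocfs2-tools version that supports
# --cluster-stack / --cluster-name; older releases configure this elsewhere.
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=webcluster -N 6 /dev/sdb1
```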
For what it's worth, we've gone back to NFS managed by pacemaker, because the few seconds of downtime or the few dropped TCP packets as pacemaker migrates an NFS share to another machine are trivial compared to the amount of downtime we were seeing for basic shared-storage operations like adding machines when using ocfs2.
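If it helps, here's a minimal sketch of the kind of pacemaker setup we mean, using the crm shell and the stock Filesystem/exportfs/IPaddr2 resource agents. The device, export path, client network, and virtual IP are made-up examples, not our actual configuration:

```
# Failover NFS export: filesystem + export + floating IP grouped together
# so pacemaker moves them as one unit (all names/paths/IPs illustrative).
crm configure primitive p_fs ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/srv/export" fstype="ext3"
crm configure primitive p_export ocf:heartbeat:exportfs \
    params directory="/srv/export" clientspec="10.0.0.0/24" options="rw" fsid="1"
crm configure primitive p_vip ocf:heartbeat:IPaddr2 \
    params ip="10.0.0.100" cidr_netmask="24"
crm configure group g_nfs p_fs p_export p_vip
```

Clients mount the share via the virtual IP, so when pacemaker migrates the group to another node they see at most a brief stall rather than a full outage.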
Solution 3:
I may be misunderstanding your requirements, but have you looked at http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_file_systems ?