Which Distributed File System as a backend for Cloud Computing? [duplicate]
While I haven't personally implemented it anywhere in our systems, I have looked pretty extensively at Gluster. I know a few people at some large sites who use it, and it apparently works really well. They run it in production for some heavy-duty HPC applications.
GlusterFS would seem like the ideal solution to me. To the guy who claims that Gluster takes lots of effort to set up, I've got to say that he's probably never tried. As of Gluster 3.2 the configuration utilities are pretty awesome, and it takes 2 or 3 commands to get a Gluster volume up and shared on the network. Mounting Gluster volumes is equally simple.
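To give a rough idea of what those "2 or 3 commands" look like, here is a small Python sketch that only builds and prints the usual `gluster` CLI invocations (peer probe, volume create, volume start) plus the mount command. It does not run anything. The host names, brick paths, volume name and mount point are made up for illustration, and the helper function names are hypothetical, not part of Gluster.

```python
import shlex


def gluster_setup_commands(volume, bricks, replica=2):
    """Build the CLI commands (as argument lists) for a replicated volume.

    `bricks` are "host:/path" strings. The first host is assumed to be the
    machine the commands are run on, so only the other hosts are probed.
    """
    hosts = []
    for brick in bricks:
        host = brick.split(":", 1)[0]
        if host not in hosts:
            hosts.append(host)

    commands = [["gluster", "peer", "probe", h] for h in hosts[1:]]
    commands.append(
        ["gluster", "volume", "create", volume, "replica", str(replica), *bricks]
    )
    commands.append(["gluster", "volume", "start", volume])
    return commands


def mount_command(server, volume, mountpoint):
    """Build the client-side mount command for a Gluster volume."""
    return ["mount", "-t", "glusterfs", f"{server}:/{volume}", mountpoint]


if __name__ == "__main__":
    # Hypothetical two-node setup; adjust names and paths to your environment.
    bricks = ["server1:/export/brick1", "server2:/export/brick1"]
    commands = gluster_setup_commands("gv0", bricks)
    commands.append(mount_command("server1", "gv0", "/mnt/gluster"))
    for cmd in commands:
        print(shlex.join(cmd))
```

Check the printed commands against the documentation for your Gluster version before running them as root on the relevant machines.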
On the plus side, it also gives you a lot more flexibility than NFS. It does striping, replication and geo-replication, is of course POSIX compliant, and so on. There is an extension called HekaFS, which adds SSL and more advanced authentication mechanisms, which is probably interesting for cloud computing. Also, it scales! It is F/OSS and is being developed by Red Hat, who recently purchased Gluster.
Have you ever looked at MogileFS? http://danga.com/mogilefs/
It's not a file system in the traditional sense, but it is good for distributing file data across a cluster (with replication and redundancy taken into account).
If you're serving up files for a web application, you will need something to serve the files. I would suggest a PHP script that uses the HTTP request as the search key for finding the file you want in MogileFS. You can then read the contents of the file into a buffer and echo/print it out.
MogileFS is already pretty quick, but you can combine it with memcached to speed up access to the most commonly used files.
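As a rough illustration of that lookup-plus-cache idea, here is a minimal sketch. It is in Python rather than PHP, and it does not use any real MogileFS or memcached client: `fetch_from_store` is a hypothetical stand-in for whatever call reads a key from MogileFS, and a plain dict stands in for memcached. Both are assumptions made only to keep the example self-contained.

```python
from typing import Callable, Dict, Optional


def serve_file(
    request_path: str,
    fetch_from_store: Callable[[str], Optional[bytes]],
    cache: Dict[str, bytes],
) -> Optional[bytes]:
    """Return file contents for a request path, checking the cache first.

    The request path (minus leading slashes) is used as the storage key.
    Returns None when the key is not in the store.
    """
    key = request_path.lstrip("/")

    # Cache hit: skip the storage backend entirely.
    if key in cache:
        return cache[key]

    # Cache miss: read from the backend and remember the result.
    data = fetch_from_store(key)
    if data is not None:
        cache[key] = data
    return data


if __name__ == "__main__":
    # Fake backend standing in for MogileFS; it records every lookup.
    backend = {"images/logo.png": b"fake image bytes"}
    lookups = []

    def fake_fetch(key: str) -> Optional[bytes]:
        lookups.append(key)
        return backend.get(key)

    cache: Dict[str, bytes] = {}
    first = serve_file("/images/logo.png", fake_fetch, cache)
    second = serve_file("/images/logo.png", fake_fetch, cache)
    print(first == second, len(lookups))
```

In a real setup you would also want an expiry time on cached entries and a size limit, so that large or rarely used files don't crowd out the popular ones.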
With Lustre you have to run a special kernel on the servers, and I would dedicate those machines to being Lustre servers and nothing else.
Strangely, the most sane answer may well be NFS. We have used NFS on Amazon's cloud. It may not scale as well as some file systems, but the simplicity should not be overlooked. A single namespace is probably not worth the effort it would take to implement.
Are you still looking into HDFS? One of the Cloudera guys gave a talk at VelocityConf this year about Hadoop and HDFS. It focused on managing big data clusters, so he talked about HDFS quite a bit. The slides are pretty informative. I haven't worked with HDFS personally, but I talked with some random folks at Velocity who are using it on Ubuntu to do various data analysis.
- Slides
- Talk info