MogileFS/GlusterFS/etc + Amazon EBS + Amazon EC2

I have a web application that serves binary files (images, etc.). Our application runs on Amazon EC2. We were originally going to use Amazon S3 to store and serve these files, but this is no longer an option.

We need to transfer these files over HTTPS using a CNAME. This is obviously impossible with Amazon S3 for many technical reasons. Amazon offers Elastic Block Storage (EBS), which allows you to mount a block device of up to 1TB to a single instance. We will have multiple instances accessing this data in parallel.

What I was thinking is to use a distributed file system like MogileFS/GlusterFS/[insert-more-here] on top of Elastic Block Storage (EBS).

So my question is: What are others currently doing to create a scalable (a few hundred TBs), redundant file storage system on Amazon EC2 without using Amazon S3? Data will still be backed up to Amazon S3, but all reads would come off the file system.

Thanks in advance. If anyone needs clarification on anything, please feel free to ask.


Solution 1:

At Azouk (formerly linked domain dormant/parked) we don't use Amazon EC2, but we do use GlusterFS (1.4.0qa92) for serving all content like PDFs, user files, and thumbnails, and also for offline data analysis. IMHO there should be no problem deploying the same architecture on Amazon's cloud; we already make heavy use of virtualization (OpenVZ in particular). The only potential constraint is mounting GlusterFS via FUSE (the virtualization layer could forbid this), but AFAIK it's possible on Amazon.
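For reference, a minimal GlusterFS 1.x client setup looks roughly like the sketch below. The server hostname and subvolume name are placeholders, and the exact volfile syntax varies between 1.x releases, so treat this as illustrative rather than a drop-in config:

```
# client.vol -- minimal GlusterFS 1.x client volume file (sketch)
volume remote
  type protocol/client
  option transport-type tcp
  option remote-host gluster-server   # placeholder: your GlusterFS server
  option remote-subvolume brick       # placeholder: exported subvolume name
end-volume
```

The client then mounts the volume through FUSE, something along the lines of `glusterfs -f client.vol /mnt/gluster`; this is the step that requires FUSE support in the (virtualized) kernel.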

So, I recommend Gluster and sorry I can't help specifically with Amazon :)

Solution 2:

A terribly old question that suddenly bubbled up on the front page again... :-)

So my question is: What are others currently doing to create a scalable (a few 100TBs) file storage system over Amazon EC2 without using Amazon S3 that's redundant?

Nothing. On AWS you would use S3 for 100 TBs of BLOB storage; anything else would be nonsensical.

We need to transfer these files over HTTPS using a CNAME. This is obviously impossible with Amazon S3 for many technical reasons.

True, but it is possible by other means.

Since you need HTTPS access on your own domain name, you would set up a couple of HTTPS servers (or proxies) on EC2 nodes, to act as SSL encryption/decryption gateways between the Internet and S3.

I have never worked with Apache Traffic Server (formerly Inktomi), but it looks like a great fit for this. Otherwise nginx or Apache could be used for the SSL handling, together with Squid or Varnish if you want caching.

At high level, the request-response looks something like this:

Internet request via https -->
(optional) Elastic Load Balancing -->
EC2 instance with SSL-capable HTTP proxy (e.g. nginx) -->
plain unencrypted http to S3

In addition, you'll need a deterministic way to handle URL rewriting, e.g. https://secure.yourdomain.com/<id> is rewritten to http://<bucket>.s3.amazonaws.com/<id>
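As a sketch, the SSL gateway plus rewrite step could look like this in nginx. The domain, bucket name, and certificate paths are all placeholders, and this is illustrative only, not a hardened production config:

```nginx
# HTTPS gateway: terminate SSL on EC2, proxy plain unencrypted HTTP to S3
server {
    listen 443 ssl;
    server_name secure.yourdomain.com;          # placeholder domain

    ssl_certificate     /etc/nginx/ssl/secure.yourdomain.com.crt;
    ssl_certificate_key /etc/nginx/ssl/secure.yourdomain.com.key;

    location / {
        # https://secure.yourdomain.com/<id> -> http://your-bucket.s3.amazonaws.com/<id>
        proxy_pass http://your-bucket.s3.amazonaws.com;

        # S3 routes requests by the Host header, so it must name the bucket
        proxy_set_header Host your-bucket.s3.amazonaws.com;
    }
}
```

If you want caching of hot objects, Squid or Varnish would sit between this gateway and S3, as mentioned above.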