MogileFS/GlusterFS/etc + Amazon EBS + Amazon EC2

I have a web application that serves binary files (images, etc.). Our application runs on Amazon EC2. We were originally going to use Amazon S3 to store and serve these files, but this is no longer an option.

We need to transfer these files over HTTPS using a CNAME. This is obviously impossible with Amazon S3 for many technical reasons. Amazon offers Elastic Block Storage (EBS), which allows you to mount a block device of up to 1TB to a single instance. We will have multiple instances accessing this data in parallel.

What I was thinking is to use a distributed file system like MogileFS/GlusterFS/[insert-more-here] on top of Elastic Block Storage (EBS).

So my question is: What are others currently doing to create a scalable (a few hundred TBs), redundant file storage system on Amazon EC2 without using Amazon S3? Data will still be backed up to Amazon S3, but all reads would come off the file system.

Thanks in advance. If anyone needs clarification on anything, please feel free to ask.


Solution 1:

At Azouk (formerly linked domain dormant/parked) we don't use Amazon EC2, but we do use GlusterFS (1.4.0qa92) for serving all content like PDFs, user files, and thumbnails, and also for offline data analysis. IMHO there should be no problem deploying the same architecture on Amazon's cloud; we already make heavy use of virtualization (OpenVZ in particular). The only potential constraint is mounting GlusterFS via FUSE (the virtualization layer could forbid this), but AFAIK it's possible on Amazon.
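For reference, a minimal GlusterFS 1.x client setup looks roughly like the sketch below. The server hostname and subvolume name are placeholders, and the exact volfile syntax varies between 1.x releases, so treat this as illustrative rather than a drop-in config:

```
# client.vol -- minimal GlusterFS 1.x client volume file (sketch)
volume remote
  type protocol/client
  option transport-type tcp
  option remote-host gluster-server   # placeholder: your GlusterFS server
  option remote-subvolume brick       # placeholder: exported subvolume name
end-volume
```

The client then mounts the volume through FUSE, something along the lines of `glusterfs -f client.vol /mnt/gluster`; this is the step that requires FUSE support in the (virtualized) kernel.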

So, I recommend Gluster and sorry I can't help specifically with Amazon :)

Solution 2:

A terribly old question that suddenly bubbled up on the front page again... :-)

So my question is: What are others currently doing to create a scalable (a few 100TBs) file storage system over Amazon EC2 without using Amazon S3 that's redundant?

Nothing. On AWS you would use S3 for 100 TBs of BLOB storage; anything else would be nonsensical.

We need to transfer these files over HTTPS using a CNAME. This is obviously impossible with Amazon S3 for many technical reasons.

True, but it is possible by other means.

Since you need HTTPS access on your own domain name, you would set up a couple of HTTPS servers (or proxies) on EC2 nodes, to act as SSL encryption/decryption gateways between the Internet and S3.

I have never worked with Apache Traffic Server (formerly Inktomi), but it looks like a great fit for this. Otherwise nginx or Apache could be used for the SSL handling, together with Squid or Varnish if you want caching.

At high level, the request-response looks something like this:

Internet request via https -->
(optional) Elastic Load Balancing -->
EC2 instance with SSL-capable HTTP proxy (e.g. nginx) -->
plain unencrypted http to S3

In addition, you'll need a deterministic way to handle URL rewriting, e.g. https://secure.yourdomain.com/<id> is rewritten to http://<bucket>.s3.amazonaws.com/<id>
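As a sketch, the SSL gateway plus rewrite step could look like this in nginx. The domain, bucket name, and certificate paths are all placeholders, and this is illustrative only, not a hardened production config:

```nginx
# HTTPS gateway: terminate SSL on EC2, proxy plain unencrypted HTTP to S3
server {
    listen 443 ssl;
    server_name secure.yourdomain.com;          # placeholder domain

    ssl_certificate     /etc/nginx/ssl/secure.yourdomain.com.crt;
    ssl_certificate_key /etc/nginx/ssl/secure.yourdomain.com.key;

    location / {
        # https://secure.yourdomain.com/<id> -> http://your-bucket.s3.amazonaws.com/<id>
        proxy_pass http://your-bucket.s3.amazonaws.com;

        # S3 routes requests by the Host header, so it must name the bucket
        proxy_set_header Host your-bucket.s3.amazonaws.com;
    }
}
```

If you want caching of hot objects, Squid or Varnish would sit between this gateway and S3, as mentioned above.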