Setting up a Rails 2.3.x app on EC2 for easy scalability
Solution 1:
So, most of what you've got there is pretty straightforward to scale. Move PostgreSQL onto its own machine to relieve a pile of CPU/disk IO/memory consumption as a first step, and give Sphinx its own little world too. Definitely switch to Resque to allow easy horizontal scaling of your workers.
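As a rough sketch of what the Resque switch looks like (the job name and processing step here are made up, not from your app), a worker box only needs the gem, a Redis connection, and `rake resque:work`:

```ruby
# app/jobs/image_resize_job.rb -- hypothetical job name
require 'resque'

class ImageResizeJob
  # Workers can be pointed at specific queues, so the image-processing
  # fleet scales independently of everything else.
  @queue = :images

  def self.perform(image_id)
    image = Image.find(image_id)   # assumes an Image AR model
    image.generate_thumbnails!     # stand-in for your real processing step
  end
end

# Enqueue from the web tier:
#   Resque.enqueue(ImageResizeJob, image.id)
# Run workers on as many boxes as you like:
#   QUEUE=images rake resque:work
```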
But the files... yes, the files are difficult. They're always difficult. And your options are perilously thin.
Some people will recommend the clustered filesystem route, either magic superclustering (GFS2/OCFS2) or the slightly poorer man's option of something like GlusterFS. I've run a lot of systems (1000+) using GFS, and I'd never, ever do it again. (It may not even run on EC2's network, actually.) GFS2/OCFS2 are a mess of moving parts, under-documented, overcomplicated, and prone to confusing failure modes that just give you downtime hassle. They also don't perform worth a damn, especially in a write-heavy environment -- the cluster just falls over, taking everything down with it and taking 10-30 minutes of guru-level work to get running again. Avoid it, and your life is a lot easier. I've never run GlusterFS, but that's because I've never been particularly impressed with it. Anything you can do with it, there's usually a better way to do anyway.
A better option, in my opinion, is the venerable NFS server. One machine with a big (or not so big) EBS volume and an NFS daemon running, and everyone mounts it. It has its gotchas (it's not really a POSIX filesystem, so don't treat it as one), but for simple "there, I fixed it" operation in 99% of use cases, it's not bad. At the very least, it can be a stopgap while you work on a better solution, which is...
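A minimal sketch of what that looks like from the app's side, assuming Paperclip-style attachments and an NFS export mounted at /mnt/shared on every app server (both of which are assumptions, not from your question):

```ruby
# Every app server mounts the same export, e.g.
#   mount -t nfs fileserver:/export/uploads /mnt/shared
# After that, attachments write through it as if it were local disk.
class Photo < ActiveRecord::Base
  has_attached_file :image,
    :path => "/mnt/shared/photos/:id/:style/:filename",  # lives on the NFS mount
    :url  => "/system/photos/:id/:style/:filename"
end
```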
Use your knowledge of your application to tier out your storage. This is the approach I took most recently (scaling Github), and it's worked beautifully. Basically, rather than treating file storage as a filesystem, you treat it like a service -- provide an intelligent API that the file-storage-using parts of your application call to do what they need. In your case, you might just need to be able to store and retrieve images by pre-allocated IDs (the PK of your "images" table in the DB, for instance). That doesn't need a whole POSIX filesystem, it just needs a couple of super-optimised HTTP methods (the POSTs need to be handled dynamically, but if you're really smart you can make the GETs come straight off disk as static files). Hell, you're probably serving those images straight back to customers, so cut out the middle man and make that server your publicly accessible assets server while you're at it.
The workflow might then be something like:
- Frontend server gets image
- POSTs it into the fileserver
- Adds job to get image processed
- (Alternately, the POST to the fileserver causes it to recognise the need for a post-processing job, and it creates the job all by itself)
- Worker gets image processing job
- Retrieves image from fileserver
- Processes image
- POSTs the processed image back to the fileserver
- Webpage needs to include image in webpage
- Writes URL to images server into HTML
- web browser goes and gets image directly
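Here's a minimal sketch of that flow, assuming a plain HTTP fileserver at fileserver.internal that stores and serves blobs by ID (the hostname, paths, and processing step are all illustrative, not part of your setup):

```ruby
require 'net/http'

# Thin client around the hypothetical fileserver's two operations:
# store a blob under a pre-allocated ID, and fetch it back.
class FileStore
  HOST = 'fileserver.internal'
  PORT = 80

  def self.put(image_id, data)
    Net::HTTP.start(HOST, PORT) do |http|
      http.post("/images/#{image_id}", data,
                'Content-Type' => 'application/octet-stream')
    end
  end

  def self.get(image_id)
    Net::HTTP.start(HOST, PORT) do |http|
      http.get("/images/#{image_id}").body
    end
  end
end

# Worker side of the flow: fetch the original, process it, push the result back.
class ProcessImageJob
  @queue = :images

  def self.perform(image_id)
    raw       = FileStore.get(image_id)
    processed = ImageProcessor.resize(raw)           # stand-in for your real processing
    FileStore.put("#{image_id}-processed", processed)
  end
end
```

The frontend then only ever writes the image's URL on the assets server into the HTML, and browsers fetch it directly without touching your app servers.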
You don't necessarily have to use HTTP POST to put the images onto the server, either -- Github, for instance, talks to its fileservers using Git-over-SSH, which works fine for them. The key, though, is to put more thought into where work has to be done, and avoid unnecessary use of scarce resources (network IO for an NFS server request), trading instead for a more constrained set of use options ("You can only request whole files atomically!") that work for your application.
Finally, you're going to have to consider the scalability factor. If you're using an intelligent transport, though, that's easy -- add the (small amount of) logic required to determine which fileserver you need to talk to in each place that talks to the fileserver, and you've basically got infinite scalability. In your situation, you might realise that your first fileserver is going to be full at, say, 750,000 images. So your rule is "talk to the fileserver with hostname fs#{image_id / 750_000}", which isn't hard to encode everywhere.
I always used to have fun with this part of my job...