Best way to load balance across multiple static file servers for even bandwidth distribution?
First off, I'll explain my situation. I'm running a fairly popular website as a side project, so I can't really invest a ton of money into it. I currently have just one server with HAProxy in front, sending normal requests to Apache and all static file requests to Lighttpd. This is working really well because all PHP and POST requests get handled by Apache, while all images are served by the faster Lighttpd (the site is mostly images, so this is really important). It would be nice to not have to set up a subdomain for serving the images, because short URLs are also really important, which is my reason for using HAProxy.
I've found a hosting provider that offers pretty cheap unmetered bandwidth, which I've been using. The problem comes in when I start pushing out as much bandwidth as the 100 Mbit network card can handle, so I need a second server.
I've put a lot of thought into my options, so I'll explain each one to you. Hopefully you could provide some insight into which one is the best option for me, or maybe there's another option out there that I haven't thought of yet.
Requirements:
Even bandwidth distribution is a must. I already have a pretty powerful server, so scaling up is not an option; I need to scale out to gain more bandwidth.
Short URLs. I really don't want to set up a subdomain, like img.example.com, to serve my images. example.com/image.jpg is how it is now, and how I would really like it to stay. But if there's no other way, then I understand.
The closest server handling the request would be really nice, but it's not a must. Something to keep in mind.
HAProxy to load balance:
- It would be really easy to do since I'm already using HAProxy anyway. However, I think the problem comes in when distributing bandwidth. I might be wrong on this, but doesn't HAProxy send the request to a backend server, which processes it and then sends the response back through HAProxy to the client? That way, all traffic goes back out through the load balancer, making it use as much bandwidth as all the backend servers combined.
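For reference, here's a rough sketch of the kind of HAProxy split I'm describing (the ports, IPs and extension list are just placeholders, not my exact config):

```
frontend www
    bind *:80
    # Anything that looks like a static file goes to Lighttpd, the rest to Apache
    acl is_static path_end .jpg .jpeg .gif .png .css .js
    use_backend static if is_static
    default_backend dynamic

backend dynamic
    # Apache handles the PHP and POST requests
    server apache1 127.0.0.1:8080 check

backend static
    # Lighttpd serves the images; adding more servers here would balance
    # requests, but every response still flows back out through this box
    server lighty1 127.0.0.1:8081 check
```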
DNS Round Robin:
- This might be my best option. Just replicate the website across multiple servers and do what I'm doing now. The downside is that if one server goes down, clients are still sent to it. I would also need to replicate the site across all of the servers; I was kind of hoping I could have one main server that handles everything except static files, and then a couple of static file servers. I've also read that this is sort of the "poor man's load balancing", and it would be nice to have something a little more sophisticated.
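For illustration, round-robin DNS would just mean publishing multiple A records for the same name, something like this (made-up IPs and a made-up 300-second TTL):

```
; hypothetical BIND zone fragment -- the IPs and TTL are just examples
$TTL 300
example.com.    IN  A   203.0.113.10   ; server 1
example.com.    IN  A   203.0.113.11   ; server 2
example.com.    IN  A   203.0.113.12   ; server 3
```

Resolvers rotate the order of the answers, but there's no health checking, so a dead server keeps getting traffic until its record is removed and the TTL expires.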
Direct Server Return:
- It seems really complicated, but it might be a good option. Would I still be able to send certain URLs to certain servers? Right now with HAProxy, every URL that ends in the right file extension is sent to Lighttpd, while other extensions are sent to Apache, so I would need something similar: all PHP requests handled by the same server that runs the balancing software, while all JPG requests are sent out to multiple servers.
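From what I've read, DSR usually means something like LVS direct routing, which (if I understand it) is configured roughly like this (the virtual IP and real server addresses are made up):

```
# LVS direct-routing sketch; 203.0.113.10 is a made-up virtual IP,
# 10.0.0.2 and 10.0.0.3 are made-up real servers
ipvsadm -A -t 203.0.113.10:80 -s rr              # create the virtual HTTP service
ipvsadm -a -t 203.0.113.10:80 -r 10.0.0.2:80 -g  # -g = gatewaying / direct return
ipvsadm -a -t 203.0.113.10:80 -r 10.0.0.3:80 -g
# As far as I can tell this only balances on IP and port, so it can't look
# at URLs or file extensions the way HAProxy ACLs can
```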
Ideally, if HAProxy supported Direct Server Return, then my problem would be solved. I also do not want to use a CDN, because they're really expensive, and this is just a side project after all.
Do you understand my problem? Let me know if I didn't explain something right or if you need more info.
Solution 1:
Draw a picture of your request/response cycle for the application and isolate the bottleneck. You are correct that a single proxy distributing load to many application servers will require the aggregate bandwidth of all application servers. The classical solution is RR DNS. Google, Yahoo and Amazon all use this technique with a short TTL. I did some investigation a while back and documented my findings.
Another option is a fancy-pants enterprise load balancer that uses virtual IP addressing to balance requests among multiple application servers with real IP addresses. I have worked with Netscaler and Stonesoft products. Both perform well but have terrible idiosyncrasies and are quite complex.
Solution 2:
Some answers:
- Yes, all traffic passes back out through HAProxy, since it works as an HTTP-level proxy. This is the same even if HAProxy is installed on a separate server that load balances multiple back-end servers. So if your hosting provider only supplies 100 Mbit network ports, and you're already pushing 100 Mbit, then you have a problem.
- Regarding the domain, the optimal thing would be to serve images from a different domain than your webapp -- not a subdomain, a completely different domain -- so that cookies are not sent along with image requests. See Steve Souders' original work, or the implementation here on Stack Overflow. If short URLs are very important to you, maybe the best thing would be to move the webapp off the main URL, i.e. move the file management application to login.sitename.com?
Do you need authentication on the image requests? If not, how about using something like Amazon S3? It is massively scalable, and the data transfer cost is fairly cheap. In that case I would use something like i.sitename.com as a DNS CNAME for the Amazon S3 bucket hostname; see Amazon's docs. AFAIK you can't have the root domain name (sitename.com) as a CNAME, so you must use a subdomain like i.sitename.com for this.
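For example, assuming a bucket named i.sitename.com (the name is just an illustration), the DNS entry would be roughly:

```
; hypothetical zone entry pointing the subdomain at an S3 bucket of the same name
i.sitename.com.   IN  CNAME   i.sitename.com.s3.amazonaws.com.
```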
You could also hash your images across multiple servers, i.e. create a DNS structure like login.sitename.com plus a.sitename.com, b.sitename.com, c.sitename.com, et cetera. The "a." and "b." etc. servers just contain a file system with images and a lightweight HTTP server (you're already using Lighttpd, so continue using that; for a future project, I would propose looking at nginx as a better replacement). When a user uploads an image, you create a hash of a unique identifier -- perhaps the username, perhaps the filename, or a combination of both. From this hash, you determine which server to store the image on.
Edit: I should have seen that hashing was already discussed. Essentially what I'm proposing here is just to use hashing on the hostname as well, to spread network traffic evenly over multiple hosts.
I don't know how cheap you need this to be -- but when you're pushing 100 Mbit of network traffic, "cheap and good" quickly turns out to be an illusion. Maybe you should look at getting a good business model first, something that provides recurring revenue, and then implement the appropriate technology afterwards?
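A minimal sketch of what I mean, where the hostnames and the username+filename key are just assumptions for illustration:

```python
import hashlib

# Hostnames and the username+filename key are assumptions for illustration.
IMAGE_HOSTS = ["a.sitename.com", "b.sitename.com", "c.sitename.com"]

def image_host(username, filename):
    """Deterministically pick the static-file host for an upload."""
    key = "{}/{}".format(username, filename).encode("utf-8")
    digest = hashlib.md5(key).hexdigest()
    return IMAGE_HOSTS[int(digest, 16) % len(IMAGE_HOSTS)]

# image_host("alice", "cat.jpg") always returns the same host, so the upload
# and all later image URLs point at the same server.
```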
Solution 3:
I assume HAProxy is on the same server as your other applications? You could break HAProxy out onto another system to run the requests through and have it send normal requests to one server and image requests to another. The issue there is that all requests still go through one box, and if you're saturating its bandwidth then that may not help you much.
You say short URLs are important. Why? Is it really that big of a deal to switch images from "example.com" to "i.example.com"? You can point "i" to its own IP on its own server running Lighttpd and bypass HAProxy entirely, solving your throughput problem. You would also get the benefit of the browser allowing more requests at once, since it would consider them different domain names and open more concurrent connections. If the single "i" server got saturated, you could employ DNS round-robin to add another one. Hopefully by that time you're generating enough revenue to implement a better solution.
Solution 4:
Does your hosting provider offer load balancing services? I think that would be the best solution.
Another way to do it, though it needs to be tested, is to rewrite the requests (in Lighttpd or Apache). For example: example.com/file.html stays on Apache, while example.com/image.jpg redirects to i.example.com/image.jpg. All requests are still managed through Apache, but the responses (the upstream bandwidth) come from the Lighttpd server. The domain stays transparent to the user. You would still need to test whether Apache can handle all the requests, or maybe let Lighttpd do this job instead.
You're right that all the data passes through HAProxy, so you can't (as far as I know) do Direct Server Return with it.
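A rough sketch of the rewrite idea, assuming mod_rewrite in the Apache virtual host and the hypothetical i.example.com host for the Lighttpd box:

```
# Hypothetical mod_rewrite rules; i.example.com is just an example hostname
RewriteEngine On
# Send image requests to the static host as an external redirect;
# everything else keeps being served by Apache
RewriteRule ^/(.+\.(jpe?g|png|gif))$ http://i.example.com/$1 [NC,R=302,L]
```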
UPDATE
Looking through the HAProxy documentation, I found the "redir" parameter. I don't know if it works like an Apache rewrite, but it could be useful. The documentation says:
Main use consists in increasing bandwidth for static servers by having the clients directly connect to them.
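If it does what it sounds like, the configuration would be roughly this (the hostnames and IPs are just examples):

```
backend static
    balance roundrobin
    # With "redir", HAProxy answers GET/HEAD requests with a redirect to the
    # given prefix, so clients then fetch the file from that host directly
    server img1 10.0.0.2:80 redir http://i1.example.com check
    server img2 10.0.0.3:80 redir http://i2.example.com check
```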
Maybe it works for your case.
Solution 5:
I'm assuming that with any sizable set of images you're not storing the images under their original file names, as you would run into name conflicts pretty quickly.
A lot of applications that deal with this type of problem use a hash of the file and a directory structure based on that hash. The directory structure looks like the following, where the first-level directory is the first two characters of the hash and the second-level directory is the next two characters:
/image root/AA/AA/images
/image root/AA/AB/images
The benefit here is that hashes keep the distribution of files pretty even, and they give you a namespace that is easy to split up over multiple servers. Basically you serve portions of the hash space from different servers, and as you scale you can subdivide further as required.
The downside is that hashes aren't perfect and there can be collisions. I'm not sure how that is usually dealt with, so it may take a bit of research on your part. I imagine that a rewrite rule in the proxy should be able to take a hash, say A3A8BBC83261.jpg, and rewrite it to http://img3.domain.com/A3/A8/BBC83261.jpg. You may not consider that to be a short URL though.
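A small sketch of what I have in mind, where the host names and the two-level directory split are just assumptions for illustration:

```python
import hashlib

# Host names and the two-character directory levels are assumptions.
IMAGE_HOSTS = ["img1.domain.com", "img2.domain.com", "img3.domain.com"]

def stored_location(data):
    """Return (host, url_path) for an image, derived from a hash of its bytes."""
    digest = hashlib.sha1(data).hexdigest().upper()
    host = IMAGE_HOSTS[int(digest[:2], 16) % len(IMAGE_HOSTS)]
    # First two pairs of hex characters become the directory levels,
    # e.g. A3A8BBC8... -> /A3/A8/BBC8....jpg
    path = "/{}/{}/{}.jpg".format(digest[0:2], digest[2:4], digest[4:])
    return host, path
```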