(*nix) Cloud/cluster solutions for scalable web-services

I'm going to build a high-performance web service. It should use a database (or any other storage system), some processing language (either scripting or not), and a web-server daemon. The system should be distributed to a large amount of servers so the service runs fast and reliable.

It should replicate data to achieve reliability and at the same time it must provide distributed computing features in order to process large amounts of data (primarily, queries on large databases that won't survive being executed on a single server with a suitable level of responsiveness). Caching techniques are out of the subject.

Which cluster/cloud solutions I should take for the consideration?

There are plenty of Single-System-Image (SSI), clustering file systems (can be a part of the design), projects like Hadoop, BigTable clones, and many others. Each has its pros and cons, and "about" page always says the solution is great :) If you've tried to deploy something that addresses the subject - share your experience!

UPD: It's not a file hosting and not a game, but something rather interactive. You can take ServerFault as an example of a web-service: small pieces of data, semi-static content, intensive database operations.


For those who might be interested:

Cross-Post on StackOverflow

Related questions:

  • What cluster management software to use for linux?
  • What distributed shell utilities do people feel are good, flexible, and easy to use?

Solution 1:

Facebook is using cassandra for data storage.

Here is article about scaling youtube and google architecture and prestentation: Designs, Lessons and Advice from Building Large Distributed Systems by Jeff Dean of Google describing how they do their thing.

Solution 2:

Hadoop + Hive (or PIG) is built for dealing with massive data. This is what Yahoo (4000 node cluster), Facebook, eHarmony etc use.

I believe you can get branded packages/support from Cloudera.com, or you can get it yourself at apache.org

It is wicked easy to setup and it is awesome when dealing with GB-PB of data queries.

You could easily test it out on EC2 (that is one of their options) for almost no cost.

Solution 3:

It's impossible to answer without knowing exactly what you're doing; it may be rather difficult even then.

Based on what I've read (and tried out), Cassandra seems pretty good, but you should not consider it as part of a design without understanding exactly how it works and what its limitations are.

This kind of thing is never easy, and moreover, this is more of a question for Stackoverflow.

Solution 4:

I somewhat freely take the gist of OPs request to be "a mature cloud computing platform that is easy to grok for programmers, and easy to scale for operations". We are not quite there yet; to the best of my knowledge there are no mature, commercially available systems which span the entire chain from HTTP request, over processing, to permanent storage.

The closest thing today is probably partitioned data grid middleware like Oracle's Coherence or maybe Terracotta. Oracle Coherence has been good for Squarespace and other web applications. Of course Oracle would also happily sell you a partitioned Oracle database that can handle massive amounts of data and just works. And the price ... if you need to ask you can't afford it.

If you need cheap(er), then you're looking at some degree of do-it-yourself using open source components. The Hadoop family is the most comprehensive and mature open-source "BigTable" and "Map/Reduce" like set of tools. Sharded MySQL is popular for data storage, and is still a mostly DIY solution. "NoSQL" is gaining momentum right now, but it's still early days for NoSQL.

Which cluster/cloud solutions I should take for the consideration?

Don't you have it backwards? What evidence do you have of your application reaching Internet scale, what are the observed data access patterns at current scale like, and which solutions & languages does your team have prior experience with?