MongoDB on EC2 server or AWS SimpleDB?
SimpleDB has some scalability limitations. You can only scale by sharding and it has higher latency than mongodb or cassandra, it has a throughput limit and it is priced higher than other options. Scalability is manual (you have to shard).
If you need wider query options and you have a high read rate and you don't have so much data mongodb is better. But for durability, you need to use at least 2 mongodb server instances as master/slave. Otherwise you can lose the last minute of your data. Scalability is manual. It's much faster than simpledb. Autosharding is implemented in 1.6 version.
Cassandra has weak query options but is as durable as postgresql. It is as fast as mongo and faster on higher data size. Write operations are faster than read operations on cassandra. It can scale automatically by firing ec2 instances, but you have to modify config files a bit (if I remember correctly). If you have terabytes of data cassandra is your best bet. No need to shard your data, it was designed distributed from the 1st day. You can have any number of copies for all your data and if some servers are dead it will automatically return the results from live ones and distribute the dead server's data to others. It's highly fault tolerant. You can include any number of instances, it's much easier to scale than other options. It has strong .net and java client options. They have connection pooling, load balancing, marking of dead servers,...
Another option is hadoop for big data but it's not as realtime as others, you can use hadoop for datawarehousing. Neither cassandra or mongo have transactions, so if you need transactions postgresql is a better fit. Another option is Amazon RDS, but it's performance is bad and price is high. If you want to use databases or simpledb you may also need data caching (eg: memcached).
For web apps, if your data is small I recommend mongo, if it is large cassandra is better. You don't need a caching layer with mongo or cassandra, they are already fast. I don't recommend simpledb, it also locks you to Amazon as you said.
If you are using c#, java or scala you can write an interface and implement it for mongo, mysql, cassandra or anything else for data access layer. It's simpler in dynamic languages (eg rub,python,php). You can write a provider for two of them if you want and can change the storage maybe in runtime by a only a configuration change, they're all possible. Development with mongo,cassandra and simpledb is easier than a database, and they are free of schema, it also depends on the client library/connector you're using. The simplest one is mongo. There's only one index per table in cassandra, so you've to manage other indexes yourself, but with the 0.7 release of cassandra secondary indexes will bu possible as I know. You can also start with any of them and replace it in the future if you have to.
I think you have both a question of time and speed.
MongoDB / Cassandra are going to be much faster, but you will have to invest $$$ to get them going. This means you'll need to run / setup server instances for all them and figure out how they work.
On the other hand, you don't have to per a "per transaction" cost directly, you just pay for the hardware which is probably more efficient for larger services.
In the Cassandra / MongoDB fight here's what you'll find (based on testing I'm personally involved with over the last few days).
Cassandra:
- Scaling / Redundancy is very core
- Configuration can be very intense
- To do reporting you need map-reduce, for that you need to run a hadoop layer. This was a pain to get configured and a bigger pain to get performant.
MongoDB:
- Configuration is relatively easy (even for the new sharding, this week)
- Redundancy is still "getting there"
- Map-reduce is built-in and it's easy to get data out.
Honestly, given the configuration time required for our 10s of GBs of data, we went with MongoDB on our end. I can imagine using SimpleDB for "must get these running" cases. But configuring a node to run MongoDB is so ridiculously simple that it may be worth skipping the "SimpleDB" route.
In terms of DAO, there are tons of libraries already for Mongo. The Thrift framework for Cassandra is well supported. You can probably write some simple logic to abstract away connections. But it will be harder to abstract away things more complex than simple CRUD.
Now 5 years later it is not hard to set up Mongo on any OS. Documentation is easy to follow, so I do not see setting up Mongo as a problem. Other answers addressed the questions of scalability, so I will try to address the question from the point of view of a developer (what limitations each system has):
I will use S for SimpleDB and M for Mongo.
- M is written in C++, S is written in Erlang (not the fastest language)
- M is open source, installed everywhere, S is proprietary, can run only on amazon AWS. You should also pay for a whole bunch of staff for S
- S has whole bunch of strange limitations. M limitations are way more reasonable. The most strange limitations are:
- maximum size of domain (table) is 10 GB
- attribute value length (size of field) is 1024 bytes
- maximum items in Select response - 2500
- maximum response size for Select (the maximum amount of data S can return you) - 1Mb
- S supports only a few languages (java, php, python, ruby, .net), M supports way more
- both support REST
- S has a query syntax very similar to SQL (but way less powerful). With M you need to learn a new syntax which looks like json (also it is straight-forward to learn the basics)
- with M you have to learn how you architect your database. Because many people think that schemaless means that you can throw any junk in the database and extract this with ease, they might be surprised that Junk in, Junk out maxim works. I assume that the same is in S, but can not claim it with certainty.
- both do not allow case insensitive search. In M you can use regex to somehow (ugly/no index) overcome this limitation without introducing the additional lowercase field/application logic.
- in S sorting can be done only on one field
- because of 5s timelimit count in S can behave strange. If 5 seconds passed and the query has not finished, you end up with a partial number and a token which allows you to continue query. Application logic is responsible for collecting all this data an summing up.
- everything is a UTF-8 string, which makes it a pain in the ass to work with non string values (like numbers, dates) in S. M type support is way richer.
- both do not have transactions and joins
- M supports compression which is really helpful for nosql stores, where the same field name is stored all-over again.
- S support just a single index, M has single, compound, multi-key, geospatial etc.
- both support replication and sharding
One of the most important things you should consider is that SimpleDB has a very rudimentary query language. Even basic things like group by
, sum
average
, distinct
as well as data manipulation is not supported, so the functionality is not really way richer than Redis/Memcached. On the other hand Mongo support a rich query language.