Why NoSQL is better at "scaling out" than RDBMS?

I have read the following text in a technical blog discussing the advantages and disadvantages of NoSQL:

"For years, in order to improve performance on database servers, database administrators have had to buy bigger servers as the database load increases (scaling up) instead of distributing the database across multiple “hosts” as the load increases (scaling out). RDBMS do not typically scale out easily, but the newer NoSQL databases are actually designed to expand easily to take advantage of new nodes and are usually designed with low-cost commodity hardware in mind."

I became confused about the scalability of RDBMS and NoSQL.

My confusion are:

  1. Why RDBMS are less able to scale out? And the reason of buying bigger servers instead of buying more cheap ones.
  2. Why NoSQL is more able to scale out?

So I've been trying to figure out the real bottom-line when it comes to NoSQL vs RDBMS myself, and always end up with a response that doesn't quite cut it. In my search there are really 2 primary differences between NoSQL and SQL, with only 1 being a true advantage.

  1. ACID vs BASE - NoSQL typically leaves out some of the ACID features of SQL, sort of 'cheating' it's way to higher performance by leaving this layer of abstraction to the programmer. This has already been covered by previous posters.

  2. Horizontal Scaling - The real advantage of NoSQL is horizontal scaling, aka sharding. Considering NoSQL 'documents' are sort of a 'self-contained' object, objects can be on different servers without worrying about joining rows from multiple servers, as is the case with the relational model.

Let's say we want to return an object like this:

post {
    id: 1
    title: 'My post'
    content: 'The content'
    comments: {
      comment: {
        id: 1
      }
      comment: {
        id: 2
      }
      ...

    views: {
      view: {
        user: 1
      }
      view: {
        user: 2
      }
      ...
    }
}

In NoSQL, that object would basically be stored as is, and therefore can reside on a single server as a sort of self-contained object, without any need to join with data from other tables that could reside on other DB servers.

However, with Relational DBs, the post would need to join with comments from the comments table, as well as views from the views table. This wouldn't be a problem in SQL ~UNTIL~ the DB is broken into shards, in which case 'comment 1' could be on one DB server, while 'comment 2' yet on another DB server. This makes it much more difficult to create the very same object in a RDBMS that has been scaled horizontally than in a NoSQL DB.

Would any DB experts out there confirm or argue these points?


RDBMS have ACID ( http://en.wikipedia.org/wiki/ACID ) and supports transactions. Scaling "out" with RDBMS is harder to implement due to these concepts.

NoSQL solutions usually offer record-level atomicity, but cannot guarantee a series of operations will succeed (transaction).

It comes down to: to keep data integrity and support transactions, a multi-server RDBMS would need to have a fast backend communication channel to synchronize all possible transactions and writes, while preventing/handling deadlock.

This is why you usually only see 1 master (writer) and multiple slaves (readers).


Typical RDBMs make strong guaranties about consistency. This requires to some extent communication between nodes for every transaction. This limits the ability to scale out, because more nodes means more communications.

NoSQL systems make different tradeoffs. For example they don't guarantee that a second session will immediately see the data committed by the first session. Thereby decoupling the transaction of storing some data from the process of making that data available for every user. Google "eventually consistent". So a single transaction doesn't need to wait for any (or for much less) inter node communication. Therefore they are able to utilize a large amount of nodes much more easily.


Why NoSQL databases can be easily horizontally scaled than SQL ones? I have been trying to figure out why people keep saying this. I came across many articles which only confused me with their not-industry familiar terminologies and vague assumptions. I will suggest you read Designing Data-intensive applications by Martin Kleppman. Also, I will share some of my understanding of this subject.

JOINS - in the case of many-to-one or many-to-many relationships there is no way that any database invented till now can keep the data together in one table or document so if the data is sharded(or partitioned), either it is SQL or NoSQL, the latency will be same, the database has to look for both the documents. NoSQL seems to dominate only in the case of one to many relationships. For example:

NoSql

Student

{
  "name": "manvendra",
  "education": [
    {
      "id": 1,
      "Degree": "High School"
    },
    {
      "id": 2,
      "Degree": "B.Tech"
    }
  ]
}

Eduction Institute collection

[
  {
    "id": "1",
    "name": "army public school"
  },
  {
    "id": "2",
    "name": "ABES Engineering College"
  }
]

Sql

Student Table

id | name        
1  | Manvendra

Education Institute

id | Name
1  | Army public school
2  | ABES Engineering college

Studies Table

student  | education institute | degree
1        | 1                   | high school
1        | 2                   | B.tech

Now suppose in the case of NoSql if both collection's data is on different nodes there will some extra time required to resolve the ids of the education institute and this situation is similar in the case of SQL databases so where is the benefit? I can't think of any.

Also, you must be thinking why can't we store the education institute info also in the same student collection, then it will be like:

{
  "name": "manvendra",
  "education": [
    {
      "name": "Army public school",
      "Degree": "High School"
    },
    {
      "name": "ABES Engineering College",
      "Degree": "B.Tech"
    }
  ]
}

which is really a bad design because there is a many-to-many relationship between student and education institute, many students might have studied from the same institute so tomorrow if there is a change in name or any information of the institute it will be really a very difficult challenge to change at all places.

However, in the case of a one-to-many relationship, we can club all the info together for example: Consider a customer and an order relationship

{
  "name": "manvendra",
  "order": [
    {
      "item": "kindle",
      "price": "7999"
    },
    {
      "item":"iphone 12",
      "price":"too much"
    }
  ]
}

Since an order only belongs to one customer it makes sense to store order info in one place however storing item id or name is another choice anyway, if we use SQL database here, there will be two tables with orders and customers which will not give good results to queries if data is not stored in the same node.

So saying joins in an argument as to why the NoSql database is easier to scale horizontally does not make sense.

TRANSACTIONS

Both SQL(Postgres, MySQL, etc) and NoSQL(MongoDB, Amazon's DynamoDB, etc) support transactions so there is nothing left to discuss on that.

ACID

ACID is overused just like CAP actually it is all about showing a single copy of data to the client instead actually there might be multiple copies of data(to enhance availability, fault-tolerance, etc) and what strategies the database uses to do that. For example in Postgres in the case of a master-slave distributed system, one can opt for synchronous or asynchronous replication and the replication is made possible with WAL(Write ahead logs) and same is the case in MongoDB, only in place of WAL it has oplog(Operations Log), both support streaming replication and failovers. Then where is the difference? Actually, I can't find a very strong reason that why NoSql databases can be scaled easily. What I can say is NoSql is the latest so databases come with ready-made support for horizontal scaling for example consider Mongos in MongoDB, they do all the dirty work of sharding documents, routing requests to the specific shard, etc. So tomorrow if Postgres or MySQL come up with some mechanism of intelligently sharding tables so all the related data is mostly kept in one node then it may put an end to this debate because there is nothing intrinsic in a relational database that prevents it from horizontal scaling.

On an optimistic note, I believe in the near future it will all be about the strategies. How you are planning to scale and those strategies will be independent of how you are storing data either in tables or documents. For example in Amazon's DocumentDB, there is a concept of auto-scaling in and out but if you want to achieve this with sharding it will be a burden to copy data each time you are scaling in and out. In DocumentDB this is taken care of as a shared cluster volume(data storage is separated from computing) which is nothing but a shared disk to all the instances(primary or secondary) and to escape from the risk of the shared disk failure DocumentDB replicates data of the shared disk to six other disks in different availability zones. So point to be noted here is DocumentDB mixed the concept of the shared disk and standard replication strategy to achieve its goal. So it is all about the strategy you are using in your database which is what matters