In Mongo what is the difference between sharding and replication?
In the context of scaling MongoDB:
- Replication creates additional copies of the data and allows for automatic failover to another node. Replication may help with horizontal scaling of reads if you can tolerate reading data that potentially isn't the latest.
- Sharding allows for horizontal scaling of data writes by partitioning data across multiple servers using a shard key. It's important to choose a good shard key. For example, a poor choice of shard key could lead to "hot spots" where data is only being written to a single shard.
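To make the "hot spot" point concrete, here is a small simulation in plain Python. It does not talk to MongoDB; the shard count, the range boundaries, and both function names are made up for illustration, and md5 is only a stand-in for whatever hash a real deployment would use. It compares where writes land when an ever-increasing key is range-partitioned versus hash-partitioned.

```python
from bisect import bisect_right
from collections import Counter
import hashlib

# Hypothetical cluster: 4 shards, each owning one contiguous range of key values.
# RANGE_BOUNDS holds the exclusive upper bounds of the first three ranges;
# the last shard owns everything from 750 upwards.
NUM_SHARDS = 4
RANGE_BOUNDS = [250, 500, 750]

def shard_for_range_key(key: int) -> int:
    """Range partitioning: pick the shard whose range contains the key."""
    return bisect_right(RANGE_BOUNDS, key)

def shard_for_hashed_key(key: int) -> int:
    """Hash partitioning: hash the key first so neighbouring keys scatter."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# New documents get ever-increasing keys (think counters or timestamps).
new_keys = range(1000, 1200)

range_writes = Counter(shard_for_range_key(k) for k in new_keys)
hashed_writes = Counter(shard_for_hashed_key(k) for k in new_keys)

print("range-partitioned:", dict(range_writes))   # every write hits shard 3
print("hash-partitioned: ", dict(hashed_writes))  # writes spread over several shards
```

With the range-partitioned key, all 200 new writes fall into the last range, so a single shard takes the entire write load while the others sit idle.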
A sharded environment does add more complexity because MongoDB now has to manage distributing data and requests between shards -- additional configuration and routing processes are added to manage those aspects.
Replication and sharding are typically combined to create a sharded cluster where each shard is supported by a replica set.
From a client application point of view you also have some control in relation to the replication/sharding interaction, in particular:
- Read preferences
- Write concerns
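As a rough illustration of where those two settings can live, the sketch below builds a MongoDB connection string using only the Python standard library. The host names and the replica set name `rs0` are placeholders, not taken from any real deployment; `readPreference`, `w`, and `wtimeoutMS` are standard connection string options, but the values chosen here are just one plausible combination.

```python
from urllib.parse import urlencode

# Hypothetical replica set members and set name, for illustration only.
hosts = ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"]

options = {
    "replicaSet": "rs0",
    # Read preference: allow reads from secondaries, accepting possibly stale data.
    "readPreference": "secondaryPreferred",
    # Write concern: wait until a majority of members acknowledge each write.
    "w": "majority",
    # Give up waiting for that acknowledgement after 5 seconds.
    "wtimeoutMS": 5000,
}

uri = "mongodb://" + ",".join(hosts) + "/?" + urlencode(options)
print(uri)
```

A driver given this URI would apply the read preference and write concern as defaults, and most drivers also let you override them per operation.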
Suppose you have a great music collection on your hard disk, stored in folders in logical order by year of release. You are concerned that the collection will be lost if the drive fails, so you get a new disk and occasionally copy the entire collection to it, keeping the same folder structure.
Sharding >> Keeping your music files in different folders
Replication >> Syncing your collection to other drives
Replication is a mostly traditional master/slave setup: data is synced to backup members, and if the primary fails, one of them can take its place. It is a reasonably simple tool. It's primarily meant for redundancy, although you can scale reads by adding replica set members. That's a little complicated, but works very well for some apps.
Sharding sits on top of replication, usually. "Shards" in MongoDB are just replica sets with something called a "router" in front of them. Your application will connect to the router, issue queries, and it will decide which replica set (shard) to forward things on to. It's significantly more complex than a single replica set because you have the router and config servers to deal with (these keep track of what data is stored where).
If you want to scale Mongo horizontally, you'd shard. 10gen likes to call the router/config server setup auto-sharding. It's also possible to do a cruder, manual form of sharding where the app itself decides which DB to write to.
Sharding
Sharding is a technique for splitting up a large collection among multiple servers. When we shard, we deploy multiple mongod servers, with a mongos in front acting as a router. The application talks to this router, and the router in turn talks to the various mongod servers. The application and the mongos are usually co-located on the same server, and multiple mongos services can run on the same machine. It's also recommended to keep a set of multiple mongod instances (together called a replica set) instead of one single mongod on each server. A replica set keeps the data in sync across several different instances so that if one of them goes down, we won't lose any data. Logically, each replica set can be seen as a shard. All of this is transparent to the application. MongoDB decides how to distribute the data based on a shard key that we choose.
Assume that for the student collection we have stdt_id as the shard key (it could also be a compound key). The mongos server uses a range-based system: based on the stdt_id that we send as the shard key, it sends the request to the right mongod instance.
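The range-based lookup described above can be sketched in a few lines of Python. This is only a toy model of the idea, not how mongos is implemented: the boundaries, the shard names, and the `route` function are all invented for the example.

```python
from bisect import bisect_right

# Hypothetical range map, of the kind the config servers keep track of:
# each shard owns one contiguous range of stdt_id values.
# BOUNDARIES lists the lower bounds of the 2nd, 3rd and 4th ranges.
BOUNDARIES = [1000, 2000, 3000]
SHARDS = ["shard-a", "shard-b", "shard-c", "shard-d"]

def route(stdt_id: int) -> str:
    """Return the name of the shard whose range contains stdt_id."""
    return SHARDS[bisect_right(BOUNDARIES, stdt_id)]

for stdt_id in (42, 1000, 2999, 5000):
    print(stdt_id, "->", route(stdt_id))
# 42 -> shard-a
# 1000 -> shard-b
# 2999 -> shard-c
# 5000 -> shard-d
```

Because the ranges are sorted, a binary search over the boundaries is enough to find the owning shard for any key value.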
So, what do we need to really know as a developer?
- An insert must include the shard key, so if it's a multi-part shard key, we must include the entire shard key. We have to understand what the shard key is on the collection itself.
- For an update, remove, or find: if mongos is not given a shard key, it has to broadcast the request to all the different shards that cover the collection.
- For an update: if we don't specify the entire shard key, we have to make it a multi update so that mongos knows it needs to broadcast it.
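The rules in this list can be summarised as a small decision function. The sketch below is a simplified model written for illustration: `SHARD_KEY`, `ALL_SHARDS`, `owning_shard`, and `plan` are hypothetical names, the shard lookup is a trivial stand-in for the real range lookup, and the errors simply mirror the rules as stated above rather than the exact behaviour of any particular MongoDB version.

```python
from typing import List

SHARD_KEY = ("stdt_id",)  # hypothetical; a compound key would list several fields
ALL_SHARDS = ["shard-a", "shard-b", "shard-c"]

def owning_shard(key_values: tuple) -> str:
    """Stand-in for the real lookup: pick a shard deterministically."""
    return ALL_SHARDS[sum(key_values) % len(ALL_SHARDS)]

def plan(operation: str, query: dict, multi: bool = False) -> List[str]:
    """Return the shards a request would be sent to under the rules above."""
    has_full_key = all(field in query for field in SHARD_KEY)
    if has_full_key:
        # Targeted: only the shard owning this key value is contacted.
        return [owning_shard(tuple(query[f] for f in SHARD_KEY))]
    if operation == "insert":
        raise ValueError("insert must include the entire shard key")
    if operation == "update" and not multi:
        raise ValueError("update without the full shard key must be a multi update")
    # Broadcast: every shard covering the collection is contacted.
    return list(ALL_SHARDS)

print(plan("find", {"stdt_id": 7}))                   # one shard
print(plan("find", {"name": "Ada"}))                  # all shards
print(plan("update", {"name": "Ada"}, multi=True))    # all shards
```

Queries that carry the full shard key touch one shard, which is why including it wherever possible keeps request cost roughly constant as more shards are added.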