Does it make sense to dockerize (containerize) databases?
I can understand the benefits of dockerizing stateless services such as web servers, app servers, load balancers, etc. If you are running these services on a cluster of machines, it is very easy to move the containers around with low overhead. What I don't understand, though, is the purpose behind containerizing databases. Databases are connected to a persistent data volume on a specific hard disk. Because of that state, it is neither easy nor efficient to move the database container around. So can anyone see why dockerizing a database can be useful at all?
Solution 1:
"So can anyone see why dockerizing a database can be useful at all?"
Good question, Keeto. One of the main reasons for containerizing your databases is so that you can have the same consistent environment for your entire app, not just the stateless parts, across dev, staging, and production. A consistent environment is one of the promises of Docker, but when your database lives outside this model, there is a big difference that can't be accounted for in your testing. Also, by containerizing your database as well as the rest of your app, you are more likely to be able to move your entire app between hosting providers (say, from AWS to Google Compute). If you use Amazon RDS, for example, even if you can move your web nodes to Google, you won't be able to move your database, meaning that you are heavily dependent on your cloud provider.
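As a rough sketch of that consistency argument (the image name, tag, and commands are just placeholders, not anything from your setup): if your database ships as a pinned image, every environment runs exactly the same build with the same one-line command.

```
# Hypothetical example: a custom database image built once from a Dockerfile
# (schema, config, and extensions baked in), then run identically everywhere.
docker build -t myorg/app-db:1.0 .

# Same command on a developer laptop, in CI, and in production.
docker run -d --name app-db myorg/app-db:1.0
```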
Another reason for containerizing data services is performance. This is particularly true for service providers (all the database-as-a-service offerings, e.g. Rackspace Cloud Databases, run in containers), because containers allow you to provide service guarantees that aren't possible using virtualization, and running one database per physical machine is not financially viable. Chances are you aren't running a database hosting service, but the same logic applies if you are running on bare metal and want to use containers for process isolation instead of VMs: you'll get better performance for your databases because you avoid the well-known I/O hit you take when running a database in a VM.
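To make the "service guarantees" point a bit more concrete, here is a minimal sketch (the limits and image are arbitrary examples, not a recommendation): cgroup-based resource limits let you carve one physical machine up between several database containers.

```
# Illustrative only: cap memory and weight CPU time for one tenant's database
# so many databases can share a bare-metal host with predictable behaviour.
docker run -d \
  --name tenant-db \
  -m 2g \
  --cpu-shares 512 \
  postgres:9.3
```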
I'm not saying that you should containerize your database, but these are some of the reasons why it would make sense.
Full disclosure: I work for ClusterHQ, the new project that Mark O'Connor mentioned in his answer. We have an open-source project called Flocker that makes it much easier to migrate databases and their volumes between hosts, so that the benefits I mentioned above aren't completely outweighed by the negatives that you raised in your question.
Solution 2:
Not sure I'd agree with your comment on efficiency... it's a lot easier to download and run a database container than to install the database natively. The Docker documentation describes how to implement a clean logical separation between a stateful container and its data:
- https://docs.docker.com/userguide/dockervolumes/
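For reference, here is a minimal sketch of the data-volume-container pattern those docs describe (the container names and images are just illustrative):

```
# A data-only container owns the volume; it exits immediately but keeps the volume alive.
docker run --name dbdata -v /var/lib/postgresql/data busybox /bin/true

# The database container mounts that volume, keeping state out of the database image itself.
docker run -d --name db --volumes-from dbdata postgres:9.3
```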
But... you are correct that a stateful container would be tied to its host server, unless there is some mechanism to port the data around as well.
One obvious solution is to mount a shared storage volume on all the hosts that might be running your database.
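For example (assuming a shared NFS or similar mount at /mnt/shared on every host; the paths and image are placeholders), any host could then start the database against the same data directory:

```
# Bind-mount the shared storage into the container as the database's data directory.
# Only one database instance should be writing to this directory at a time.
docker run -d \
  --name db \
  -v /mnt/shared/pgdata:/var/lib/postgresql/data \
  postgres:9.3
```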
The following article discusses a very innovative solution, where a BitTorrent-like client is used to replicate a data container between hosts:
- http://www.centurylinklabs.com/persistent-distributed-filesystems-in-docker-without-nfs-or-gluster/
Finally, a new project called Flocker is attempting to solve this problem by managing both the stateful containers and their associated ZFS volumes:
- https://github.com/ClusterHQ/flocker