Scaling solutions for MySQL (Replication, Clustering)
I've been doing A LOT of reading on the available options. I also got my hands on High Performance MySQL 2nd edition, which I highly recommend.
This is what I've managed to piece together:
Clustering
Clustering in the general sense is distributing load across many servers that appear to an outside application as one server.
MySQL NDB Cluster
MySQL NDB Cluster is a distributed, in-memory, shared-nothing storage engine with synchronous replication and automatic data partioning (excuse me I borrow literally from the High Performance book, but they put it very nicely there). It can be a high performance solution for some applications, but web application generally do not work well on it.
The major problem is that beyond very simple queries (that touch only one table), the cluster will generally have to search for data on several nodes, allowing network latency to creep in and significantly slow down completion time for queries. Since the application treats the cluster as one computer, it can't tell it which node to fetch the data from.
In addition, the in-memory requirement is not workable for many large databases.
Continuent Sequoia
This is another clustering solution for MySQL, that acts as a middleware on top of the MySQL server. It offers synchronous replication, load balancing and failover. It also ensures that requests always get the data from the latest copy, automatically choosing a node that has the fresh data.
I've read some good things on it, and overall it sounds pretty promising.
Federation
Federation is similar to clustering, so I tugged it here as well. MySQL offers federation via the federated storage engine. Similar to the NDB cluster solution, it works well with simple queries only - but even worse the the cluster for complicated ones (since network latency is much higher).
Replication and load balancing
MySQL has the built in capacity to create replications of a database on different servers. This can be used for many things - splitting the load between servers, hot backups, creating test servers and failover.
The basic setup of replication involves one master server handling mostly writes and one or more slaves handling reads only. A more advanced variation is that of the master-master configuration, which allows to scale writes as well by having several servers writing at the same time.
Each configuration has its pros and cons, but one problem they all share is replication lag - since MySQL replication is asynchronous, not all nodes have the freshest data at all time. This requires the application to be aware of the replication and incorporate replication-aware queries to work as expected. For some applications this might not be a problem, but if you always need the freshest data things get somewhat complicated.
Replication requires some load balancing to split the load between the nodes. This can be as simple as some modifications to the application code, or using dedicated software and hardware solutions.
Sharding and partioning
Sharding is commonly used approach to scale database solutions. You split the data into smaller shards and spread them around different server nodes. This requires the application to be aware of the modification to the data storage to work efficiently, as it needs to know where to find the information it needs.
There are abstraction frameworks available to help deal with data sharding, such as Hibernate Shards, an extension to the Hibernate ORM (which unfortunately is in Java. I'm using PHP). HiveDB is another such solution which also supports shard rebalancing.
Others
Sphinx
Sphinx is a full-text search engine, that can be used for far more than test searches. For many queries it is much faster than MySQL (especially for grouping and sorting), and can query remote systems in parallel and aggregate the results - which make it very useful in use with sharding.
In general sphinx should be used with other scaling solutions to get more of the available hardware and infrastructure. The downside is that again you need the application code to be aware of sphinx to use it wisely.
Summary
Scaling solutions differ depending on the needs of the application that needs it. For us and for most web-applications, I believe that replication (probably multi-master) is the way to go with a load balancer distributing the load. Sharding of specific problem areas (huge tables) is also a must for being able to scale horizontally.
I'm also going to give a shot to Continuent Sequoia and see if it can really do what it promises to since it will involve the least amount of changes to application code.
Disclaimer: I have not used MySQL Cluster, so I'm only going from what I've heard.
MySQL Cluster is an HA (high availability) solution. It's fast, because it's all in memory, but it's real selling point is the availability. There is no single point of failure. With replication, on the other hand, if the master goes down, you have to actually switch to the replica, and there may be a small amount of down time. (although the DRBD solution is another alternative that has high availability)
Cluster requires that your entire database fit in memory. That means that each machine in the cluster needs to have enough memory to store the entire database. So this is not a feasible solution for very large databases (or at least it's a very expensive solution).
I think that unless HA is super important (read: probably not), it's more hassle (and money) than it's worth. Replication is more often the better way to go.
Edit: I forgot to mention also that Cluster does not allow foreign keys, and range scans are slower than on other engines. Here is a link that talks about Known Limitations of MySQL Cluster
There are some good discussions about how the folks that maintain drupal.org have structured their database servers:
- Dries Buytaert's blog
- WorkHabits blog
Both are from 2007, so the Clustering support may be stronger now, but at the time they chose replication.