Good failover / high availability solutions for Linux? [closed]

Solution 1:

http://linux-ha.org/ for all your high-availability needs. Like the song says, the best things in life are free.

Solution 2:

I have used a variety of cluster solutions on Linux. I'm also a configuration management proponent, so for each one I'll add a note on how well it works with configuration management tools such as Chef or Puppet.

Veritas Cluster Server (VCS). It's been a while, but we deployed a few Linux VCS clusters on RHEL 3.0. I would hope it's available on RHEL 5.0. This is familiar territory for you, so you already know how difficult it is to set up. As you may be aware, VCS is expensive. Anecdotally, VCS is not well suited to being set up by configuration management.

Speaking of RHEL, Red Hat Cluster Suite (RHCS) has matured a lot since its original release with RHEL 2.1. The setup/configuration phase is pretty straightforward and the documentation is very complete and helpful. As with VCS, you can purchase support from the vendor. For a commercial HA product, RHCS is reasonably priced. I would use configuration management only to install the packages, and then maintain the clusters "by hand" through the web interface. Also, I've heard of some people using it on non-Red Hat platforms, though I don't have experience with that directly.

Linux-HA (DRBD/Heartbeat) is great as well, though coming from VCS the configuration may seem simplistic yet unwieldy. It is pretty easy to automate with a configuration management tool.

As a proof of concept, I've installed a Linux cluster with IBM's HACMP, their AIX clustering software. I would not recommend this; as I recall, it is more expensive than even VCS. IBM has specific procedures for installing and maintaining HACMP, so I would not use configuration management here.

Solution 3:

Michael is correct that the community is a bit fractured right now, and documentation is a tad sparse.

Actually, it's all there, it's just impossible to understand. What you really want is the "Pacemaker Configuration Explained" ebook... (Link to PDF). You'll want to read it about a dozen times, and then try to implement it, and then read it another dozen times so that you can actually grok it.

The best supported implementation of cluster services for Linux at this point is probably going to be Novell's SLES11 and its High Availability Extension (HAE). It JUST came out a month or two ago, and it comes with a nice thick 200-page manual that describes how to set it up and get things running. Novell has also been excellent about supporting Pacemaker configurations in various forms.

Beyond that, there's RHEL5's implementation, which has the same package and decent documentation, but I think it's more expensive than SLES. At least, it is for us.

I would avoid Heartbeat right now and go with Pacemaker/OpenAIS because they're going to be much better supported going into the future. HOWEVER, the current state of the community is such that there are a few experts, a few people who are running it in production, and a whole ton of people who are completely clueless. Join the Pacemaker mailing list and pay attention to a man named Andrew Beekhof.

Edit to provide requested details:

Pacemaker/OpenAIS uses a 'monitor' operation on a 'primitive resource' (e.g. nfs-server) to keep track of what the resource is doing. If the example NFS server goes unresponsive to the rest of the cluster for X seconds, the cluster executes a STONITH (Shoot The Other Node In The Head) operation to shut down the primary node and promotes the secondary node to active. You decide in the configuration what to bring up afterward and which associated actions to take. Implementation details from there depend on what service you're trying to make fail over and on the execution windows for certain operations (such as promoting the primary node back to master). The whole thing's pretty much as configurable as possible.
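
To make the monitor / timeout / STONITH sequence a bit more concrete, here is a toy model in Python. It is only a sketch of the idea described above, not Pacemaker's actual API or configuration syntax: the names `Node`, `ToyCluster`, `record_monitor`, `check` and `run_demo`, the node names, and the 30-second timeout are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    role: str = "standby"   # "active", "standby", or "fenced"
    last_ok: float = 0.0    # time of the last successful monitor result


@dataclass
class ToyCluster:
    """Hypothetical two-node cluster; not how Pacemaker is implemented."""
    primary: Node
    secondary: Node
    timeout: float           # the "X seconds" from the description above
    log: list = field(default_factory=list)

    def record_monitor(self, node, now, healthy):
        # Remember when the resource on this node last answered its monitor.
        if healthy:
            node.last_ok = now

    def check(self, now):
        # Find the node currently running the resource, if any.
        nodes = [self.primary, self.secondary]
        active = next((n for n in nodes if n.role == "active"), None)
        if active is None or now - active.last_ok <= self.timeout:
            return False
        # Fence the unresponsive node first so it cannot keep serving.
        active.role = "fenced"
        self.log.append(f"stonith {active.name}")
        # Then promote a standby node, if one is still available.
        standby = next((n for n in nodes if n.role == "standby"), None)
        if standby is not None:
            standby.role = "active"
            standby.last_ok = now
            self.log.append(f"promote {standby.name}")
        return True


def run_demo():
    a = Node("node-a", role="active")
    b = Node("node-b")
    cluster = ToyCluster(primary=a, secondary=b, timeout=30.0)
    # node-a answers its monitor at t=0 and t=10, then goes silent.
    for t, healthy in [(0, True), (10, True), (20, False),
                       (30, False), (40, False), (50, False)]:
        cluster.record_monitor(a, t, healthy)
        cluster.check(t)
    return cluster, a, b


if __name__ == "__main__":
    cluster, a, b = run_demo()
    print(cluster.log)
```

In the demo the last good monitor result is at t=10, so the timeout is first exceeded at t=50, at which point the model fences node-a and promotes node-b. A real cluster does far more (quorum, fencing devices, resource ordering and constraints), which is where the configuration choices mentioned above come in.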