Can anyone tell me if it is possible to pool several physical servers to run a resilient virtualization environment? Our servers are getting more and more critical to our clients, and we want to do everything we can to improve resiliency in the event of a hardware failure. I have used desktop VMs, but I am not familiar with what is possible with enterprise-level virtualization.

The ideal would be to have a few physical servers in our datacenter, with a handful of VMs spread across them running a web server, an application server, and a database server. If one physical server failed, its VMs should move to one of the other servers and continue running without any interruption.

Can this be accomplished? I realise that even Google goes down from time to time, so I am not looking for perfection; just an optimal solution.


Solution 1:

It's doable, and we do something similar, just without the automatic part.

As @ewwhite pointed out, the key is having a shared storage pool that is visible to multiple host servers, so if one host goes down it doesn't matter much, because another host can take over. Setting up the kind of unnoticeable, interruption-free automatic failover you're asking about is not easy (or cheap), and frankly it's a lot more trouble than it's worth for the vast majority of use cases out there. Modern hardware doesn't fail often unless it's set up really badly, so you'll get more mileage out of making sure it's set up right and running in an environment that's within the operational ranges of the equipment.
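
To make the trade-off concrete, here is a minimal sketch (plain Python, with made-up host/VM names and a placeholder `restart_vm` helper, not any vendor's API) of the restart-style failover most HA clusters actually implement: when a host stops heartbeating, its guests are powered back on elsewhere. That only works because the disks sit on shared storage, and it still means a brief outage rather than a truly seamless hand-off.

```python
import time

# Hypothetical inventory: which VMs run on which host, all backed by shared storage.
inventory = {
    "host-a": ["web-vm", "app-vm"],
    "host-b": ["db-vm"],
}

HEARTBEAT_TIMEOUT = 15  # seconds without a heartbeat before a host is declared dead
last_heartbeat = {host: time.time() for host in inventory}

def record_heartbeat(host: str) -> None:
    """Called whenever a host checks in (hypothetical agent/ping)."""
    last_heartbeat[host] = time.time()

def restart_vm(vm: str, target_host: str) -> None:
    """Placeholder: a real cluster would register and power on the VM on the target
    host, which is only possible because the VM's disks live on shared storage."""
    print(f"Restarting {vm} on {target_host}")

def failover_check() -> None:
    now = time.time()
    for host, vms in list(inventory.items()):
        if now - last_heartbeat[host] > HEARTBEAT_TIMEOUT:
            survivors = [h for h in inventory
                         if h != host and now - last_heartbeat[h] <= HEARTBEAT_TIMEOUT]
            if not survivors:
                continue  # nothing left to fail over to
            for i, vm in enumerate(vms):
                target = survivors[i % len(survivors)]  # naive round-robin placement
                restart_vm(vm, target)
                inventory[target].append(vm)
            inventory[host] = []

# Simulate host-a missing its heartbeats, then run the check:
last_heartbeat["host-a"] -= 60
failover_check()  # prints: Restarting web-vm on host-b / Restarting app-vm on host-b
```

The guests come back, but they reboot; anything truly interruption-free (lockstep fault tolerance and the like) is a different, much more expensive class of feature.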

We use the fail-over and high-availability functions of our systems for only two things, really. The first is disaster recovery (if our main site loses power or explodes, or what have you, the critical parts are mirrored at a second facility), and the second is avoiding maintenance windows. We use blade servers and ESX/vSphere, and between the ability to fail over to a secondary site and the ease of using vMotion to move VMs between hosts, there's very little that we can't do without a service interruption.
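
For the maintenance-window case, here is a rough sketch of what a scripted vMotion looks like through the vSphere API with pyVmomi (`pip install pyvmomi`). The vCenter address, credentials, and VM/host names are placeholders, and the exact connection and priority arguments vary a bit between pyVmomi versions, so treat it as a sketch rather than copy-paste.

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; verify certificates in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

def find_by_name(vimtype, name):
    """Return the first managed object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next((obj for obj in view.view if obj.name == name), None)
    finally:
        view.DestroyView()

vm = find_by_name(vim.VirtualMachine, "app-vm")               # guest to move
target = find_by_name(vim.HostSystem, "esx02.example.com")    # destination host

# Live-migrate the running VM. This only works because both hosts see the same
# shared storage; otherwise a relocate/Storage vMotion would be needed instead.
task = vm.MigrateVM_Task(host=target,
                         priority=vim.VirtualMachine.MovePriority.defaultPriority)
WaitForTask(task)
Disconnect(si)
```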

I would focus on getting that set up first. Once you're able to (manually) fail things over to wherever they need to go, you may decide that getting it to work automatically is more expensive and difficult than it's worth. It sounds easy enough and great in theory, but in practice it can be a real pain to get everything working properly in clusters or in a distributed-guest setup.

Solution 2:

This is an excellent reason to virtualize. As application availability, rather than individual (physical) server uptime, becomes more important to businesses, many organizations find that they can attain a higher level of reliability through virtualization.

I'll use VMware and Xen as examples, but with some form of shared storage that's visible to two or more host systems, virtualized guests can be distributed and load-balanced across physical servers. The focus then shifts to the quality of the shared storage solution, the management tooling, and the networking/interconnects in the environment.
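
As a rough illustration of the "distributed and load-balanced" idea, here is a toy placement routine in the spirit of what a DRS-style scheduler does: put each guest on the host with the most free capacity. The host names and capacity numbers are made up, and real schedulers weigh CPU, memory, affinity rules, and live load rather than a single static number.

```python
# Toy, greedy VM placement in the spirit of DRS-style load balancing.
# Hosts and VM demands (in arbitrary "capacity units") are hypothetical.
hosts = {"esx01": 64, "esx02": 64, "esx03": 64}    # total capacity per host
vms = {"web": 8, "app": 16, "db": 24, "cache": 4}  # demand per guest

used = {h: 0 for h in hosts}
placement = {}

# Place the largest guests first, each on the host with the most headroom.
for vm, demand in sorted(vms.items(), key=lambda kv: kv[1], reverse=True):
    target = max(hosts, key=lambda h: hosts[h] - used[h])
    if hosts[target] - used[target] < demand:
        raise RuntimeError(f"No host has room for {vm}")
    placement[vm] = target
    used[target] += demand

print(placement)  # {'db': 'esx01', 'app': 'esx02', 'web': 'esx03', 'cache': 'esx03'}
```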

However, one note of caution... You should evaluate what types of hardware and environmental failures actually pose a threat. Quality server-class equipment includes many redundancies (fans, power supplies, RAID, even RAM), and modern hardware does not simply fail very often. So avoid overreacting and building an unnecessarily complex environment when spec'ing higher-end servers could eliminate 90% of the potential issues.