When to employ server failover in a virtual environment

This question got me thinking about fault tolerance in DHCP, so I did a little digging in my current environment and discovered that we have only one DHCP server per major site in our company, with no redundancy. All of our DHCP servers are virtual, protected by VMware High Availability and regular backups using Quantum VMPro, so in the event of almost any catastrophic crash of a DHCP server we can still recover within an hour.

This would lead me to think that a redundant DHCP server for failover is, well, redundant. But most of my prior experience is in the small business sector where this kind of situation just never comes up. Big business is very different.

Most of our file servers are in the same configuration, except for the few remaining physical server clusters that haven't gotten caught in our virtualization efforts yet.

So in a virtual environment, what are the decision points for adding server redundancy? Examples: When would I add a virtual DHCP standby server? Or create a virtual failover cluster for file servers? I understand that this is probably difficult to answer without enumerating the specific needs of an organization, but I think it's possible to describe a few example situations that would help an SA to be prepared before the need arises.

I'm strictly concerned about fault tolerance and failover - load balancing in this context is totally unrelated.


As always in life, and especially in IT, the answer is "it depends".

In your specific case, a virtualized environment with VMware HA, a standby server is not strictly needed. Still, DHCP is a very "light" service, so my suggestion is to just spin up DHCP on another VM (or even on another existing VM) and put the two servers in a DHCP Failover configuration if you run Windows Server 2012 or later, or in a "Split Scope" configuration if you don't.

Refer to "Understand and Deploy DHCP Failover" on TechNet.
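
To make the split-scope option a bit more concrete: the usual rule of thumb is an 80/20 split, where the primary server leases most of the range and the second server holds a small reserve for when the primary is down. Below is a minimal Python sketch of that arithmetic only. The function name `split_scope`, the example range and the 80% share are my own assumptions for illustration; on the actual DHCP servers you would express the result as exclusion ranges on each server's copy of the scope.

```python
import ipaddress


def split_scope(first, last, primary_percent=80):
    """Split an inclusive IPv4 range into (primary, secondary) lease ranges.

    primary_percent is the share of addresses the primary server hands out;
    80 is only the common rule of thumb, not a requirement.
    """
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last))
    if end < start:
        raise ValueError("last address must not be below first address")

    total = end - start + 1
    # Integer arithmetic avoids floating-point rounding surprises.
    primary_count = total * primary_percent // 100
    if not 0 < primary_count < total:
        raise ValueError("split leaves one server without addresses")

    boundary = start + primary_count
    primary = (ipaddress.IPv4Address(start), ipaddress.IPv4Address(boundary - 1))
    secondary = (ipaddress.IPv4Address(boundary), ipaddress.IPv4Address(end))
    return primary, secondary


# Hypothetical scope of 200 addresses: 192.168.1.10 through 192.168.1.209.
primary, secondary = split_scope("192.168.1.10", "192.168.1.209")
print("primary leases:  ", primary[0], "-", primary[1])
print("secondary leases:", secondary[0], "-", secondary[1])
```

Each server then excludes the other server's range, so the two never hand out the same address even though they do not talk to each other.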

For the other examples (e.g. a file server cluster), you need to evaluate some of the following:

  • How critical is the service?
  • What does it cost the business if the service is down?
  • What does it cost IT to keep the service redundant?
  • How easy is it to deploy redundancy?
  • What maintenance costs (manpower) are associated with keeping it redundant?
  • Are your other redundancy measures (e.g. VMware HA) already "good enough"?
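
One rough way to weigh these questions against each other is to compare the downtime a standby would save you per year with what the standby costs per year. The Python sketch below does just that. Every figure in it (outages per year, recovery time, cost per hour, yearly redundancy cost) is a made-up placeholder, and the function names are my own, so plug in your organization's numbers before drawing any conclusion.

```python
def annual_downtime_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly cost of outages for one service."""
    return outages_per_year * hours_per_outage * cost_per_hour


def redundancy_worth_it(outages_per_year, hours_without, hours_with,
                        cost_per_hour, redundancy_cost_per_year):
    """True if the downtime avoided per year costs more than the redundancy."""
    without = annual_downtime_cost(outages_per_year, hours_without, cost_per_hour)
    with_standby = annual_downtime_cost(outages_per_year, hours_with, cost_per_hour)
    return (without - with_standby) > redundancy_cost_per_year


# Hypothetical: 2 outages a year, 1 hour to recover via HA and backups,
# near-zero downtime with a standby, 3000 per year to run the standby.
print(redundancy_worth_it(2, 1, 0, 500, 3000))    # low-impact service
print(redundancy_worth_it(2, 1, 0, 20000, 3000))  # business-critical service
```

With these placeholder numbers the low-impact service does not justify a standby and the business-critical one does, which is really just "it depends" written as arithmetic.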