High server availability for a small business

After having a bit of a scare with a server that wouldn't come up one morning, the higher-ups have decided that the business needs a high-availability / failover setup.

We have 5 main servers (4x Linux, 1x OpenBSD) all of which need to be running for the company to operate. Three of the servers are fairly standard (Files/Web/Database), the fourth handles most network routing and web proxies, while the fifth supports our phone system and has non-standard hardware.

My boss has stated that turnaround time for a server failure should be under 30 minutes.

My experience in this field is non-existent (I'm just a programmer who was 'promoted'), so I guess my question really boils down to:

  • Is this something that should even be attempted by someone with average server-admin skills? If so, what should I read, and who should I talk to?

Thanks.


I think you should start by getting numbers together to describe the cost associated with fulfilling the stated "requirement" to see if it even falls within the budget. If you're not comfortable with all of the "normal" methods that would be used to fulfill the requirement (failover clustering, hypervisors with "hot migration" capability, etc), then you'd probably do well to find a consultant who can help out.

There's going to be some cost associated with the feasibility study. But discovering up front that a good solution won't fit within the budget (meaning that management needs to set more realistic expectations, or pony up more money) will cost a lot less than doing something half-assed that blows a ton of money and still doesn't fulfill the requirement.

It sounds like your boss just pulled that number out of the air. Perhaps he's done some analysis and knows what the cost-per-hour associated with downtime of various systems is, but I doubt it. It sounds like some pie-in-the-sky number that isn't tied to reality. I'd be surprised if all your systems need that kind of availability. In the course of studying the business, you may discover that only a subset of functionality needs such a degree of uptime and fault-tolerance (and, thus, that a solution would ultimately cost less). I'm sure that phones and the line-of-business application are up there, but you may have some tolerance for downtime on some of the other systems.
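To make the cost-per-hour idea concrete, here is a rough Python sketch of the kind of back-of-the-envelope numbers I mean. Every figure in it (hourly cost, failure rate, current restore time) is a made-up placeholder, and the `System` class and function names are just for illustration. You would have to get the real numbers from the business.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class System:
    name: str
    cost_per_hour: float     # ASSUMPTION: money lost per hour this system is down
    outages_per_year: float  # ASSUMPTION: expected number of failures per year
    hours_to_restore: float  # ASSUMPTION: how long a restore takes today


def expected_annual_loss(s: System, hours_to_restore: Optional[float] = None) -> float:
    """Expected yearly downtime cost: cost/hour x outages/year x hours/outage."""
    hours = s.hours_to_restore if hours_to_restore is None else hours_to_restore
    return s.cost_per_hour * s.outages_per_year * hours


def ha_savings(s: System, target_hours: float = 0.5) -> float:
    """Yearly loss avoided if restores dropped to target_hours (30 min by default)."""
    return expected_annual_loss(s) - expected_annual_loss(s, target_hours)


# Placeholder figures only; replace them with numbers from the business.
systems = [
    System("phones", cost_per_hour=400.0, outages_per_year=0.5, hours_to_restore=8.0),
    System("web", cost_per_hour=50.0, outages_per_year=0.5, hours_to_restore=8.0),
]

for s in systems:
    print(f"{s.name}: loses {expected_annual_loss(s):.2f}/year today, "
          f"a 30-minute restore would save {ha_savings(s):.2f}/year")
```

If the yearly savings for a system come out well below what redundant hardware for it would cost per year, that system probably doesn't need the 30-minute target.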

My gut says that you're probably going to find a win in using virtualization technologies to create a failover system based on migration of virtual machines between redundant hardware. Whether it'll fit your budget or not will depend on your business, since you'll definitely need some type of SAN to make that work effectively.

Don't discount "traditional" failover clustering, though. There are definitely "wins" there, too, if your applications are well suited to such a configuration.

I wonder if your boss has thought about catastrophic failure scenarios (building burns, flood, tornado, theft, etc). If that's not already planned-for, this would be a golden opportunity to work in some general business continuity planning and disaster recovery contingency.

Get some help from somebody who can come in and study your business and make recommendations. You won't regret it.


"This road leads to much pain and hurt..."

So, what is your business continuity plan? Your disaster recovery plan?

Have you discussed it? Written it down? TESTED IT?

You need to have a proper conversation with the higher-ups and really get to the bottom of the requirements for high availability, because they are different for different services.

So what really was the "pain point" that they felt that morning?

Was it:

  • Telephones stopped working? Fairly major (and visible) problem. And yes - this will need a "solution" but hopefully this is under a support agreement?
  • Web site failed? OK - Fairly visible but not unexpected, and unless you have a HUGE web presence then not that important. OK to have this server down for a few hours.
  • Database server down? Scary... Hope you've got good backups! Don't lose the data, otherwise the business WILL fail. But as long as the data is secure, it's just another important server that should have a recovery plan.
  • File and print (and internal apps etc.) down? This is a PITA for most people, as they will sit around and do nothing for a morning while you fix it.

I assume you have bought high-quality hardware for your main systems? Good, because cheaping out on hardware is a false economy: proper servers come with "dual" everything in the box.

I will also assume you know HOW to rebuild a server, swap fans and power supplies, rack a server, and configure dual-path networks into redundant switches? You've done this enough times to understand what works and what doesn't, what is normal and what is erroneous? If not, then get help and training (or at least practice and experience).

Maybe a lot of the problem was FEAR. They did not have a clue that such a problem could happen (and how important the servers were to their business) and you didn't really know what you were doing (?) A confidence issue?

You need to get all of the above right BEFORE going down the very expensive HA route. Can the business afford this expensive equipment? Most of it, by definition, will only ever be used in a failure, and often it is never used at all!