Questions about single point of failure for small operations

  1. If you can't afford or don't need a cluster or a spare server waiting to come online in the event of a failure, it seems like you might split the services provided by one beefy server onto two less beefy servers. Thus, if Server A goes down, clients might lose access to, say, email, and if Server B goes down, they might lose access to the ERP system.

    While at first this seems like it would be more reliable, doesn't it simply increase the chance of hardware failure? So any one failure isn't going to have as great an impact on productivity, but now you're setting yourself up for twice as many failures.

    When I say "less beefy", what I really mean is lower component spec, not lower quality. So one machine spec'd out for virtualization vs. two servers spec'd out for less load each.

  2. Often a SAN is recommended so that you can use either clustering or migration to keep services up. But what about the SAN itself? If I were to put money on where a failure is going to occur, it's not going to be on the basic server hardware; it's going to have something to do with storage. If you don't have some sort of redundant SAN, then those redundant servers wouldn't give me a great feeling of confidence. Personally, for a small operation, it would make more sense to me to invest in servers with redundant components and local drives. I can see the benefit in larger operations where the price and flexibility of a SAN make it cost-effective. But for smaller shops I'm not seeing the argument, at least not for fault tolerance.


Solution 1:

This all boils down to risk management. Doing a proper cost/risk analysis of your IT systems will help you figure out where to spend the money and what risks you can, or have to, live with. There's a cost associated with everything, and that includes both HA and downtime.
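
As a rough illustration of that trade-off, here is a minimal sketch of the kind of arithmetic involved; every service name, failure rate, repair time, and hourly cost below is a made-up placeholder to swap for your own estimates:

```python
# Rough expected-downtime-cost sketch; every figure below is a
# hypothetical placeholder, not a recommendation.
services = {
    # name: (expected failures per year, hours to repair, cost per hour of outage)
    "email": (0.5, 8, 50),
    "erp":   (0.5, 8, 400),
}

for name, (failures_per_year, repair_hours, cost_per_hour) in services.items():
    expected_annual_cost = failures_per_year * repair_hours * cost_per_hour
    print(f"{name}: expected downtime cost ~ ${expected_annual_cost:,.0f}/year")

# Compare those figures with what the HA option (spare server, cluster,
# SAN, etc.) would cost per year before deciding to pay for it.
```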

I work at a small place, so I understand this struggle. The IT geek in me wants no single points of failure anywhere, but the cost of doing that at every level is not a realistic option. Here are a few things I've been able to do without a huge budget; they don't always remove the single point of failure, though.

Network Edge: We have two internet connections, a T1 and Comcast Business. We're planning to move our firewall over to a pair of old computers running pfSense, using CARP for HA.

Network: A couple of managed switches at the network core, with the critical servers' NICs bonded across both switches, keeps a single switch failure from taking out the entire data closet.

Servers: All servers have RAID and redundant power supplies.

Backup Server: I have an older system that isn't as powerful as the main file server, but it has a few large SATA drives in RAID 5 and takes hourly snapshots of the main file server. I have scripts set up so it can switch roles and become the primary file server should the main one go down.
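
For what it's worth, here is a minimal sketch of that kind of hourly snapshot job, using rsync with hard-linked snapshots; the source host, paths, and schedule are hypothetical placeholders, and the role-switching scripts would be separate:

```python
# Hourly snapshot sketch using rsync --link-dest (rsnapshot-style).
# The source host and paths are hypothetical placeholders.
import subprocess
from datetime import datetime
from pathlib import Path

SOURCE = "fileserver:/srv/shares/"      # main file server export (assumed)
SNAP_ROOT = Path("/backup/snapshots")   # directory on the local RAID 5 volume (assumed)

def take_snapshot() -> None:
    SNAP_ROOT.mkdir(parents=True, exist_ok=True)
    dest = SNAP_ROOT / datetime.now().strftime("%Y-%m-%d_%H%M")
    previous = sorted(p for p in SNAP_ROOT.iterdir() if p.is_dir())
    cmd = ["rsync", "-a", "--delete"]
    if previous:
        # Hard-link unchanged files against the newest snapshot so each
        # hourly copy only consumes space for what actually changed.
        cmd.append(f"--link-dest={previous[-1]}")
    subprocess.run(cmd + [SOURCE, str(dest)], check=True)

if __name__ == "__main__":
    take_snapshot()   # run hourly from cron
```

The same rsync pattern, pointed at a destination reachable over the VPN, also covers the nightly offsite copy described next.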

Offsite Backup Server: Similar to the onsite backup, we do nightly backups over a VPN tunnel to a server at one of the owners' houses.

Virtual Machines: I have a pair of physical servers that run a number of services inside virtual machines using Xen. These run off an NFS share on the main file server, and I can live-migrate them between the physical servers if the need arises.
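
For illustration, a minimal sketch of evacuating one of those hosts with live migration; it assumes the classic xm toolstack (newer Xen installs use xl instead), and the guest names and target host are hypothetical:

```python
# Sketch: live-migrate Xen guests to the other physical server.
# Assumes the classic "xm" toolstack; newer Xen uses "xl migrate".
# Guest names and the target host are hypothetical placeholders.
import subprocess

TARGET_HOST = "xenhost2"                     # the peer physical server (assumed)
GUESTS = ["mailfilter", "intranet", "wiki"]  # guests currently on this host (assumed)

for guest in GUESTS:
    # --live keeps the guest running while its memory is copied across;
    # the disks don't move because both hosts mount the same NFS share.
    subprocess.run(["xm", "migrate", "--live", guest, TARGET_HOST], check=True)
```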

Solution 2:

I think this is a question with many answers, but I would agree that in many smaller shops the several-server solution works and, as you say, at least something keeps going if there is a failure. But it depends on what fails.

It's very hard to cover all the bases, but redundant power supplies, good-quality power, and good backups can help.

We have used Backup Exec System Recovery for some critical systems, not so much for daily backup but as a recovery tool. We can restore to different hardware, if available, and we also use the software to convert the backup image to a virtual machine. If the server fails and we need to wait for hardware repairs, we can start a VM on a different server or workstation and limp along. Not perfect, but it can be up and running quickly.

Solution 3:

Regarding SANs: Almost any SAN you use will be internally redundant. Even if it's a single enclosure, inside will be dual power supplies, dual connectors, and dual 'heads', each with links to all disks. Even something as simple as an MD3000 sold by Dell has all these features. SANs are designed to be the core of your infrastructure, so they're built to survive just about any random hardware failure.

That being said, you have a point that redundancy isn't always the best option, ESPECIALLY if it increases complexity (and it will). A better question to ask is: "How much downtime will the company accept?" If the loss of your mail server for a day or two isn't a big deal, then you probably shouldn't bother with two of them. But if a webserver outage starts losing you real money every minute, then maybe you should spend the time building a proper cluster for it.
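
To put that question into numbers, here's a back-of-the-envelope sketch comparing what the redundant setup costs against the downtime it would actually avoid; all the figures are made-up placeholders:

```python
# Back-of-the-envelope check: is the redundant box worth it?
# Every figure below is a hypothetical placeholder.
redundancy_cost_per_year = 3000.0  # extra server, licensing, admin time
avoided_outage_hours     = 6.0     # downtime per year the redundancy would prevent
loss_per_outage_hour     = 200.0   # what an hour of that outage costs the business

avoided_loss = avoided_outage_hours * loss_per_outage_hour
print(f"Redundancy cost:        ${redundancy_cost_per_year:,.0f}/year")
print(f"Expected loss avoided:  ${avoided_loss:,.0f}/year")
print("Probably worth it" if avoided_loss > redundancy_cost_per_year
      else "Probably not worth it")
```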

Solution 4:

The more servers you have, the more chances of something breaking; that's one way of looking at it. Another is that if your single server breaks, you're up the creek 100%, just like you're saying.

The most common hardware failure is hard drives, like you were saying above. Regardless of how many servers you split operations between, you need to be RAIDing your storage.
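
RAID only helps if someone notices the failed disk, so here's a minimal monitoring sketch against Linux software RAID; it assumes md arrays and the /proc/mdstat format (hardware controllers need their vendor's tools):

```python
# Sketch: warn when a Linux md (software RAID) array is running degraded.
# Assumes md arrays and the /proc/mdstat format; hardware RAID needs vendor tools.
import re
import sys

def degraded_arrays(mdstat_path: str = "/proc/mdstat") -> list:
    with open(mdstat_path) as f:
        text = f.read()
    # Status lines end in something like "[2/2] [UU]"; an underscore
    # in the second bracket means a member disk is missing or failed.
    found = re.findall(r"^(md\d+)\s*:.*?\[([U_]+)\]", text,
                       flags=re.MULTILINE | re.DOTALL)
    return [array for array, status in found if "_" in status]

if __name__ == "__main__":
    bad = degraded_arrays()
    if bad:
        print("Degraded arrays:", ", ".join(bad), file=sys.stderr)
        sys.exit(1)
    print("All md arrays look healthy")
```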

I would vote for a couple of servers (RAIDed, of course) instead of one massive one, both for operational stability and for performance: less software bumping into each other asking for resources, reduced clutter, more disks to read from and write to, and so on.

Solution 5:

I would personally opt for multiple servers. I don't think equipment failure is more likely in this scenario. Yes, you have more equipment that could fail, but the odds of any given unit failing should be constant.
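
A quick sanity check of that intuition, using a made-up 5% annual failure probability per box (the exact number is a placeholder; only the comparison matters):

```python
# One big server vs. two smaller ones, assuming each box independently
# has the same (hypothetical) 5% chance of failing in a given year.
p = 0.05

# One server: any failure takes every service down.
p_incident_one = p

# Two servers, each carrying half the services.
p_incident_two = 1 - (1 - p) ** 2   # ~9.75%: nearly twice as many incidents
p_blackout_two = p * p              # 0.25%: but a total blackout becomes rare
expected_down_two = 0.5 * p + 0.5 * p  # expected fraction of services down stays 5%

print(f"Chance of some incident:  {p_incident_one:.2%} vs {p_incident_two:.2%}")
print(f"Chance of total blackout: {p_incident_one:.2%} vs {p_blackout_two:.2%}")
print(f"Expected services down:   {p:.2%} vs {expected_down_two:.2%}")
```

In other words, you do field nearly twice as many incidents, but each one takes out half as much, and a complete outage becomes far less likely.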

What having multiple servers in a non-redundant/non-HA configuration gives me is the ability to off-load some of the work to another server in the event of a failure. So, say my print server goes down. If I can map a few printers to the file server while I'm fixing the print server, the impact to operations is lessened. And that's where it really matters. We often tend to talk about hardware redundancy, but the hardware is only a tool for continuity of operations.