How far should we take the N+N redundancy craziness?

The industry standard when it comes to redundancy is quite high, to say the least. To illustrate my point, here is my current setup (I'm running a financial service).

Each server has a RAID array in case something goes wrong on one hard drive

.... and in case something goes wrong on the server, it's mirrored by another spare identical server

... and both servers cannot go down at the same time, because I've got redundant power, redundant network connectivity, etc.

... and my hosting center itself has dual electricity connections to two different energy providers, redundant network connectivity, and redundant toilets in case the two security guards (sorry, four) need to use them at the same time

... and in case something goes wrong anyway (a nuclear strike? I can't think of anything else), I've got another identical hosting facility in another country with the exact same setup.


  • Cost of reputational damage if down = very high
  • Probability of a hardware failure with my setup: << 1%
  • Probability of a hardware failure with a less paranoid setup: << 1% as well
  • Probability of a software failure in our application code: >> 1% (if your software is never down because of bugs, then I suggest you double-check that your reporting/monitoring system isn't down itself. Even SQL Server - which is arguably developed and tested by clever people with a strong methodology - is sometimes down)

In other words, I feel like I could host this on a cheap laptop in my mother's flat, and human/software problems would still be my biggest risk.
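
To put rough numbers on that intuition, here is a back-of-the-envelope sketch; every probability in it is an illustrative guess, not a measurement from my setup:

    # Back-of-the-envelope comparison of hardware vs. software risk.
    # All probabilities are illustrative guesses, not measured values.
    p_server_down = 0.005     # assumed yearly chance a single server fails
    p_software_bug = 0.05     # assumed yearly chance a serious bug takes us down

    # With a mirrored pair, both servers must fail at roughly the same time.
    # This assumes independent failures, which is exactly what the redundant
    # power and networking are supposed to buy you.
    p_hw_single = p_server_down
    p_hw_mirrored = p_server_down ** 2

    print(f"Hardware outage, single server: {p_hw_single:.3%}")
    print(f"Hardware outage, mirrored pair: {p_hw_mirrored:.5%}")
    print(f"Software outage, either setup:  {p_software_bug:.1%}")

Whatever the exact numbers, the mirrored-pair line shrinks quadratically while the software line doesn't move at all, which is the point.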

Of course, there are other things to take into consideration, such as:

  • scalability
  • data security
  • clients' expectations that you meet the industry standard

But still, hosting two servers in two different data centers (without extra spare servers or duplicated network equipment beyond what my hosting facility already provides) would give me the scalability and the physical security I need.

I feel like we're reaching a point where redundancy is just a communication tool. Honestly, what's the difference between 99.999% uptime and 99.9999% uptime when you know you'll be down 1% of the time because of software bugs?
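
For reference, the raw arithmetic behind those figures (nothing here is specific to my setup):

    # Translate availability percentages into downtime per year.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for availability in (0.99, 0.999, 0.9999, 0.99999, 0.999999):
        downtime = (1 - availability) * MINUTES_PER_YEAR
        print(f"{availability:.4%} uptime -> {downtime:10.2f} minutes of downtime/year")

Five nines is about 5 minutes a year, six nines about 32 seconds - both invisible next to the roughly 3.65 days a year that 1% of software downtime represents.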

How far do you push your redundancy craziness?


Solution 1:

When the cost of the redundancy is higher than the cost of being down while whatever is broken is being replaced, it's too much redundancy.
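
As a sketch of that rule of thumb (all figures below are hypothetical placeholders, not a real costing):

    # Crude break-even check for an extra layer of redundancy.
    # Every figure is a hypothetical placeholder - plug in your own numbers.
    redundancy_cost_per_year = 20_000   # extra hardware, hosting, maintenance
    outage_probability_per_year = 0.02  # chance the non-redundant part fails
    hours_to_replace = 8                # time to swap/restore whatever broke
    cost_per_hour_down = 5_000          # lost revenue + estimated reputational damage

    expected_downtime_cost = (outage_probability_per_year
                              * hours_to_replace
                              * cost_per_hour_down)

    if redundancy_cost_per_year > expected_downtime_cost:
        print("Too much redundancy: it costs more than the downtime it prevents.")
    else:
        print("The redundancy pays for itself, on expectation.")

The hard part, of course, is putting an honest number on the reputational damage side.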

Solution 2:

It's all about risk management. Even with 2x everything, you can still get downtime due to unforeseen problems.

E.g. my hosting provider has dual, redundant connections to the upstream internet. So the day that one of their cables was cut through by some building contractors, their upstream provider took the other one down for maintenance. And not only that: because all the phones were SIP, no one could phone in to say there was no connectivity, and they didn't realise there was a problem for ages.

Now that was a one-in-a-million cock-up, and it could have been prevented by adding more layers of redundancy or management oversight... but the chance of it happening was so slim that you'd never think there'd be a problem, so it wouldn't have been worth the cost of preventing it.

Another example: we implemented SQL Server mirroring at an ambulance 999 control room. Mirrored DBs should have meant there would be no problem... except that we found a bug in SQL Server that froze the main DB and prevented it from failing over to the mirror. So, although we did what we could to ensure continuous uptime, we still had to switch to manual call-taking while the DB issue was resolved. In this case, we had the best solution we could reasonably implement, plus a fallback plan in case that 'best solution' failed. Trying to ensure a total 100% uptime guarantee for the 'best solution' simply would not have been cost-effective, and probably still wouldn't have given us that 100% guarantee anyway.

Another story: we have a Europe-wide network of replicated Active Directory servers, with fallback in case of failure in any country. So when a certain admin accidentally deleted a few too many records, the solution was to stop the server and let people authenticate against the next country along. Only the replication got there first, and the records started being deleted from the other servers too... It took a week, with Microsoft's expert help, to get things fully resolved.

So - it's all down to risk vs. cost. You decide how much risk you're willing to take, and cost it. It quickly gets to a point where reducing risk further costs too much; at that point you should find alternative strategies to cope with the downtime when it happens.

Solution 3:

You're doing what I do - I don't think it's crazy at all.

Solution 4:

... and in case something goes wrong anyway (a nuclear strike? I can't think of anything else), I've got another identical hosting facility in another country with the exact same setup.

As the others have noted: this is simply a business case. The level of redundancy required is dictated directly by the requirements and expectations of your clients/users. If they pay for and expect uptime in the region of five nines, then you need to provide that. If they don't, then you should address that as a business strategy.

However, if I try to guesstimate the probability of another problem (software or human), I think it's several orders of magnitude higher than that.

Simple answer: this has to be addressed by procedure, not by physical redundancy.

If human error is causing you downtime, then you need to strengthen the error checking performed whenever humans intervene. This probably means that all platform amendments are ticketed as change requests and signed off by a second person. Or that those change requests contain more detail about the tasks to be undertaken, with no deviation permitted. Or that staff simply require more training on how to work with care in production environments.

If software error is causing you downtime, then perhaps you need to strengthen your staging procedure. Ensure that you have a good staging environment - which may well be entirely virtualised to reduce the hardware requirements - that still matches your production environment as closely as possible. Any software change should be tested in the staging environment for a specified period of time before it is rolled out for general deployment.

Solution 5:

Every design and architecture should be requirements-driven. Good systems engineering calls for defining the constraints of the design and implementing a solution that meets them. If you have an SLA with your customers that calls for 99.999% availability, then your N+N redundancy solution should account for all the LRUs (line replaceable units) that could fail. RAID, power supplies, and COOP (continuity of operations) planning should all account for that. In addition, your SLAs with vendors should be of the four-hour response time type, or you should keep a large number of spares on site.

Operational availability (Ao from here on out) is the study of exactly that. If you are doing all these things just because they seem like the right thing to do, then you are wasting your time and your customers' money. If pressed, everyone would desire five nines, but few can afford it. Have an honest discussion about the availability of the data and the system in the context of cost.
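
To make the Ao point concrete, here is a minimal sketch using the standard MTBF/MTTR approximation (all figures invented for illustration):

    # Rough availability estimate from MTBF/MTTR (all values invented).
    # Note how the vendor's contractual response time lands directly in the MTTR.
    mtbf_hours = 10_000          # assumed mean time between failures for one LRU
    vendor_response_hours = 4    # the four-hour response time from the vendor SLA
    hands_on_repair_hours = 2    # assumed time to actually swap the failed unit

    mttr_hours = vendor_response_hours + hands_on_repair_hours
    availability = mtbf_hours / (mtbf_hours + mttr_hours)

    print(f"Estimated availability: {availability:.5%}")  # ~99.94% - well short of five nines

Which is exactly why a five-nines SLA tends to mean spares on site rather than waiting on a vendor callout.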

The questions and answers posed thus far do not take the requirements into account. The thread assumes that N+N redundancy in hardware and policies is the key. Rather, I would say let your customers' requirements and your SLA drive the design. Maybe your mom's flat and your old laptop will suffice.

We geeks sometimes go looking for a problem just so we can implement a cool solution.