100% uptime for a web application

Solution 1:

Here is Wikipedia's handy chart of the pursuit of nines:

[Image: Wikipedia's chart of the "nines" of availability]

Interestingly, only 3 of the top 20 websites were able to achieve the mythical 5 nines, or 99.999% uptime, in 2007. They were Yahoo, AOL, and Comcast. In the first 4 months of 2008, some of the most popular social networks didn't even come close to that.

From the chart, it should be evident how ridiculous the pursuit of 100% uptime is...
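The arithmetic behind a chart like that is easy to reproduce. Here is a minimal Python sketch, assuming a 365-day year; the function names are my own and purely illustrative:

```python
# Allowed downtime per year for each availability level ("nines").
# Assumption: a non-leap year of 365 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Seconds of downtime per year permitted at the given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def describe(seconds: float) -> str:
    """Format a duration using the largest unit that fits."""
    if seconds >= 86400:
        return f"{seconds / 86400:.2f} days"
    if seconds >= 3600:
        return f"{seconds / 3600:.2f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.2f} minutes"
    return f"{seconds:.2f} seconds"


for pct in (90.0, 99.0, 99.9, 99.99, 99.999, 100.0):
    print(f"{pct:>7}% -> {describe(allowed_downtime_seconds(pct))} per year")
```

Five nines leaves a little over five minutes per year, and 100% leaves exactly zero, which is the whole problem.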

Solution 2:

Ask them to define 100%, how it will be measured, and over what time period. They probably mean as close to 100% as they can afford. Give them the costings.

To elaborate: over the years I've been in discussions with clients who had supposedly ludicrous requirements. In all cases they were actually just using language that wasn't precise enough.

Quite often they frame things in ways that appear absolute, like 100%, but on deeper investigation they are reasonable enough to do the cost/benefit analysis that is required once they are presented with costings alongside risk mitigation data. Asking them how they will measure availability is a crucial question. If they don't know, then you are in the position of having to suggest to them that this needs to be defined first.
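To show why the measurement period matters, here is a rough Python sketch. The 4-hour outage and the helper name are hypothetical. The same outage produces very different availability figures depending on the window it is measured over:

```python
def availability_percent(downtime_hours: float, period_hours: float) -> float:
    """Availability over a measurement period, as a percentage."""
    return 100 * (1 - downtime_hours / period_hours)


# Hypothetical: a single 4-hour outage, measured over different windows.
OUTAGE_HOURS = 4

for label, period_hours in (
    ("week", 7 * 24),
    ("30-day month", 30 * 24),
    ("year", 365 * 24),
):
    pct = availability_percent(OUTAGE_HOURS, period_hours)
    print(f"Measured over a {label}: {pct:.3f}%")
```

Whether planned maintenance counts as downtime, and where the measurement is taken from, are further details the client would need to pin down. The sketch ignores both.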

I would ask the client to define what would happen in terms of business impact/costs if the site went down in the following circumstances:

  • At their busiest hours for x hours
  • At their least busy hours for x hours

And also how they will measure this.

In this way you can work with them to determine the right level of '100%'. I suspect that by asking these kinds of questions they will be able to better determine the priorities of their other requirements. For example, they may want to pay for a certain level of SLA and compromise on other functionality in order to achieve it.
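The busiest-hours versus least-busy-hours comparison can be put into numbers with something like the Python sketch below. Every figure in it is a made-up placeholder; the real ones have to come from the client:

```python
def outage_cost(hours: float, loss_per_hour: float) -> float:
    """Business cost of an outage lasting `hours`."""
    return hours * loss_per_hour


def mitigation_pays_off(mitigation_cost: float, hours_avoided: float,
                        loss_per_hour: float) -> bool:
    """True if the outage cost avoided exceeds what the mitigation costs."""
    return outage_cost(hours_avoided, loss_per_hour) > mitigation_cost


# Hypothetical placeholder figures, not taken from any real client.
PEAK_LOSS_PER_HOUR = 5000.0
OFF_PEAK_LOSS_PER_HOUR = 200.0
MITIGATION_COST = 15000.0

for x in (1, 4, 8):
    peak = outage_cost(x, PEAK_LOSS_PER_HOUR)
    off_peak = outage_cost(x, OFF_PEAK_LOSS_PER_HOUR)
    worth_it = mitigation_pays_off(MITIGATION_COST, x, PEAK_LOSS_PER_HOUR)
    print(f"{x} h down: peak {peak:,.0f}, off-peak {off_peak:,.0f}, "
          f"mitigation worth it at peak: {worth_it}")
```

A linear loss per hour is itself an assumption. Reputational damage or contractual penalties may not scale that way.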

Solution 3:

Your clients are crazy. 100% uptime is impossible no matter how much money you spend on it. Plain and simple: impossible. Look at Google, Amazon, etc. They have nearly endless amounts of money to throw at their infrastructure, and yet they still have downtime. You need to deliver that message to them and, if they continue to insist, ask that they come back with reasonable demands. If they don't recognize that some amount of downtime is inevitable, then ditch 'em.

That said, you seem to have the mechanics of scaling/distributing the application itself in hand. The networking portion will need to involve redundant uplinks to different ISPs, getting an ASN and IP allocation, and getting neck-deep in BGP and real routing gear so that IP address space can move between ISPs if need be.
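As a back-of-the-envelope illustration of what redundant uplinks buy you, here is a Python sketch. It assumes each uplink fails independently, which is optimistic because real failures are often correlated, and the 99% per-uplink figure is hypothetical:

```python
def combined_availability(single: float, n: int) -> float:
    """Availability of n redundant components, any one of which suffices.

    Assumes failures are independent, which real-world uplinks rarely are.
    """
    return 1 - (1 - single) ** n


# Hypothetical: each uplink is up 99% of the time on its own.
SINGLE_UPLINK = 0.99

for n in (1, 2, 3, 4):
    print(f"{n} uplink(s): {combined_availability(SINGLE_UPLINK, n) * 100:.6f}%")
```

Each extra uplink adds nines, but the result only ever approaches 100%. It never gets there, and the cost keeps climbing.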

This is, quite obviously, a very terse answer. You haven't had experience with applications requiring this degree of uptime, so you really need to get a professional involved if you want to get anywhere close to the mythical 100% uptime.