Mathematically, how do I calculate an uptime percentage from a number of nodes and their respective uptime percentages?
Uptime is a slippery thing... If you want to calculate the availability of a service then it is simply
amount of time service is available
----------------------------------- x 100
amount of time that has passed
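As a quick illustration of that ratio, here is a minimal sketch. The function name and the 30-day window with 36 hours of downtime are made-up values for the example, not figures from the question.

```python
def availability_percent(available_seconds: float, elapsed_seconds: float) -> float:
    """Availability as a percentage: time available / time passed x 100."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return available_seconds / elapsed_seconds * 100


# Hypothetical measurement window: 30 days with 36 hours of downtime.
elapsed = 30 * 24 * 3600
downtime = 36 * 3600
window_availability = availability_percent(elapsed - downtime, elapsed)
print(f"{window_availability:.2f}%")
```

With these made-up numbers the service was unavailable for 36 of 720 hours, which is the 5% downtime used elsewhere in this thread.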
If you have a cluster providing the service, then the likelihood that the service becomes unavailable goes down, but the availability (uptime) calculation for the service stays the same.
The chance of one server being offline is (1 - 0.95). The chance of both servers being offline is (1 - 0.95) * (1 - 0.95) = 0.0025, and so on.
So, using your model and from a purely mathematical point of view, at least one of the servers should be up 99.75% of the time.
However, I'm not sure that such a mathematical model is the correct way to work out your potential uptime, as there are other factors that may affect it which are common to both servers. For example, the 95% might be because there is a power cut 5% of the time, which would affect BOTH servers, so having a cluster would make no difference.
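The arithmetic above can be checked with a few lines of Python. This is only a sketch under the same assumption the calculation makes, namely that the servers fail independently of each other; the function name is made up for the example.

```python
def at_least_one_up(availabilities):
    """Probability that at least one server is up, assuming independent failures."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)  # chance that this server is offline as well
    return 1.0 - all_down


two_nodes = at_least_one_up([0.95, 0.95])
print(f"{two_nodes:.4%}")  # 99.7500%
```

Passing a longer list, such as three servers at 0.95 each, extends the same calculation to larger clusters.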
This depends on why your servers are down 5% of the time. If you have power 95% of the time, but your servers are otherwise flawless, then a second server at the same location does not increase your uptime at all: if one goes down, both go down. This is an example of the failures being correlated. It's likely that at least some of your downtime is due to errors that affect all servers together (power...). But some of the downtime will be independent between servers. If you want to do it properly, you ought to deal with these things separately. So you want to work out the probability that server 1 does not have an independent error (p), that server 2 does not have an independent error (q), and that there is no systemic error that kills both (r). It would be relatively safe to assume that these errors are independent, and thus you could just multiply the probabilities together: pqr is the probability that both servers are up, and r(1 - (1 - p)(1 - q)) is the probability that at least one server is up.
The problem is, you can't use actual uptime data to give you values for p, q, and r, except that if you have just server 1 and it is up 95% of the time, then p*r = 0.95.
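To make the p, q, r model concrete, here is a minimal sketch, assuming the three failure sources are independent. It reports both the chance that both servers are up (p*q*r) and the chance that at least one is up, r * (1 - (1 - p)(1 - q)). The split of the 95% into r = 0.97 and p = q = 0.95 / 0.97 is purely hypothetical, since, as noted above, uptime data alone only fixes the product p*r.

```python
def cluster_probabilities(p: float, q: float, r: float):
    """Return (both servers up, at least one server up).

    p, q: probability that server 1 / server 2 has no independent error
    r:    probability that no systemic error takes out both servers
    Assumes the three failure sources are independent of each other.
    """
    both_up = p * q * r
    at_least_one_up = r * (1.0 - (1.0 - p) * (1.0 - q))
    return both_up, at_least_one_up


# Hypothetical split: systemic failures 3% of the time, with the rest of the
# single-server downtime independent, chosen so that p * r == 0.95.
r = 0.97
p = q = 0.95 / r
both, any_up = cluster_probabilities(p, q, r)
print(f"both up: {both:.4f}, at least one up: {any_up:.4f}")
```

Whatever values are chosen, the probability of at least one server being up can never exceed r, which is why a second server at the same location does not help against a shared power cut.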
First of all, the total availability or uptime of a cluster depends on how much of the cluster must be active for the whole cluster to be considered 'up'.
- Is one functioning machine enough? That would mean that any single machine can take the full load if needed.
- Do all of them need to be active at the same time? That is, there is no redundancy.
- Or perhaps two out of three online are sufficient? This would allow for a larger workload than the first case.
As you found out, the first two cases are quite simple to calculate. Let the probability of a single server being online at any given time be p = 0.95. Now, for three servers, the probability that they are all online at the same time is p^3 = 0.857375.
For the opposite case, where at least one machine should be active at a given time, it's easier to calculate by inverting the problem and looking at the probabilities of the machines being offline. The probability that a single machine is offline is q = 1 - p = 0.05, and hence the probability that they are all down at the same time is q^3 = 0.000125, giving probability 1 - q^3 = 1 - (1 - p)^3 = 0.999875 that at least one is up.
The 2 out of 3 case is slightly harder to calculate. There are four possible situations where at least two out of three servers are up: 1) ABC are up, 2) AB are up, 3) AC are up, 4) BC are up. The probabilities for these are, respectively, ppp, ppq, pqp and qpp. Since the cases are disjoint, the probabilities can be added together, giving a total A = p^3 + 3 p^2 q = 0.992750.
(This can be expanded to more machines. The factors are the well-known binomial coefficients, so counting the different cases by hand works mostly as an exercise.)
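The general "at least k out of n" case can be sketched in a few lines of Python using the binomial coefficients. This assumes every server has the same availability p and that the servers fail independently; the function name is made up for the example, and math.comb needs Python 3.8 or newer.

```python
from math import comb


def at_least_k_of_n_up(k: int, n: int, p: float) -> float:
    """Probability that at least k of n servers are up.

    Assumes each server is up with the same probability p and that
    the servers fail independently of each other.
    """
    q = 1.0 - p
    return sum(comb(n, i) * p**i * q**(n - i) for i in range(k, n + 1))


# The three cases discussed above, for three servers with p = 0.95.
for k in (1, 2, 3):
    print(f"at least {k} of 3 up: {at_least_k_of_n_up(k, 3, 0.95):.6f}")
```

For k = 1, 2 and 3 this should reproduce the values worked out by hand above (0.999875, 0.992750 and 0.857375).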
Of course, calculations like this are much easier to deal with by using a ready-made computer program... At least one online calculator can be found here:
http://stattrek.com/online-calculator/binomial.aspx
Entering the input values: probability of success = 0.95, number of trials = 3, number of successes = 2, we get the result "Cumulative Probability: P(X ≥ 2) = 0.99275". Some other related values are also given, and the online tool makes it easy to play with other numbers too.
And yes, all of the above assumes that the servers fail independently. That is, a) it ignores any problems affecting the cluster as a whole, and b) it assumes there is nothing like component aging that would make the servers likely to fail at or nearly at the same time.