WHEN to put the contingency plan into action in case of a main server failure?

One framework you might use is a pair of time windows, set in advance, for making this decision when the problem occurs. The end of the first window is a soft limit and the end of the second is a hard limit for when to switch over.

The soft limit is the first possible cut-over point. If you have been working on the problem but are no closer to solving it than when you started, switch at the soft limit. If you believe you are close to a fix at the soft limit, keep going until the hard limit. For example, the soft limit might be 5 minutes and the hard limit 8 minutes from the start of trying to fix the problem. At the hard limit, you switch over no matter what.
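The two-window rule can be sketched in a few lines of Python. This is only an illustration: the function name and the close_to_fix flag are made up for the example, and the 5 and 8 minute defaults are just the example figures from above, not recommendations.

```python
def should_switch(elapsed_minutes, close_to_fix, soft_limit=5.0, hard_limit=8.0):
    """Return True when it is time to cut over to the backup server.

    elapsed_minutes: time spent so far trying to fix the problem.
    close_to_fix: your own judgment call (hypothetical flag) on whether
        a fix looks near.
    soft_limit / hard_limit: example values only; pick your own.
    """
    # Past the hard limit: switch no matter what.
    if elapsed_minutes >= hard_limit:
        return True
    # Past the soft limit with no real progress: switch now.
    if elapsed_minutes >= soft_limit and not close_to_fix:
        return True
    # Otherwise keep working on the fix.
    return False


if __name__ == "__main__":
    for minutes, close in [(3, False), (5, False), (6, True), (8, True)]:
        print(minutes, close, should_switch(minutes, close))
```

Before the soft limit the function always says to keep fixing; between the two limits the answer depends entirely on your honest read of how close you are.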

You will have to decide the length of the windows for yourself. You also have to decide whether the windows include the time that passes before you actually start looking at the problem.

Of course, you could also just wing it and do what seems best at the time; it is probably fine not to plan every last detail.


It's all about costs. What does it cost to try to fix the problem for X minutes/hours? Is that less than the cost of switching to a backup server, losing some data, and eventually moving back to the main production server?

Once the cost of trying to fix exceeds the cost of switching, the decision is made: switch. Until you have a handle on the costs, how can you define a "disaster"?
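As a rough sketch of that comparison in Python, assuming downtime costs a flat amount per minute and the cost of switching can be summed up front (switch effort, lost data, moving back). Every name and figure here is hypothetical; real costs are rarely this linear.

```python
def break_even_minutes(downtime_cost_per_minute, switch_cost,
                       data_loss_cost, failback_cost):
    """Minutes of fixing after which switching becomes the cheaper option.

    Assumes (for illustration only) that downtime cost grows linearly
    and that the total cost of switching is a fixed sum.
    """
    total_switch_cost = switch_cost + data_loss_cost + failback_cost
    return total_switch_cost / downtime_cost_per_minute


def fixing_costs_more(elapsed_minutes, downtime_cost_per_minute,
                      switch_cost, data_loss_cost, failback_cost):
    """True once the cost of trying to fix exceeds the cost of switching."""
    cost_of_fixing = elapsed_minutes * downtime_cost_per_minute
    total_switch_cost = switch_cost + data_loss_cost + failback_cost
    return cost_of_fixing > total_switch_cost


if __name__ == "__main__":
    # Made-up numbers: 200 per minute of downtime, 500 to switch,
    # 300 for lost data, 400 to move back to the main server.
    print(break_even_minutes(200, 500, 300, 400))
    print(fixing_costs_more(7, 200, 500, 300, 400))
```

With numbers like these, the break-even point is one reasonable way to pick the hard limit instead of guessing it.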