What's your checklist for when everything blows up?

Users can't get to their e-mail, the CEO can't get to the company's home page, and your pager just went off with a "911" code. What do you do when everything blows up?


Solution 1:

The first answer is to stay calm! I learned the hard way that panicking often just makes things worse. Once that's achieved, the next thing is to actually ascertain what the problem is. Complaints from users and managers will be coming at you from all angles, telling you what THEY cannot do, but not what the problem is.

Once you know the problem, you can start planning the fix and start giving your angry users a timescale!

Solution 2:

Stay Calm

Don't freak out. Breathe! (From the diaphragm, it helps.) If you've studied meditation, that can help too.

When faced with extreme stress your body will go into a fight-or-flight mode, because your body thinks it's in a life-or-death situation. At this time your body will actually pump less blood to some parts of your brain, lessening functions like reasoning. This effectively lowers your IQ as instinct, instead of rationality, begins to dominate your brain functions. If you've ever been in or witnessed a heated argument you may recognize these symptoms as people's emotions flare and rationality takes a holiday. Later, when people have a chance to cool down, they will be more likely to accept having made a mistake or having been wrong, and are more capable of seeing the other side, but in the heat of the moment, less so.

Maintaining your composure and keeping your wits about you will keep your brain functioning at full capacity and ensure you make rational decisions based on evidence and reason rather than emotion and fear.

Triage

Efficient application of limited resources to achieve the greatest benefit at the lowest cost is supremely important here. Decide as early as possible which things have to be fixed RIGHT NOW, which can wait a little while (hours, days), and which can wait indefinitely. Also learn to recognize when something is unsalvageable and not worth saving (e.g. half the router melted: even if it's your only one, you can't save it, so buy a new one and get it on site post-haste, or find something that can fill the gap temporarily).
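
To make the three buckets concrete, here is a minimal Python sketch of that sorting step. Everything in it is an assumption for illustration: the `Problem` fields, the `many_users` threshold, and the rule that revenue-blocking problems come first are hypothetical, and your own criteria will differ.

```python
from dataclasses import dataclass
from enum import IntEnum


class Urgency(IntEnum):
    NOW = 0    # has to be fixed right now
    SOON = 1   # can wait a little while (hours, days)
    LATER = 2  # can wait indefinitely


@dataclass
class Problem:
    name: str
    blocks_revenue: bool      # hypothetical criterion
    users_affected: int       # hypothetical criterion
    salvageable: bool = True  # False means stop trying to repair it


def triage(problem: Problem, many_users: int = 50) -> Urgency:
    """Assign a bucket using made-up thresholds; adjust to your environment."""
    if problem.blocks_revenue:
        return Urgency.NOW
    if problem.users_affected >= many_users:
        return Urgency.SOON
    return Urgency.LATER


def plan(problem: Problem) -> tuple:
    """Return (urgency, action). Unsalvageable gear is replaced, not repaired."""
    action = "repair" if problem.salvageable else "replace or work around"
    return triage(problem), action


def work_queue(problems: list) -> list:
    """Order problems by urgency, then by number of users affected."""
    return sorted(problems, key=lambda p: (triage(p), -p.users_affected))


problems = [
    Problem("wiki down", blocks_revenue=False, users_affected=5),
    Problem("mail down", blocks_revenue=False, users_affected=400),
    Problem("router melted", blocks_revenue=True, users_affected=400,
            salvageable=False),
]
for p in work_queue(problems):
    urgency, action = plan(p)
    print(f"{urgency.name:5} {action:22} {p.name}")
```

Note that salvageability changes the action, not the urgency: the melted router is still the first thing in the queue, but the work is getting a replacement rather than attempting a repair.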

Retain Situational Awareness

Don't allow your attention to be trapped by some interesting problem or by something you don't quite understand yet. Keep focused on the big picture and on getting the most important things working.

Use the Scientific Method

Form a hypothesis. Determine how you would test this hypothesis. Gather data to test the hypothesis. Look for disconfirming data as well. Refine your hypothesis and repeat the cycle as many times as necessary until you have enough confidence in your hypothesis to take action.

Be Pragmatic

Now is not the time for dogma. It's OK to take a few shortcuts here and there when recovering from disaster. This is essentially accruing technical debt. At many companies, catastrophic failure means catastrophic loss of revenue. It's better to get things running, even if on a shaky footing, than to dilly-dally and risk the livelihood of your company. As always, judgement is supremely important here. Sometimes it makes sense to prop up a box fan pointed at a server rack, sometimes it doesn't.

Look After Yourself

How long have you been working on this emergency? When was the last time you had a drink of water? When was the last time you ate? How long have you been awake? Don't burn yourself out just because there's an emergency. Take the time to keep hydrated, fed, and rested (in case it's a long, multi-day slog).

Recruit Help

There are almost certainly many talented folks in your company who are both motivated and capable of lending help. Be wary of getting too many people running around and causing trouble for each other, though. Also be wary of annoying people by putting them through a "fire drill". Find people who want to help already, get them working on targeted tasks, and make sure people are communicating with each other.

Communicate

Communication is critical. Nothing is as scary as the unknown. When people know nothing other than that something is broken, an empty statement that it'll be back up in X hours is only mildly reassuring (even less reassuring after X hours have passed and things are still broken). The pressures at play can steer you toward giving overly optimistic WAG time estimates, but this is the wrong course. Don't just say you're working on it, and don't just say things will be fixed by X time. Be open, show your process, detail your progress and your setbacks. Provide insight into the problem, your process in tracking it down, and your plan for fixing things (though don't drown people in minutiae). Show that the problem is not intractable, show that things will be made right eventually, and show that there are competent people on the problem. These things are more reassuring than baseless time-schedule promises.

Solution 3:

Don't Panic.

Solution 4:

Step 0. Check that it's not your monitoring system that is at fault.
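
One way to do that check is to probe the service yourself, from somewhere other than the monitoring host, and compare the results with what the alert claims. The Python sketch below is only an illustration under assumptions: `tcp_probe` and `assess_alert` are hypothetical helper names, and a successful TCP connection is a weak stand-in for "the service works".

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def assess_alert(monitor_says_down: bool, independent_probes: list) -> str:
    """Compare the monitor's claim with probes made from other vantage points."""
    if not monitor_says_down:
        return "no alert"
    if not independent_probes:
        return "unverified: probe the service yourself first"
    if all(independent_probes):
        return "suspect the monitoring system"
    if not any(independent_probes):
        return "outage confirmed"
    return "partial outage or network path problem"
```

For example, if the pager says mail is down, you might run `tcp_probe` against the mail server (the host name and port are whatever applies in your environment) from your workstation and from one machine outside the office network, then pass both results to `assess_alert`. If every independent probe succeeds, look at the monitoring host before waking anyone else up.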

Solution 5:

Immediately book a flight to a non-extradition country