What's the major outage you've been part of?

Solution 1:

I'm 'part of' outages almost every single day (I monitor WAN links for 44 sites). The 'little ones' are the ones that last less than 5 minutes and usually go 'unnoticed' (the NOC only tracks outages longer than 5 minutes, for some reason). I contact the site to see whether it was an internal issue, and I check the router logs whenever the cause is 'unknown'.
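
For illustration only, here is a minimal sketch of the kind of watcher that behaves the way that NOC rule does: it pings each site's WAN router and only escalates once an outage has lasted longer than 5 minutes. The site names, addresses, and intervals are hypothetical, not the poster's actual tooling.

```python
# Minimal sketch (hypothetical, not the poster's actual setup): watch WAN
# links and only escalate outages that exceed a duration threshold,
# mirroring the "ignore anything under 5 minutes" rule described above.
import subprocess
import time
from datetime import datetime, timedelta

SITES = {"site-01": "10.0.1.1", "site-02": "10.0.2.1"}  # hypothetical router IPs
ESCALATION_THRESHOLD = timedelta(minutes=5)
POLL_INTERVAL = 30  # seconds between sweeps

down_since = {}  # site name -> datetime the current outage started

def link_is_up(address: str) -> bool:
    """Send one ICMP echo; return True if the router answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

while True:
    now = datetime.now()
    for site, address in SITES.items():
        if link_is_up(address):
            if site in down_since:
                outage = now - down_since.pop(site)
                print(f"{now:%H:%M} {site} recovered after {outage}")
        else:
            started = down_since.setdefault(site, now)
            if now - started >= ESCALATION_THRESHOLD:
                # This is where you'd page the NOC or call the site proactively.
                print(f"{now:%H:%M} {site} down for {now - started}, escalating")
    time.sleep(POLL_INTERVAL)
```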

I find communication is key (and that's an understatement!) when dealing with outages. DO NOT WAIT TO BE CALLED while you're troubleshooting or trying to find out what exactly happened. Make sure you communicate that you know they're down and that you're working on it. Give them a time frame for when you will get back to them with updates on the situation (an estimated time to resolution). Don't leave them hanging and thinking you've forgotten about them; make sure they KNOW someone is looking at their problem. You call them, so they don't have to call you.

Thankfully, the longest a site has been down on my watch is 7 hours (within a 10am-5pm work day). It would have been a few hours shorter if it hadn't been for the poor communication between the parties involved. In short, the issue wasn't escalated properly, and because everyone assumed 'someone was working on it', the issue took (from the site's perspective) forever to get resolved.

Solution 2:

We had a steam heating pipe that ran through our data center rupture. Very hot, with condensation and asbestos insulation all over the place. Power was cut for weeks during the cleanup.

OK, my group's stuff was BGP-paired and load balanced between multiple data centers. Some fraction of our users saw a 30-second freeze before their current transaction was transferred. Many of the other projects saw outages of up to several days, with everyone putting in lots of overtime to help everyone else.

Lessons learned: Do your continuity planning first, then build your system to support your conclusions:

  • If you can't tolerate a week of downtime, plan and practice your transfer. Instead of primary/failover sites, have blue/gold sites and rotate between them every two weeks to ensure that everything stays updated and available.
  • If you can't tolerate half an hour to a day of downtime, load balance between active sites (a minimal health-check sketch follows this list). You will spend less time and effort setting that up than you would spend under pressure trying to do a recovery against the clock.
  • If you can't tolerate minutes of downtime, you need to put a lot of effort into real high availability. Your best bet is to hire an expert consultant.
  • Just to finish out the hierarchy, if you can't tolerate seconds of downtime, you need specialized hardware as well as specialized design. You had better be the expert.
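
As a concrete illustration of the "load balance between active sites" point, here is a minimal sketch of the kind of health probe a load balancer or an operator runs continuously in an active/active setup. The site names and health-check URLs are hypothetical placeholders, and the real failover in the story above was done with BGP, not with a script like this.

```python
# Minimal sketch, not a production HA setup: probe each active site the way a
# load balancer health check would, so you always know which sites can take
# traffic. The site names and URLs are hypothetical.
import urllib.error
import urllib.request

ACTIVE_SITES = {
    "dc-east": "https://dc-east.example.com/healthz",
    "dc-west": "https://dc-west.example.com/healthz",
}
TIMEOUT = 5  # seconds; a stuck site should fail the check fast, not hang it

def site_is_healthy(url: str) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False

healthy = [name for name, url in ACTIVE_SITES.items() if site_is_healthy(url)]
unhealthy = [name for name in ACTIVE_SITES if name not in healthy]

print(f"serving traffic: {healthy or 'NONE'}")
if unhealthy:
    # With active/active sites, losing one site is an alert, not an outage:
    # the surviving sites keep serving while you repair the failed one.
    print(f"pull these from rotation and investigate: {unhealthy}")
```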

Solution 3:

I was attending a job interview at a company that happened to be in the middle of a complete network outage in their 50+ user office. I solved it within minutes, and got to meet their current sysadmin and the IT support company he'd called in because he couldn't solve it - they'd spent all morning trying to work out what was going wrong.

The previous guy had installed two wireless routers in bridge mode, and plugged them both into the wired network. They were barely in range of each other, so they had a loop in their network which came and went as the reception varied.

Needless to say, I got the job, and I implemented some change management logging as soon as I started.

Solution 4:

I experienced a week-long outage of our entire server network. We dealt with it by building a redundant network to prevent that same problem in the future, and while the outage was ongoing we ran off an old server we had set up at a remote location. We've learned to always have a backup plan.