Documenting an outage for a post-mortem review

We had a rather serious outage this past week that affected several services and put us in breach of our SLA with customers. Now that everything has been resolved, I am conducting a post-mortem review.

From this review, I would like to produce an internal document that describes the outage, its effects, our response, and the resolution. I want it to follow a fairly standard form that we can reuse for future incidents. I have included my thoughts below, but what other items should be included? If this were a security-related incident, what would you add?

  • Summary: Executive-level summary of the event.
  • Affected Services
  • Impact: What was the impact on our users and SLAs? Was there a cost in dollar terms, missed transactions, lost customers, etc.?
  • Outage Duration: Listed per affected service if the durations varied.
  • Cause: Including primary and secondary causes.
  • Resolution
  • Timeline of events: Notifications, contact with external vendors, customer notifications, responses, etc.
  • Problems with our response: Did anything in our response to the outage not go as planned? Were the correct people notified? Did vendors meet their contracted obligations?
  • Preventative measures to take: How do we prevent this outage from occurring again, or reduce its impact?
  • Detection Method: How well did we detect this outage, and how do we improve detection in the future?
  • Changes to make in future outage responses
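
To make the form reusable, the items above could be captured as a structured record so that every post-mortem carries the same fields. The following is only a rough Python sketch (3.9+); the class, field, and method names (`PostMortem`, `ServiceOutage`, `duration_by_service`) are hypothetical and do not come from any existing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ServiceOutage:
    """One outage window for one affected service (hypothetical structure)."""
    service: str
    start: datetime
    end: datetime

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


@dataclass
class PostMortem:
    """One record per incident; the fields mirror the list above."""
    summary: str
    impact: str
    cause: str
    resolution: str
    outages: list[ServiceOutage] = field(default_factory=list)
    timeline: list[tuple[datetime, str]] = field(default_factory=list)
    response_problems: list[str] = field(default_factory=list)
    preventative_measures: list[str] = field(default_factory=list)
    detection_method: str = ""
    response_changes: list[str] = field(default_factory=list)

    def affected_services(self) -> list[str]:
        """Distinct service names, sorted for stable output."""
        return sorted({o.service for o in self.outages})

    def duration_by_service(self) -> dict[str, timedelta]:
        """Total downtime per service, summing multiple windows if any."""
        totals: dict[str, timedelta] = {}
        for o in self.outages:
            totals[o.service] = totals.get(o.service, timedelta()) + o.duration
        return totals
```

Deriving the Affected Services and Outage Duration items from the recorded outage windows, rather than typing them in separately, keeps those two sections from drifting out of sync with each other.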

Please keep each answer to one item and its explanation, so that this post can be updated with the top-voted answers.


Solution 1:

Although it could be covered under Preventative measures to take, I would recommend a separate Detection method section in which you note what the true symptoms were and how you could detect the problem faster if it happens again, ideally using automation.
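
As one illustration of automated detection, a monitor could raise an alert after several consecutive failed health checks. This is a minimal Python sketch under assumed conditions: the function name `first_alert_index` and the default threshold of three failures are hypothetical choices, not a recommendation for any particular monitoring product.

```python
from typing import Iterable, Optional


def first_alert_index(check_results: Iterable[bool], threshold: int = 3) -> Optional[int]:
    """Return the index of the check at which an alert would fire, or None.

    check_results: True means the health check passed, False means it failed.
    threshold: consecutive failures required before alerting (assumed policy).
    """
    consecutive_failures = 0
    for index, passed in enumerate(check_results):
        # A passing check resets the streak; a failing one extends it.
        consecutive_failures = 0 if passed else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return index
    return None
```

Replaying the check history from the actual outage through a function like this shows how much earlier an alert would have fired at different thresholds, which is useful input for the Detection method section.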

Solution 2:

Looks good. I would only add the following:

Effects/Consequences: What were the consequences of the outage? Who was affected, which SLAs were violated (if any), and were there any knock-on effects?