Documenting an outage for a post-mortem review
We had a rather serious outage this past week that affected several services and put us out of our SLA with customers. Now that everything has been resolved, I am conducting a post-mortem review.
From this review, I would like to come up with an internal document that describes the outage, its effects, our response and the resolution. I want to come up with a fairly standard form for future reuse. I have included my thoughts below, but what other items should be included? If this were a security-related incident, what would you add?
- Summary: Executive level summary of event.
- Affected Services
- Impact: What was the impact on our users and SLAs? Was there a cost in dollar terms, missed transactions, lost customers, etc?
- Outage Duration: For each affected service if there were variances
- Cause: Including primary and secondary causes
- Resolution
- Timeline of events: Notifications, contact with external vendors, customer notifications, responses, etc.
- Problems with our response: Did things not go as planned with our response to the outage? Correct people notified? Did vendors meet their contracted obligations?
- Preventative measures to take: How do we prevent this outage from occurring again or reduce its impact?
- Detection Method: How well did we detect this outage and how do we improve detection in the future?
- Changes to make in future outage responses
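If the form is going to be reused, it may help to keep it as a structured record rather than free text, so that per-service durations and SLA breaches can be computed instead of typed in by hand. The following is only a rough sketch of that idea: the class names, field names, the one-hour SLA limit and the sample data are all made up for illustration and are not taken from any existing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ServiceOutage:
    """Outage window for one affected service (hypothetical structure)."""
    service: str
    started: datetime
    restored: datetime

    @property
    def duration(self) -> timedelta:
        return self.restored - self.started


@dataclass
class PostMortem:
    """One field per item in the list above; names are assumptions."""
    summary: str
    impact: str
    causes: list[str]
    resolution: str
    outages: list[ServiceOutage] = field(default_factory=list)
    timeline: list[tuple[datetime, str]] = field(default_factory=list)
    response_problems: list[str] = field(default_factory=list)
    preventative_measures: list[str] = field(default_factory=list)
    detection_method: str = ""
    response_changes: list[str] = field(default_factory=list)

    def durations(self) -> dict[str, timedelta]:
        # Outage duration for each affected service.
        return {o.service: o.duration for o in self.outages}

    def sla_breaches(self, allowed: timedelta) -> list[str]:
        # Services whose downtime exceeded the allowed window.
        return [o.service for o in self.outages if o.duration > allowed]


# Invented sample data, purely to show how the record would be filled in.
report = PostMortem(
    summary="Example summary",
    impact="Example impact",
    causes=["primary cause", "secondary cause"],
    resolution="Example resolution",
    outages=[
        ServiceOutage("web", datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 11, 30)),
        ServiceOutage("mail", datetime(2024, 1, 1, 10, 15), datetime(2024, 1, 1, 10, 45)),
    ],
)

breached = report.sla_breaches(timedelta(hours=1))
```

A security-related incident could probably reuse the same record with a few extra fields, but which ones is exactly what I am asking about.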
Please try to keep each answer to one item and its explanation, so that this post can be updated with the top-voted answers.
Solution 1:
Although it could be covered under "Preventative measures to take", I would recommend having a "Detection method" section where you note what the true symptoms were and how you could detect the problem (faster) if it happens again, ideally using automation.
Solution 2:
Looks good. I would only add the following:
Effects/Consequences: What were the consequences of the outage? Who was affected, which SLAs were violated (if any), and were there any knock-on effects?