How to manage communication during application downtime?

I've had a lot of experience lately with application downtime, from both vendors and my own applications. This has got me thinking, and as best I can tell from searching, there isn't really a good or standard way of managing customer communication during downtime incidents.

I've seen this handled in a lot of ways, from the "blame everyone but us" approach to the "we screwed up and we're sorry" approach.

So my questions are... when you screw up with an app and cause downtime:

  1. Do you admit fault immediately? (Should you, legally?)
  2. How much info do you give the customer regarding what went wrong? ("An issue" vs. "A code syntax error in one of our SQL queries")
  3. Do you come back with a follow-up prevention plan, or just leave it at "this has been resolved"?
  4. Do you provide real-time updates? How often? Via Twitter or public-facing website?

Any other best practices for this that you've found successful?


Solution 1:

Here is what I do:

  • State very clearly what the consequences are (right now and in the immediate future). Highlight likely permanent consequences or lack thereof (data loss, loss of employee-hours).
  • Keep the tone very neutral. Do not spend energy on blame/guilt. Ideally this conveys "I want to give you information but my attention is also needed elsewhere".
  • Your notification will be forwarded to a lot of people, so make sure your CEO understands the gist within the first half paragraph. Usually I provide an 'executive summary'. Technical details can provide background information to other technical people.
  • Provide contact details (preferably for someone who has the time in the heat of the downtime) for further questions, and ask for patience in the same sentence (this often works).
  • Promise updates when the situation changes.
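If you send these notices often, it can help to keep the structure in a small template so nothing gets forgotten under pressure. The following is only a minimal sketch of the checklist above; the class name, field names, wording and the example address are all made up for illustration and are not from any particular tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentNotice:
    """Hypothetical container for one downtime notification."""
    summary: str                       # executive summary, readable by a CEO
    consequences: List[str]            # current and likely permanent impact
    contact: str                       # someone with time to answer questions
    technical_details: str = ""        # optional background for technical readers
    extra_lines: List[str] = field(default_factory=list)

    def render(self) -> str:
        # Summary first, so forwarded copies still make sense at a glance.
        lines = ["Summary: " + self.summary, "", "Consequences:"]
        lines += ["  - " + item for item in self.consequences]
        # Contact details and the request for patience go in the same sentence.
        lines += [
            "",
            "Questions: " + self.contact
            + " (please bear with us while we work on this).",
            "We will send an update when the situation changes.",
        ]
        # Technical background comes last and is optional.
        if self.technical_details:
            lines += ["", "Technical background: " + self.technical_details]
        lines += self.extra_lines
        return "\n".join(lines)


notice = IncidentNotice(
    summary="The ordering system is unavailable; no data loss is expected.",
    consequences=["New orders cannot be placed right now"],
    contact="ops@example.com",
    technical_details="A database node is being restored from backup.",
)
print(notice.render())
```

The ordering is the only real design choice here: neutral summary and consequences at the top, technical background at the bottom.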

Send updates when there is good news, before office closing time ("all staff will continue through the night" - account for timezones if necessary) and again around office opening time.

When the issue is resolved (for any definition of that word), send:

  • A summary including timing of the consequences
  • The actions/changes taken in the short term and planned for the future ("lessons learned"), based on:
  • Technical root cause analysis

Keep any calls for blame, guilt or lynching in separate mails, preferably after some cooldown time.

Do not commit to anything during the downtime unless you are really, really sure you can deliver. Somehow two separate "bad news" situations are worse than one long one.

I prefer to use a medium where a notification is pushed on every message (mail, Twitter, ...).

Solution 2:

The most important thing I've found, both as a service provider and a service user, is proactive responsibility. It's not about what you say, but when (how soon) you say it.

If you are notified that a problem happened and was fixed (or is being worked on), it's a lot better than discovering the problem yourself and trying to contact the vendor to figure out what in the world is going on. It also helps with the blame game and saves a lot of troubleshooting time (is it us or is it them?).

As far as detail goes, I find that giving a simple summary of what happened is nice unless users specifically request more information. There will be some people who always want as much detail as they can get, but most people just want things to work (even if they are highly technical).

Lastly, being able to explain what steps you have taken so that it won't happen again goes a long way toward future goodwill and trust.