Our incident postmortem template

August 15, 2019
Engineering

In the final part of our SRE process series, we share our internal postmortem template with some pointers on the review process, what to include in each section, and some best-practice examples. This follows our recent postmortem guidelines post, which looks at things to keep in mind before putting pen to paper. If you haven’t read it already, make sure to check that out first!

Writing as a two-step process

At Hosted Graphite, our customers are highly technical and appreciate a high level of detail. So for public-facing documents, we put considerable effort into getting the language and level of detail right. That said, the initial focus should be on compiling the bulk of the document's content in a complete and accurate way. Finessing the language comes later.

With this in mind, your first draft of the postmortem should focus on pulling together the information to include and checking its accuracy. It should reflect what happened, explain why, and include all the lessons we have (collectively) learned from the incident. Once you've gotten positive feedback on the first draft, it’s time to work on the language and tone of the document to make it ready for public consumption (assuming the goal is to publish it externally). This separates concerns: one review focuses on content and accuracy, and a separate review on language and tone.

Reviewing the document

Ideally, your first draft should be reviewed by all SREs involved in the incident, who may have their own contributions and suggestions for lessons learned. At least two SREs should sign off on your first draft before you move on to the second pass, which also needs to be approved by at least two SREs before it's considered ready to be published.

Once your postmortem has been reviewed, you can (if this is a public postmortem) copy and paste its content into Statuspage. Keep in mind that formatting adjustments might be necessary, so be extra careful that the formatting is right before publishing.

Make sure to share it with the rest of SRE (and possibly the wider Hosted Graphite team). A postmortem document is only useful if the lessons it contains are shared with and understood by the rest of the team, so ask everyone to take some time to read the postmortem and add comments to it.

The sections

This is a fairly simple template as we'd much rather focus on the process and content than on structure. What follows is a quick explanation of each section with some tips and examples of what to include.
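For reference, the bare skeleton of the template is just the following sections, in order:

Summary
Background
What happened?
What went well?
What went badly?
What are we going to do in the future?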

Summary

This is just a quick summary/headline of what the incident was: what happened and what the impact was. Usually no longer than one or two sentences. After reading this, a person unfamiliar with the incident should know what the impact was and why they should keep reading.

Background

Keep in mind: postmortems are technical documents that are sometimes read by an audience that's not fully technical. Sometimes even a technical audience could have trouble keeping up with a postmortem document if they're lacking the necessary context. In this section, try to provide any context that may be necessary to fully understand the rest of the document. For example, a postmortem we wrote might reference our internal canaries, so a quick explanation of what they are is useful (and bonus points for linking to our blog post about them).

Whenever possible, link to public documentation and/or a blog post. If we find ourselves explaining the same thing over and over it might be worth posting something to our blog that we can reference in the future.

What happened?

Here is where we tell the story of the incident. Everything that happened, including a timeline of events, goes here: things that happened without our knowledge (e.g., a traffic spike we weren't aware of), alerts firing, and actions we took to mitigate the issue. For long-running incidents, the timeline sometimes gets a bit long. In that case it makes sense to move it to its own section so that people can refer to it if they want more detail, while keeping this section uncluttered.

Something valuable to add in this section is the specific information we had when we took a particular action. It's one thing to say that we added more capacity to a cluster and it didn't help; it's far more useful to say that we saw increased load and latency on cluster X, interpreted that as a capacity issue, and decided to increase capacity in the cluster. If later in the timeline that assumption is proven incorrect, we'll know why we took that specific action and can work on making sure we have more complete information next time.

What went well?

Not everything is terrible (I mean, I guess...at least that's what people tell me!). Even during the worst incident ever, some things will go well. Maybe our monitoring was quick to react and alerted us just in time to prevent further impact. Maybe our service discovery routed around impacted nodes, so the impact was only added latency instead of lost data.

For most incidents, there are circumstances that prevented them from being much worse. It's important to identify these circumstances, as they will increase our understanding of how the system works as much as (if not more than) figuring out the areas that made us particularly vulnerable.

What went badly?

There will be things we're unhappy about: how the incident was handled, or where our tools/services let us down. This is a good place to point out things we know could have gone better. It’s also a good trigger for identifying follow-up tasks for things to improve in the future.

There will also be factors that contributed to making the incident worse than we'd otherwise expect it to be. We want to identify what these are, not just so we can fix them, but so we can deepen our understanding of how our system works and the things that make it less resilient.

What are we going to do in the future?

This is often the section our users care about most. What we've experienced is all well and good, but what have we actually learned, and more importantly, what are we going to do to make sure our users’ metrics don't go away again? The most important result of a postmortem is what we learn about our systems and how our people interact with them, but our users deserve to see those lessons turn into specific actions.

With this in mind, any actions we plan to take in the future should have an open ticket to make sure they don't fall through the cracks, and our internal document should reference that ticket. For public postmortems, we don't include the full ticket but make reference to it (something like internal ticket: SRE-420). This helps us track the progress of any follow-up items and reinforces the rule that everything in this section needs a ticket attached to it. Those tickets should be clearly marked as incident follow-up (and should also link to the original incident on our status page).

Example:

We will be sprinkling glitter over our servers to make them go faster (internal ticket: SRE-420)

A clear sign of a bad postmortem is one where we don't know what to put in this section, or think that we don't have anything to include. A postmortem that results in no follow-ups just means we haven't been looking hard enough. If we claim to have learned valuable lessons from an incident but can't come up with anything to do differently next time, maybe we didn't learn the right lessons. Remember, "have better luck next time" or "don't make mistakes next time" is not a valid strategy.

Things to keep in mind

Overall, the tone of our postmortems should be similar to the tone of our status page communications (after all, they are written by the same team with roughly the same audience in mind). That means anything covered in our Status page guidelines regarding the kind of language we use also applies to postmortems.

When reviewing a postmortem, try to focus more on content than language. It's always important to get the language right (in particular for public documents), but what matters most is the process that resulted in the document and the lessons and actions derived from it. Remember, when you review a postmortem document you're reviewing the process that led to it as much as the final document itself.

Other resources

There are too many great resources out there to list, but the following should be considered required reading (or watching!) on the topic:

Chapter 15 of the SRE book

"The infinite hows" - John Allspaw

Incidents as we Imagine Them Versus How They Actually Are - John Allspaw (video)

How complex systems fail - Richard Cook

The Multiple Audiences and Purposes of Post-Incident Reviews

Some Observations On the Messy Realities of Incident Reviews

Hindsight and sacrifice decisions




Fran Garcia

SRE at Hosted Graphite.
