
Post Outage Review


No matter how well you plan for something, you will inevitably have something go wrong that simply couldn’t be catered for. One day you will find yourself dealing with an unplanned outage. Drawing from your bag of holding, you resolve the problem, but now you have to provide answers to The Business - and it’s not happy. Not many people enjoy dealing with The Business, so it’s best to provide answers promptly and prevent the situation from ever occurring again. After all, The Business is a fickle beast with multiple heads that appears to do its own nonsensical thing most of the time.

The beast tends to be out for blood: which lamb caused this outage? Its many eyes dart around looking for the weakest, most guilt-ridden lamb to ravage and eviscerate from this world. Armed with three items, you confront this twisted beast as it contorts and shrieks before you.

  • The Glinting Sword of Fact.
  • The Defiant Shield of Time.
  • The Righteous Scroll of Prevention.

These items are forged through the fires of The Outage and the ritual of the Post Outage Review. They exist to provide answers, and to prevent or mitigate further unplanned outages. There are many ways to perform the ritual, but the overarching goal is to identify flaws within processes. Once the flaws have been identified, steps can be taken to mitigate them or prevent them from occurring again.

Incidentally, I just got back from my journey to acquire coffee, during which one of the staff tripped the circuit breaker, cutting the power. From what I could observe, water hit a power outlet while someone was cleaning. The initial response from the barista was along the lines of “Which one of you idiots did something?” It’s telling that the knee-jerk reaction was to ask who did it, before shifting to what happened in order to resolve the current outage.

Sword and Shield

During the outage you might make several calls, update the project lead on the problem status, make changes to systems, and wait for propagation. As you do these things, record each action you take along with a timestamp. This is the dataset that the Post Outage Review will be based on. Once the outage is resolved, the data can be compiled (or simply recorded in an already compiled form as you go), forming a definitive list of everything that was done, and when.

The primary person dealing with the outage should then flesh out the incident, so that others reading it can grasp what was going on. This involves taking the actions and forming proper sentences with context instead of just “10:06am - ran script”.
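As a rough illustration of the record keeping described above, here is a minimal Python sketch. The names `TimelineEntry`, `OutageTimeline`, `record`, and `compile` are hypothetical, invented for this example, and the sample actions and timestamps are made up. It only shows the shape of the data: a terse action and a timestamp captured during the outage, with context added afterwards.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEntry:
    timestamp: datetime  # when the action was taken
    action: str          # terse note jotted down during the outage
    context: str = ""    # fleshed out afterwards by the primary responder


class OutageTimeline:
    """Hypothetical helper: collects actions now, compiles them later."""

    def __init__(self):
        self.entries = []

    def record(self, action, timestamp=None):
        # Default to the current time so recording stays quick mid-outage.
        entry = TimelineEntry(timestamp or datetime.now(), action)
        self.entries.append(entry)
        return entry

    def compile(self):
        # Sort by time so the list reads as everything that was done, and when.
        ordered = sorted(self.entries, key=lambda e: e.timestamp)
        lines = []
        for e in ordered:
            line = f"{e.timestamp:%H:%M} - {e.action}"
            if e.context:
                line += f" ({e.context})"
            lines.append(line)
        return "\n".join(lines)


# Made-up sample data, purely for illustration.
timeline = OutageTimeline()
timeline.record("called project lead", datetime(2024, 1, 1, 10, 2))
entry = timeline.record("ran script", datetime(2024, 1, 1, 10, 6))
entry.context = "hypothetical: restarted the stuck service"
report = timeline.compile()
print(report)
```

A shared document or a chat channel with timestamps does the same job; the point is that each action is captured with its time while it happens, and the context is added once the fire is out.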

Once the data flourishes into information, it can be passed over to whoever is appropriate to interrogate it. This could be the IT team, or it might be the CIO and lead engineers; it mostly depends on the company.

With The Glinting Sword of Fact and The Defiant Shield of Time, you can now defend yourself against the fiery breath of the beast, and strike back when an opening presents itself. But these items alone cannot slay the beast.

The Scroll

The review can now begin, as all of the required information lies before you. This is the real meat of the ritual, the part that provides actionable items to make your processes more robust. Ultimately you end up with a list of flaws in your processes, and ideas can be formulated to counter them.

One technique to expose flaws is the Five Whys. Iteratively asking why will lead you to discover the problems within your process.
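To make the iteration concrete, here is a small Python sketch of a Five Whys chain. The function name `five_whys` is hypothetical, and the answers are invented guesses loosely based on the coffee shop anecdote, not facts about what actually happened there. Each answer becomes the subject of the next "why", and the last answer is the candidate process flaw.

```python
def five_whys(problem, answers):
    """Pair each answer with the question it responds to.

    Each answer becomes the subject of the next "why"; the final
    answer is the candidate process flaw to take into the review.
    """
    chain = []
    current = problem
    for answer in answers:
        chain.append((f"Why is it that {current}?", answer))
        current = answer
    return chain


# Invented chain for illustration only; these answers are assumptions.
problem = "the power went out"
answers = [
    "the circuit breaker tripped",
    "water hit a power outlet",
    "the outlet was splashed during cleaning",
    "the outlet sits uncovered next to the area being cleaned",
    "the cleaning procedure says nothing about protecting outlets",
]

chain = five_whys(problem, answers)
for question, answer in chain:
    print(f"{question} Because {answer}.")

root_flaw = chain[-1][1]
```

Note that none of the answers in this chain is a person. The chain ends at a gap in a procedure, which is the kind of flaw the review is meant to surface.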

Once the process flaws have been listed, you can begin brainstorming ways to either prevent or mitigate each one. A process might need to be completely reworked, or only slightly adjusted. It helps to have some kind of management involved, as changes should actually be implemented, not just listed.

The Righteous Scroll of Prevention should be invoked when the beast begins to run out of steam; it will be the final blow required to best the beast. There’s always the pacifist option, though - for this we’ll still need The Righteous Scroll of Prevention, but another item is required.

The Secret Potion of Explanation

Now that everything has been identified and explored, you can present it in a business-friendly way. A good example of what you’re trying to achieve is Amazon’s write-up of their S3 Outage. It covers what happened, where the problems were, and the actions they plan on taking to prevent it from happening again. It’s got a good blend of “here’s what happened”, “we’re sorry”, and “here’s how we’re going to do better”.

Instead of invoking The Righteous Scroll of Prevention, you can take its essence and condense it into a vial, sprinkle some ‘business words’ in there, add a caring twist, and shake gently. This produces The Secret Potion of Explanation. Now, provide this to the beast - note that the beast should simmer in rage before cooling off completely. Your results may vary; if symptoms persist, run away.