After-action reports: post-incident investigations
When something unexpected happens within the digital operations remit, software engineers put on their deerstalker hats and wax their fussy little moustaches (metaphorically, of course). It's their time to play detective: sifting the evidence and writing the reports that explain the recent IT incident. But unlike in the tales of a hat-wearing Sherlock Holmes or a hirsute Hercule Poirot, cliff-hanger endings are not encouraged in software engineering.
Helping the business recover from an incident means uncovering the culprit and ensuring that the 'crime' cannot happen again: solving the mystery, putting the world to rights, and closing the book on that chapter.
The biggest 'marks' (i.e., the most compelling victims) are business-critical applications. Outages, or disruption of any kind, are seriously bad for business. Such incidents compel the engineering team to uncover root causes, remediate the issue, and change processes so the problem cannot recur, all as fast as possible.
Righting the routine wrongs on the mean streets of digital operations city
One trap a would-be detective can fall into, in the style of the maligned Inspector Lestrade, is to accidentally obliterate the evidence by restarting a downed device or redeploying a service. Whilst this meets the immediate need to get the business operational again, it does nothing for resiliency: no lessons are learned, and nothing stops the error from repeating.
- Lesson: It's important not to be hasty. Improving operations requires an understanding of causes before taking action, so the system can be remediated and genuinely improved.
In fiction, the detective is often pressed to find a culprit, any culprit, fast. Digital operations teams face the same tension: the urgent demand to restore services quickly versus the best practice of gathering the evidence needed to identify an incident's code-level root causes. Whereas the hardboiled sleuth relies on a tarnished conscience and an overactive mind to do the right thing, engineers rely on monitoring tools.
However, not all monitoring solutions support this use case. Observability tools may well uncover configuration or code-level issues, yet developers often need more granular data than such tools provide: most do not collect debug-level data, because it isn't required either for alerting or for restoring services.
- Lesson: Clues like stack traces, top resource-consuming database queries, and heap, thread, and TCP dumps often act like fingerprints, leading to the guilty party. Operations teams must use the right tools to gather these before they are lost, and whilst services are being restored.
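To make that fingerprint-gathering concrete, here is a minimal sketch of what 'evidence first, restart second' might look like for a JVM-based service. It assumes standard JDK and OS tooling (jstack, jmap, tcpdump, timeout) is present on the host; the process ID, output directory, and 30-second capture window are purely illustrative.

```python
# Minimal sketch, assuming a JVM service and standard JDK/OS tools on the host.
# The PID, output directory and 30-second capture window are illustrative only.
import pathlib
import subprocess
import time


def capture_evidence(pid: int, out_dir: str = "/var/tmp/incident-evidence") -> pathlib.Path:
    out = pathlib.Path(out_dir) / time.strftime("%Y%m%dT%H%M%S")
    out.mkdir(parents=True, exist_ok=True)

    # Thread dump: is the service stuck, blocked or deadlocked?
    threads = subprocess.run(["jstack", str(pid)], capture_output=True)
    (out / "threads.txt").write_bytes(threads.stdout)

    # Heap dump: what is consuming memory? (This can pause the JVM briefly.)
    subprocess.run(["jmap", f"-dump:live,format=b,file={out / 'heap.hprof'}", str(pid)])

    # Short packet capture: is the service still talking to its dependencies?
    subprocess.run(["timeout", "30", "tcpdump", "-i", "any", "-w", str(out / "net.pcap")])

    # Only once the evidence is persisted is it safe to restart the service.
    return out
```

The point is the ordering: the dumps are taken and written to persistent storage before anyone reaches for the restart button.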
The wise engineer learns to accept only the evidence in front of them, and not to leap to conclusions. Here are some simple rules for following in the footsteps of the best investigators:
- Making a hasty determination of root cause can create further problems that lengthen an incident and waste time.
- Document the steps taken during investigation and remediation. Someone else may need to follow the evidence trail.
- Log incidents and outcomes. A lack of insight and detail into whether issues have occurred before hides useful context from tomorrow's responders.
When the mean streets get meaner
Two advances hinder our engineering investigations, adding more drama to these pressures: containerised applications and container orchestration. Such microservice architectures provide faster ways to restore availability, such as redeploying a pod, but that same redeploy can wipe out the evidence. Containerised environments also offer fewer debugging utilities, because developers want to keep their images as small as possible.
Ideally, to balance the need for knowledge against the need for action, operations teams will use technology that captures and keeps evidence instantly, even as it restores business operations.
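As a hedged illustration for a Kubernetes environment, the sketch below bags the most perishable clues before a pod is redeployed. It simply shells out to kubectl; the pod, namespace, and container names are placeholders, and the final step assumes the cluster supports ephemeral debug containers.

```python
# Minimal sketch for a Kubernetes environment. Pod, namespace and container
# names are placeholders; assumes kubectl is on the PATH and ephemeral debug
# containers are enabled in the cluster.
import subprocess


def snapshot_pod(pod: str, namespace: str = "production") -> None:
    kubectl = ["kubectl", "-n", namespace]

    # Logs from the previous (crashed) container instance, before a redeploy
    # or restart discards them.
    with open(f"{pod}-previous.log", "wb") as f:
        subprocess.run(kubectl + ["logs", pod, "--previous"], stdout=f)

    # Recent events: OOMKills, failed probes, image-pull errors and the like.
    with open(f"{pod}-events.txt", "wb") as f:
        subprocess.run(
            kubectl + ["get", "events", "--field-selector", f"involvedObject.name={pod}"],
            stdout=f,
        )

    # Attach an ephemeral debug container, since slim production images ship
    # few (if any) debugging utilities. 'app' is a placeholder container name.
    subprocess.run(kubectl + ["debug", "-it", pod, "--image=busybox:1.36", "--target=app"])
```

Once those artefacts are safely stored, the pod can be redeployed without losing the trail.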
What to do when you find the 'smoking gun' (debug state capture)
When the crime takes place (during the incident):
- The authorities are called (the issue is detected)
- The evidence is collected for analysis (debug evidence is retrieved)
- Evidence is put into custody, ready for the later courtroom drama (debug evidence is securely persisted)
- The crime scene police report back to HQ (a link is posted in PagerDuty, Jira, Slack, etc.)
- The scene is put back in order for regular business (auto-remediation runs)
After the incident:
- The investigator receives the call from dispatch (developer picks up a ticket)
- The keen intellect considers all the clues (debug evidence is reviewed)
- The crucial element is found (the bug is identified)
- The wily detective sets a snare for the culprit (a fix gets developed)
- The culprit is ensnared and taken to court (the fix is shipped)
- The detective writes up their case files so future investigators can draw on them (a post-mortem or incident retrospective is published)
Runbooks can be set to trigger when an issue is detected: a rule can capture debug-level evidence, persist it to safe storage, and deploy known fixes. It's the equivalent of hiring a savant detective who can walk in, solve the crime, and share the next steps in a flash: a Benoit Blanc (the Knives Out sleuth) for digital operations.
- Pick a solution with a wide spread of integrations that connect with services across all environments.
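To make the runbook idea concrete, here is a hedged sketch of such a rule: evidence is captured and persisted first (as in the earlier sketches), responders are pointed at it, and only then is the known fix applied. The Slack webhook URL and the restart command are hypothetical placeholders rather than any particular product's API.

```python
# Hedged sketch of a runbook-style rule: when an alert fires, the debug
# evidence has already been captured and persisted; this step notifies the
# responders, then applies a known remediation. The webhook URL and the
# restart command are hypothetical placeholders.
import json
import subprocess
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


def notify_responders(service: str, evidence_url: str) -> None:
    # Slack incoming webhooks accept a simple JSON payload with a "text" field.
    payload = json.dumps({"text": f"Incident on {service}: debug evidence at {evidence_url}"})
    request = urllib.request.Request(
        SLACK_WEBHOOK, data=payload.encode(), headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)


def on_alert(service: str, evidence_url: str) -> None:
    # 1. Evidence has been captured and persisted; evidence_url points at it.
    # 2. Tell the responders where to find it, so the trail survives the fix.
    notify_responders(service, evidence_url)
    # 3. Only now apply the known remediation and restore the service.
    subprocess.run(["systemctl", "restart", service])
```

The ordering is the whole point: evidence first, notification second, remediation last, so the case file still exists when the developer picks up the ticket.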
Elementary, my dear Watson: it was all so obvious!
Incidents are stressful, but they only get longer and more stressful when best practices are not followed. A software engineer, like a sleuth, must keep calm and be methodical, taking notes as they go. And just as Holmes or Poirot wouldn't leave their apartments without a magnifying glass and tweezers, having the right tools at hand makes it far more likely that an incident will be resolved, and will stay resolved next time.
When all is stress and strain, and the rain is coming down over the crime scene, our engineers must always capture the state of the application and environment during the incident, along with the steps being taken to remediate it. Only with that evidence and the right tools can those 'little grey cells', that famous deductive reasoning, come into play.