
Latest Posts

5 Best Practices on Nailing Postmortems

Reading about postmortem best practices can be quite different from seeing them in action. Postmortems are like snowflakes; no two ever look the same. No definitive template will succeed in every situation, but some practices and procedures can help when writing them. Here are five practices that can boost the effectiveness of your postmortems, with examples of postmortems or procedures that demonstrate them.

Building Reliability Through Culture with Veteran Google SRE, Steve McGhee

Which of the following three scenarios do you experience the most when a new incident occurs? For many teams, incidents unfortunately fall into scenario 1, with some classes of incidents catching them by surprise. It’s astonishing that despite the vast amount of time we spend working on and thinking about our systems, we seem to have very little control over them. If we can’t predict where the next incidents will come from, then we will be forever stuck in a reactive cycle of repair.

Improving Postmortem Practices with Veteran Google SRE, Steve McGhee

For many SREs, Google’s 99.999% availability seems like an untouchable dream. If anything, getting out of pager hell is already worth celebrating with all your coworkers, friends, and family on the moon. How can teams climb out of it? How can you get to a stage where you have time to proactively prevent incidents, and enter a mental state of calm and control? The rope out of pager hell is woven from a thorough and rigorous postmortem process.