
Blameless

Building Reliability Through Culture with Veteran Google SRE, Steve McGhee

Which of the following three scenarios do you experience the most when a new incident occurs? For many teams, incidents unfortunately fall into scenario 1, with some classes of incidents catching them by surprise. It’s astonishing that despite the vast amount of time we spend working on and thinking about our systems, we seem to have very little control over them. If we can’t predict where the next incidents will come from, then we will be forever stuck in a reactive cycle of repair.

Improving Postmortem Practices with Veteran Google SRE, Steve McGhee

For many SREs, Google’s 99.999% availability seems like an untouchable dream. If anything, getting out of pager hell is already worth celebrating with all your coworkers, friends, and family on the moon. How can teams climb out of it? How can you get to a stage where you have time to proactively prevent incidents, and enter a mental state of calm and control? The rope out of pager hell is woven from a thorough and rigorous postmortem process.