Incident Management (class SRE implements DevOps)
In the previous video, Liz and Seth discussed how to make systems observable and how observability helps us diagnose failing systems, but they didn't cover what to do when an incident grows beyond what one person can handle. In this video, you'll learn about the most important part of the incident management process: humans.
In the stressful moments of a system failure, it is important to define clear, concise roles for all the humans involved in an incident. With too few people, responders quickly become overloaded with work; with too many, work may be duplicated (i.e., too many hands on the keyboard). Learn how SREs effectively manage incidents with clearly defined roles and responsibilities such as the operations lead, planning lead, communications lead, logistics lead, and more. Seth and Liz also discuss techniques for managing long-running and exponentially complex incidents.