San Mateo, CA, USA
Mar 24, 2020 | By Hannah Culver
No, it won’t be possible to continue operating business as usual. For the foreseeable future, teams across the world will be dealing with cutbacks, infrastructure instability, and more. However, with SRE best practices, your team can embrace resilience and adapt through this difficult time.
Mar 19, 2020 | By Emily Arnott
On-call: you may see it as a necessary evil. When responding to incidents quickly can make or break your reputation, keeping people across the team ready to react at all hours of the day is a necessity, but it often creates immense stress and eats into personal lives. It isn’t a surprise that many engineers have horror stories about the difficulty of carrying a pager around the clock. But does on-call have to be so dreadful? We think not.
Mar 19, 2020 | By Jacob Warren
Detailed and specific description of impact? Check. In-depth root cause analysis? Check. Clearly defined and easy to follow resolution? Check. Postmortems present an incredible learning opportunity, despite the inherent cost of time and effort. They ensure an incident is documented, that all contributing factors are understood, and that effective preventative actions have been put in place to reduce the likelihood or impact of recurrence.
Mar 16, 2020 | By Emily Arnott
In response to recent events, many organizations are implementing social distancing programs such as remote work. Successfully transitioning to remote work does come with challenges, but the right practices and attitudes can make it much less painful (and safer for you than heading into the office). We like to think of incidents as “unplanned investments,” and a sudden switch to remote work could be considered an unplanned investment of its own.
Mar 12, 2020 | By Hannah Culver
With remote work becoming more common, and distributed teams the norm, incident response has become even trickier. Years ago, everyone would gather in a war room and sort through the issue together, boots on the ground. Now, things have shifted. Remote work is only projected to increase, and teams need to be able to adapt in order to resolve incidents quickly and efficiently, even if team members are a thousand miles away. But how can we make great incident response a reality?