Incident Management


How SREs Can Embrace Resilience During Crises

Blameless recently had the privilege of hosting SRE leaders Liz Fong-Jones, Dave Rensin, and Alex Hidalgo to discuss how SREs can embrace resilience during the pandemic and how the principles of SRE intersect with global trends. The transcript below has been lightly edited; if you’re interested in watching the full panel, you can do so here.


Our new Incident Management Experience: The Panopta Incident Hub

As a monitoring platform built by systems engineers, for systems engineers, we wanted to create an incident management experience that brings together all of the important tools that systems engineers use on a daily basis. From alert to resolution, the Panopta Incident Hub offers efficient, streamlined solutions for incident management tasks and simplifies collaboration among team members.


Q&A with Alex Hidalgo on SLOs

Alex Hidalgo is a Site Reliability Engineer at Squarespace, and he’s currently writing a book called Implementing Service Level Objectives for O’Reilly Media. The first three chapters of the book are available now through O’Reilly’s early access program. I had a chance to read those chapters and ask Alex some questions about service level objectives and reliability. Thanks, Alex, for sharing your knowledge.


How AI Helps IT Ops Pros Work Remotely

While the COVID-19 pandemic reshapes work processes, digitalization is allowing businesses to adjust to the fluid situation. The deployment of AI in IT operations is a good case study. People’s social needs must be cultivated; otherwise, they become unhappy and work less effectively. Beyond that, many tasks, including those in IT operations, require social interaction to be executed successfully.

Coronavirus: From the Office to Working From Home

Coronavirus (COVID-19) is greatly impacting the lives of organizations, employees and stakeholders. With the outbreak’s rising impact, more employees are migrating to remote, work-from-home practices as a means of achieving “social distancing.” However, inevitable challenges are emerging with remote workdays. Obstacles include, but aren’t limited to, employee isolation, diminished productivity and poor team communication or collaboration.


Modernizing and Consolidating Your Monitoring Without Losing It...

The current days of remote work and “IT Ops from home” may or may not be here to stay, but they definitely reinforce the need for consolidating and modernizing our monitoring. The challenges that multiple siloed tools create for understanding the big picture are only exacerbated by having just one screen to look at when monitoring our IT from our kitchen table.


Best Practices for Pragmatic Incident Command

The goal of this piece is to provide some practical advice on how teams can coordinate and respond to complex, dynamic incidents. After all, incidents are unplanned investments that surface valuable learnings for improvement. For the purposes of this blog, we define incidents as situations where there is a need for coordination among multiple people working on the same problem. There will be incidents where this is not the case.