Incident Management


Evolving Blameless' SRE Practices with Amy Tobey

At Blameless, we drink our own champagne, and aim to adopt a mindset of continuous learning to foster resilience. We believe that the adoption of SRE practices is one of the best ways to get there. Like most organizations, our early efforts to implement SRE were imperfect. However, through hard work, teamwork, and investing in what we believe is the most important feature (reliability), we have made significant changes to how we do SRE. And we’re getting better at it every day.


Crisis Management Automation for the Entire Organization with Dispatch - BSidesSF Preview

Managing security incidents can be a stressful job. You are dealing with many questions all at once. What’s the scope? Who do I need to engage? How do I manage all of this? As an Incident Commander (IC), you have many responsibilities. You’re responsible for driving an incident to resolution as quickly as possible, creating the resources necessary to document, collaborate, and communicate while helping identify, engage, and orient the right people.


How Feature Flags Support Incident Response and Management

Every year I get on some technical kick. These fascinations usually end up being some sort of design pattern or process. In 2020, I’m really into feature flags – a big fan. Feature flags are a relatively basic implementation of something we understand well: that user functionality is composed of various technical components.


Importance of After-Hour Response Teams

Exceptional customer service is key in the world of IT, where something could go wrong at any given moment. This level of support translates into business retention, client satisfaction, and higher success rates and profits. In this post, I’ll introduce a hypothetical scenario in which “MSP Team A” provides 24×7 after-hours support to a valuable client.


AIOps: What's in a name?

Since the term ‘AIOps’ came into use in the monitoring sector a couple of years ago, there has been much confusion about what it means. We hear from users asking if they need it – a difficult question given that the answer depends on how you define it. Since there isn’t a broadly accepted definition, a range of vendors now market their products as AIOps offerings, even though these products cross subsectors and may not be directly competitive.


Structuring Your Teams for Software Reliability

How well positioned is your team to ship reliable software? What are the different roles in engineering that impact reliability, and how do you optimize the ratio of software engineers to SREs to DevOps within teams? These questions can be hard to answer in a quantifiable way, but projecting different scenarios using systems thinking can help. Will Larson’s blog post Modeling Reliability does just that, and serves as inspiration for this article.


The 5 Central Tenets of a Great On-Call Culture

Working for VictorOps, and now Splunk, has allowed me to experience the on-call process from several distinct angles. And, working in a customer-facing role, I’ve witnessed the full spectrum of DevOps maturity – from downright DevOps mastery to the kinds of nightmare scenarios that haunt the dreams of on-call professionals.


Got Game? Secrets of Great Incident Management

When his phone wakes him at two in the morning, operations engineer Andy Pearson knows it’s bad news. There’s a major server problem, and hundreds of client websites are down. Automated monitoring checks detected the outage within seconds and paged the on-call engineer. This time, it’s Pearson in the hot seat. Pearson quickly confirms the issue is real and escalates it to his boss, tech lead Lewis Carey.