SRE

The latest news and information on Site Reliability Engineering and related technologies.

Postmortems Now Called Retrospectives in Blameless

Something big happened at Blameless this month: our “Postmortem” feature was renamed “Retrospective”. If you’re a naysayer, you may be thinking that this seems trivial. Different teams call it different names anyway, so why bother making the change? First, thank you for reading our blog; I hope you read this one through to the end. Now, allow me to explain our reasoning and why we’re excited about this update.

Alert Fatigue in SRE: What It Is & How To Avoid It

Wondering about alert fatigue? We describe what it is, how it affects software development teams, and how to avoid it. What is alert fatigue? Alert fatigue is the phenomenon of employees becoming desensitized to alert messages because of the overwhelming volume of alerts and the number of false positives they receive. The risk with alert fatigue is that important information will be overlooked or ignored.

Quickly troubleshoot application errors with Error Reporting

Are you familiar with the four golden signals of Site Reliability Engineering (SRE): latency, traffic, errors, and saturation? Whether you’re a developer or an operator, you’ve likely been responsible for collecting, storing, or analyzing the data associated with these concepts. Much of this data is captured in application and infrastructure logs, which provide a rich history of what is happening behind the scenes in your workloads.

Traditional vs Modern Incident Response

An incident is an event (network outage, system failure, data breach, etc.) that can lead to loss of, or disruption to, an organization's operations, services or functions. Incident response is an organization’s effort to detect, analyze and correct the hazards caused by an incident. When incident response is mentioned, it usually relates to security incidents. The terms incident response and incident management are sometimes used more or less interchangeably.

Service Level Objectives: Where do we start?

Most of us have heard about SLOs and what they mean, but have found it hard to start adopting them across our teams. This video aims to demystify the journey of adopting SLOs, with examples of how several large companies like Disney adopted them. Whether you are new to the DevOps/SRE world or an experienced developer, you will learn a fresh approach to making software more reliable!

Everything you need to know about Squadcast and Microsoft Teams Integration

Microsoft Teams is one of the most versatile tools for providing collaboration and chat solutions to numerous enterprises. We at Squadcast understand how important Microsoft Teams can be for your organization. Hence, we bring you this blog on the Squadcast-Microsoft Teams integration, which explains how this integration can help with improved incident management, effective collaboration and a lot more.

Top 13 Site Reliability Engineer (SRE) Tools

The role and responsibilities of a site reliability engineer (SRE) may vary depending on the size of the organization, and so do the tools a site reliability engineer uses. For the most part, a site reliability engineer is focused on multiple tasks and projects at one time, so for most SREs, the various tools they use reflect their ever-evolving responsibilities.