SRE

The latest news and information on Site Reliability Engineering and related technologies.

Site Reliability Chats (Mar 2, 2022)

Mar 2, 2022 By Gremlin In Gremlin

Welcome to the first episode of Site Reliability Chats with your hosts Jason Yee @gitbisect and Julie Gunderson @julie_gund.

Postmortems Now Called Retrospectives in Blameless

Mar 2, 2022 By Blameless In Blameless

Something big happened at Blameless this month: our “Postmortem” feature was renamed “Retrospective”. To the naysayers, I suppose you’re thinking, “This seems trivial. Different teams call it different names anyway, so why bother making the change?” First, let me say thank you for reading our blog; I hope you read this post through to the end. Now, allow me to explain our reasoning and why we’re excited about this update.

Alert Fatigue in SRE: What It Is & How To Avoid It

Mar 1, 2022 By Emily Arnott In Blameless

Wondering about alert fatigue? We describe what it is, how it affects software development teams, and how to avoid it. What is alert fatigue? Alert fatigue is the phenomenon of employees becoming desensitized to alert messages because of the overwhelming volume and the number of false positives they receive. The risk with alert fatigue is that important information will be overlooked or ignored.

Quickly troubleshoot application errors with Error Reporting

Feb 28, 2022 By Eyamba Ita In Google Operations

Are you familiar with the four golden signals of Site Reliability Engineering (SRE): latency, traffic, errors, and saturation? Whether you’re a developer or an operator, you’ve likely been responsible for collecting, storing, or analyzing the data associated with these concepts. Much of this data is captured in application and infrastructure logs, which provide a rich history of what is happening behind the scenes in your workloads.

Traditional vs Modern Incident Response

Feb 24, 2022 By Kristijan Mitevski In Squadcast

An incident is an event (network outage, system failure, data breach, etc.) that can lead to loss of, or disruption to, an organization's operations, services or functions. Incident response is an organization’s effort to detect, analyze and correct the hazards caused by an incident. Most commonly, when incident response is mentioned, it relates to security incidents. Sometimes incident response and incident management are used more or less interchangeably.

SRE Tools (All of the Tools Your Team Needs)

Feb 24, 2022 By Myra Nizami In Blameless

Wondering about SRE Tools? We explain the best tools for every step of the SRE development process.

Incident Management Metrics | Choosing KPIs that Matter

Feb 22, 2022 By Noor-ul-Anam Ruqayya In Blameless

Wondering about incident management metrics? We explain what incident management metrics are, how to track them, and what to do with the information.

Service Level Objectives: Where do we start?

Feb 22, 2022 By Last9 In Last9

Most of us have heard about SLOs and what they mean but have always found it hard to start adopting them across our teams. This video demystifies the journey of adopting SLOs, with examples of how several large companies like Disney adopted them. Whether you are new to the DevOps/SRE world or an experienced developer, you will learn a fresh approach to making software more reliable!

Everything you need to know about Squadcast and Microsoft Teams Integration

Feb 21, 2022 By Vishal Padghan In Squadcast

Microsoft Teams is one of the most versatile tools for providing collaboration and chat solutions to enterprises. We at Squadcast understand how important Microsoft Teams can be for your organization. Hence, we bring you this blog post on the Squadcast-Microsoft Teams integration, which explains how the integration can help improve incident management, support effective collaboration, and a lot more.

Top 13 Site Reliability Engineer (SRE) Tools

Feb 20, 2022 By Jacob Hall In Dotcom-Monitor

The role and responsibilities of a site reliability engineer (SRE) may vary depending on the size of the organization, and so do the tools an SRE uses. For the most part, a site reliability engineer is focused on multiple tasks and projects at one time, so for most SREs, the various tools they use reflect their ever-evolving responsibilities.
