SRE

The latest News and Information on Service Reliability Engineering and related technologies.

How Retrospective Data Enhances Reliability Insights

Jul 21, 2022 By Emily Arnott In Blameless

When things go wrong, we try to learn for the next time. Every incident should be a learning opportunity to make your system more reliable in the future. Luckily, with Blameless Reliability Insights, you can see patterns in incidents at a glance, right out of the box. The ability to tag incidents makes reliability data even more helpful by letting you collect granular details about reliability, especially as they pertain to your unique business needs.

Why SREs Need to Embrace Chaos Engineering

Jul 20, 2022 By xMatters In xMatters

Reliability and chaos might seem like opposite ideas. But, as Netflix learned in 2010, introducing a bit of chaos—and carefully measuring the results of that chaos—can be a great recipe for reliability. Although most software is created in a tightly controlled environment and carefully tested before release, the production environment is harsher and much less controlled.

Top 12 Site Reliability Engineering (SRE) Tools

Jul 20, 2022 By Eyal Katz In Lightrun

Ben Treynor Sloss, then VP of Engineering at Google, coined the term “Site Reliability Engineering” in 2003. Site Reliability Engineering, or SRE, aims to build and run scalable and highly available systems. The philosophy behind Site Reliability Engineering is that developers should treat errors as opportunities to learn and improve. SRE teams constantly experiment and try new things to enhance their support systems.

Incident Response Platform: What Is It & Do You Need One?

Jul 19, 2022 By Myra Nizami In Blameless

Looking into incident response platforms? We discuss what an incident response platform is, what tasks it handles, and the benefits of having one.

Promoted to SRE Advocate: A Dream Turned Reality

Jul 14, 2022 By Matt Davis In Blameless

I get chills thinking about a line from the first film adaptation of Roald Dahl's Charlie and the Chocolate Factory. Gene Wilder, as Wonka, nearly whispers it to Charlie, as if it were secret information: "We are the music makers, and we are the dreamers of dreams." For me, the quote (taken from a poem by Arthur O'Shaughnessy) is austere: we are the creators of what we create, and what we create becomes what we are.

Monitoring Your Platform From Multiple Locations

Jul 14, 2022 By Andrei Danilov In Rootly

Mature start-ups and scale-ups create wonderful and challenging environments for engineers. As the product matures and the brand becomes successful, the user base generally grows and, for some companies, it grows in places they might not expect. As that happens, new challenges arise for engineers. One of them is easy to guess: keeping the product available across different regions of the world.

Amazon OpenSearch + Squadcast Integration: Routing Alerts Made Easy

Jul 12, 2022 By Vishal Padghan In Squadcast

Developers often find comfort in embracing open-source software for numerous reasons. One of the most important is the freedom to use that software wherever and however they wish. Amazon OpenSearch is an open-source search and analytics suite derived from Elasticsearch. It lets you perform interactive log analytics and real-time application monitoring with ease.

7 ways tagging incidents can teach you about system health

Jul 12, 2022 By Emily Arnott In Blameless

One of the most powerful ways to prepare for future incidents is to study and learn from patterns in past incidents. Blameless Reliability Insights highlights these patterns for you, with out-of-the-box dashboards that automatically collect and present all types of statistical information about your incidents.

Blameless Reliability Insights: FUA (Follow Up Action) Statuses

Jul 7, 2022 By Blameless In Blameless

Engineering teams use the Reliability Insights feature in Blameless to understand reliability in a holistic way. In addition to tracking incident data, you can keep tabs on how well teams and workflows are operating. For example, one of the best ways to maximize value from Reliability Insights is to build reports that reflect how your team stays on task, communicates, and assigns responsibilities. In this series, we'll walk you through the most common reports we see reliability teams using and referring to regularly.

View Video

Blameless Reliability Insights: How to Build Custom Reports

Jul 7, 2022 By Blameless In Blameless

Engineering teams use the Reliability Insights feature in Blameless to understand reliability in a holistic way. In addition to tracking incident data, you can keep tabs on how well teams and workflows are operating. For example, one of the best ways to maximize value from Reliability Insights is to build reports that reflect how your team stays on task, communicates, and assigns responsibilities. In this series, we'll walk you through the most common reports we see reliability teams using and referring to regularly.

View Video