SRE

The latest News and Information on Service Reliability Engineering and related technologies.

How Retrospective Data Enhances Reliability Insights

When things go wrong, we try to learn for the next time. Every incident should be a learning opportunity to make your system more reliable for the future. Luckily, with Blameless Reliability Insights, you can see patterns in incidents at a glance, right out of the box. In fact, the ability to tag incidents makes reliability data even more helpful by allowing you to collect granular details about reliability, especially as they pertain to your unique business needs.

Why SREs Need to Embrace Chaos Engineering

Reliability and chaos might seem like opposite ideas. But, as Netflix learned in 2010, introducing a bit of chaos—and carefully measuring the results of that chaos—can be a great recipe for reliability. Although most software is created in a tightly controlled environment and carefully tested before release, the production environment is harsher and much less controlled.

Top 12 Site Reliability Engineering (SRE) Tools

Ben Treynor Sloss, then VP of Engineering at Google, coined the term “Site Reliability Engineering” in 2003. Site Reliability Engineering, or SRE, aims to build and run scalable and highly available systems. The philosophy behind Site Reliability Engineering is that developers should treat errors as opportunities to learn and improve. SRE teams constantly experiment and try new things to enhance their support systems.

Promoted to SRE Advocate: A Dream Turned Reality

I get chills thinking about a line from the first film adaptation of Roald Dahl's Charlie and the Chocolate Factory. Gene Wilder, as Wonka, nearly whispers it to Charlie, as if it is secret information: We are the music makers, and we are the dreamers of dreams. For me, the quote (taken from a poem by Arthur O'Shaughnessy) is austere: We are the creators of what we create, and what we create becomes what we are.

Monitoring Your Platform From Multiple Locations

Mature start-ups and scale-ups create wonderful and challenging environments for Engineers. As the product they’re creating matures and the brand becomes a successful one, the user base generally starts growing, and, for some companies, in places they might not expect it to grow. As that happens, new challenges arise for Engineers. One of these challenges is pretty straightforward to guess: keeping a particular product available throughout different regions of the world.

Amazon OpenSearch + Squadcast Integration: Routing Alerts Made Easy

Developers often find comfort in embracing open-source software for numerous reasons. One of the most important is the freedom to use that software anywhere and however they wish. Amazon OpenSearch is an open-source search and analytics suite derived from Elasticsearch. It lets you perform interactive log analytics and real-time application monitoring with ease.

7 ways tagging incidents can teach you about system health

One of the most powerful ways to prepare for future incidents is to study and learn from patterns in past incidents. Blameless Reliability Insights highlights these patterns for you, with out-of-the-box dashboards that automatically collect and present all types of statistical information about your incidents.

Blameless Reliability Insights: FUA (Follow Up Action) Statuses

Engineering teams use the Reliability Insights feature in Blameless to understand reliability in a holistic way. In addition to tracking incident data, you can keep a pulse on how well teams and workflows are operating. For example, one of the best ways to maximize value from Reliability Insights is to build reports that reflect how your team stays on task, communicates, and assigns responsibilities. In this series, we'll walk you through the most common reports we see reliability teams using and referring to regularly.

Blameless Reliability Insights: How to Build Custom Reports

Engineering teams use the Reliability Insights feature in Blameless to understand reliability in a holistic way. In addition to tracking incident data, you can keep a pulse on how well teams and workflows are operating. For example, one of the best ways to maximize value from Reliability Insights is to build reports that reflect how your team stays on task, communicates, and assigns responsibilities. In this series, we'll walk you through the most common reports we see reliability teams using and referring to regularly.