SRE

The latest news and information on Site Reliability Engineering and related technologies.

Top 10 Reasons For A Site Reliability Platform

A system’s reliability is one of the most important things engineers should care about. Reliable systems keep customers happy and organizations profitable, so investing in processes and tools that ensure reliability can be critical to a company's success. Site reliability platforms are a popular choice for monitoring and observing software services because they make it easier to respond to and solve application problems.

Promoted to SRE Advocate: A Dream Turned Reality

I get chills thinking about a line from the first film adaptation of Roald Dahl's Charlie and the Chocolate Factory. Gene Wilder, as Wonka, nearly whispers it to Charlie, as if it were secret information: "We are the music makers, and we are the dreamers of dreams." For me, the quote (taken from a poem by Arthur O'Shaughnessy) is austere: we are the creators of what we create, and what we create becomes what we are.

Monitoring Your Platform From Multiple Locations

Mature start-ups and scale-ups create wonderful and challenging environments for engineers. As the product matures and the brand becomes successful, the user base generally starts growing, and, for some companies, it grows in places they might not expect. As that happens, new challenges arise for engineers. One of these challenges is easy to guess: keeping the product available across different regions of the world.

How To Minimise Alert Fatigue In SRE

Alert fatigue occurs when people become desensitized to the overwhelming number of alerts they receive and are expected to respond to. Even though these alerts are typically easy to respond to, it is the sheer number of them that ultimately causes fatigue. The higher the number of alerts, the more likely employees are to start ignoring them and to miss an important alert, leading to bigger consequences.

Amazon OpenSearch + Squadcast Integration: Routing Alerts Made Easy

Developers often find comfort in embracing open-source software for numerous reasons. One of the most important is the freedom to use that software anywhere and however they wish. Amazon OpenSearch is an open-source search and analytics suite derived from Elasticsearch. It lets you perform interactive log analytics and real-time application monitoring with ease.

7 ways tagging incidents can teach you about system health

One of the most powerful ways to prepare for future incidents is to study and learn from patterns in past incidents. Blameless Reliability Insights highlights these patterns for you, with out-of-the-box dashboards that automatically collect and present all types of statistical information about your incidents.

Custom Reliability Insights Reports: Follow Up Action Items

Engineering teams use the Reliability Insights feature in Blameless to understand reliability in a holistic way. In addition to tracking incident data, you can keep a pulse on how well teams and workflows are operating. For example, one of the best ways to maximize value from Reliability Insights is to build reports that reflect how your team stays on task, communicates, and assigns responsibilities. In this series, we'll walk you through the most common reports we see reliability teams using and referring to regularly.

Blameless Reliability Insights: FUA (Follow Up Action) Statuses

Engineering teams use the Reliability Insights feature in Blameless to understand reliability in a holistic way. In addition to tracking incident data, you can keep a pulse on how well teams and workflows are operating. For example, one of the best ways to maximize value from Reliability Insights is to build reports that reflect how your team stays on task, communicates, and assigns responsibilities. In this series, we'll walk you through the most common reports we see reliability teams using and referring to regularly.

Blameless Reliability Insights: How to Build Custom Reports

Engineering teams use the Reliability Insights feature in Blameless to understand reliability in a holistic way. In addition to tracking incident data, you can keep a pulse on how well teams and workflows are operating. For example, one of the best ways to maximize value from Reliability Insights is to build reports that reflect how your team stays on task, communicates, and assigns responsibilities. In this series, we'll walk you through the most common reports we see reliability teams using and referring to regularly.