July 2022

Classifying Severity Levels for Your Organization

Jul 29, 2022 By Nir Sharma In Squadcast

Major outages are bound to occur in even the most well-maintained infrastructure and systems, and being able to quickly classify their severity level allows your on-call team to respond more effectively. Imagine a scenario where your on-call team is getting critical alerts every 15 minutes, user complaints are piling up on social media, and, since your platform is inoperative, revenue losses are mounting every minute. How do you go about getting your application back on track? This is where understanding incident severity and priority can be invaluable. In this blog post we look at severity levels and how they can improve your incident response process.

Setting up Runbooks in Squadcast | SRE Best Practices | Squadcast

Jul 29, 2022 By Squadcast In Squadcast

A Runbook is a compilation of routine procedures and operations that are documented for reference while working on a critical incident. Sometimes, it can also be referred to as a Playbook. From this video, learn to create, attach, reference and mark progress for incident resolution using Runbooks.

View Video

Introduction to reliability management

Jul 29, 2022 By Sumo Logic In Sumo Logic

Ensuring your digital customer experiences are exceptional is a goal of any modern business. However, managing the reliability of ever more complex applications is a challenge. Developers are releasing new capabilities in fast-moving sprints and the business wants maximum velocity with minimal risk. SRE teams create a structure of continuous improvement that focuses on ensuring the application is reliable above all else.

View Video

Site Reliability Engineering (SRE) explained

Jul 29, 2022 By Emiliano Pardo Saguier In InvGate

Google has introduced so many innovations that it’d be impossible to list them all. And we’re not just talking about the obvious things like search engine algorithms, nearly ubiquitous programs and apps (Google Maps, Docs, Gmail), or even self-driving cars. Today, we’re going to talk about one of those innovations: Site Reliability Engineering. In a nutshell, SRE is a practical framework for software development that improves on even giants like DevOps. Wait, what?

Managing the Looker ecosystem at scale with SRE and DevOps practices

Jul 29, 2022 By Saurabh Bangad In Google Operations

Many organizations struggle to create data-driven cultures where each employee is empowered to make decisions based on data. This is especially true for enterprises with a variety of systems and tools in use across different teams. If you are a leader, manager, or executive focused on how your team can leverage Google's SRE practices or wider DevOps practices, you are definitely in the right place!

Introducing Our Newest Integration with ServiceNow

Jul 28, 2022 By Nicolas Philip In Blameless

Blameless just released a new integration with ServiceNow’s incident management ticketing solution. If you are a modern DevOps team moving towards SRE practices and you want to shorten the time to incident resolution through streamlined, automated workflows, this is worth investigating.

How Retrospective Data Enhances Reliability Insights

Jul 21, 2022 By Emily Arnott In Blameless

When things go wrong, we try to learn for the next time. Every incident should be a learning opportunity to make your system more reliable for the future. Luckily, with Blameless Reliability Insights, you can see patterns in incidents at a glance, right out of the box. In fact, the ability to tag incidents makes reliability data even more helpful by allowing you to collect granular details about reliability, especially as they pertain to your unique business needs.

Why SREs Need to Embrace Chaos Engineering

Jul 20, 2022 By xMatters In xMatters

Reliability and chaos might seem like opposite ideas. But, as Netflix learned in 2010, introducing a bit of chaos—and carefully measuring the results of that chaos—can be a great recipe for reliability. Although most software is created in a tightly controlled environment and carefully tested before release, the production environment is harsher and much less controlled.

True Cost Unplanned Application Downtime

Jul 20, 2022 By Catrin Haberfield In Reliably

It can be a big can of worms, but tackling IT downtime can be the first step to major cost savings. Here’s everything you need to know about downtime but were too afraid to ask.

Top 12 Site Reliability Engineering (SRE) Tools

Jul 20, 2022 By Eyal Katz In Lightrun

Ben Treynor Sloss, then VP of Engineering at Google, coined the term “Site Reliability Engineering” in 2003. Site Reliability Engineering, or SRE, aims to build and run scalable and highly available systems. The philosophy behind Site Reliability Engineering is that developers should treat errors as opportunities to learn and improve. SRE teams constantly experiment and try new things to enhance their support systems.

Incident Response Platform: What Is It & Do You Need One?

Jul 19, 2022 By Myra Nizami In Blameless

Looking into incident response platforms? We discuss what an incident response platform is, what tasks it handles, and the benefits of having one.

Achieving Five Nines With Reliability

Jul 16, 2022 By Catrin Haberfield In Reliably

Is 99.999% uptime realistic? We cover why you should care, and how you can achieve it.
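
To put that figure in perspective, the downtime budget implied by an uptime target can be worked out directly. The sketch below is only an illustration and is not taken from the linked post: the function name `downtime_budget_minutes` and the 365-day year are assumptions made here.

```python
# Downtime budget implied by an uptime target.
# Assumption: a 365-day year; the helper name is hypothetical.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for an uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime allows about {budget:.2f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime for the whole year, which is why the question of realism comes up at all.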

Top 10 Reasons For A Site Reliability Platform

Jul 15, 2022 By Mbaoma Mary In Reliably

A system’s reliability is one of the most important things that engineers should care about. Reliable systems keep customers happy and organizations profitable. Investing in processes and tools that keep systems reliable can be critical to company success. Site reliability platforms are a popular choice for monitoring and observing software services, as they make responding to and solving application problems easier.

Promoted to SRE Advocate: A Dream Turned Reality

Jul 14, 2022 By Matt Davis In Blameless

I get chills thinking about a line from the first film adaptation of Roald Dahl's Charlie and the Chocolate Factory. Gene Wilder, as Wonka, nearly whispers it to Charlie, as if it is secret information: "We are the music makers, and we are the dreamers of dreams." For me, the quote (taken from a poem by Arthur O'Shaughnessy) is austere: we are the creators of what we create, and what we create becomes what we are.

Monitoring Your Platform From Multiple Locations

Jul 14, 2022 By Andrei Danilov In Rootly

Mature start-ups and scale-ups create wonderful and challenging environments for engineers. As the product they’re creating matures and the brand becomes successful, the user base generally starts growing and, for some companies, in places they might not expect it to grow. As that happens, new challenges arise for engineers. One of these challenges is pretty straightforward to guess: making the product available throughout different regions of the world.

How To Minimise Alert Fatigue In SRE

Jul 14, 2022 By Aimee Pearcy In Reliably

Alert fatigue occurs when people become desensitized to the overwhelming number of alerts they receive and are expected to respond to. Even though these alerts are typically easy to respond to, it is the sheer number of them that ultimately causes people to feel fatigued. The higher the number of alerts, the more likely employees are to begin ignoring them and potentially miss an important alert, leading to bigger consequences.

Amazon OpenSearch + Squadcast Integration: Routing Alerts Made Easy

Jul 12, 2022 By Vishal Padghan In Squadcast

Developers often find comfort in embracing open-source software for numerous reasons. One of the most important is the freedom to use that software anywhere and however they wish. Amazon OpenSearch is an open-source search and analytics suite derived from Elasticsearch. It lets you perform interactive log analytics and real-time application monitoring with ease.

7 ways tagging incidents can teach you about system health

Jul 12, 2022 By Emily Arnott In Blameless

One of the most powerful ways to prepare for future incidents is to study and learn from patterns in past incidents. Blameless Reliability Insights highlights these patterns for you, with out-of-the-box dashboards that automatically collect and present all types of statistical information about your incidents.

Blameless Reliability Insights: How to Build Custom Reports

Jul 7, 2022 By Blameless In Blameless

Engineering teams use the Reliability Insights feature in Blameless to understand reliability in a holistic way. In addition to tracking incident data, you can keep a pulse on how well teams and workflows are operating. For example, one of the best ways to maximize value from Reliability Insights is to build reports that reflect how your team stays on task, communicates, and assigns responsibilities. In this series, we'll walk you through the most common reports we see reliability teams using and referring to regularly.

View Video

Custom Reliability Insights Reports: Follow Up Action Items

Jul 7, 2022 By Blameless In Blameless

View Video

Blameless Reliability Insights: FUA (Follow Up Action) Statuses

Jul 7, 2022 By Blameless In Blameless

View Video

SRE Roles and Responsibilities Defined

Jul 6, 2022 By Myra Nizami In Blameless

SRE is a practice that creates a bridge between operations and development. We discuss the roles and responsibilities of a site reliability engineer.

Improving Team Health With Reliably

Jul 2, 2022 By Sylvain Hellegouarch In Reliably

Improving team health within DevOps is vital for success in any engineering team. In this article, we’ll look at some of the ways that you can improve team health with Reliably so you can keep your developers happier, healthier and free from burnout.
