Operations | Monitoring | ITSM | DevOps | Cloud

SRE

The latest News and Information on Service Reliability Engineering and related technologies.

Site Reliability Engineering (SRE) explained

Google has introduced so many innovations that it’d be impossible to list them all. And we’re not just talking about the obvious things like search engine algorithms or nearly-ubiquitous programs and apps (Google Maps, Docs, Gmail) — not even self-driving cars. Today, we’re going to talk about one such innovation: Site Reliability Engineering. In a nutshell, SRE it’s a practical framework for software development that improves on even giants like DevOps. Wait, what?

Managing the Looker ecosystem at scale with SRE and DevOps practices

Many organizations struggle to create data-driven cultures where each employee is empowered to make decisions based on data. This is especially true for enterprises with a variety of systems and tools in use across different teams. If you are a leader, manager, or executive focused on how your team can leverage Google's SRE practices or wider DevOps practices, definitely you are in the right place!

Setting up Runbooks in Squadcast | SRE Best Practices | Squadcast

A Runbook is a compilation of routine procedures and operations that are documented for reference while working on a critical incident. Sometimes, it can also be referred to as a Playbook. From this video, learn to create, attach, reference and mark progress for incident resolution using Runbooks.

Introduction to reliability management

Ensuring your digital customer experiences are exceptional is a goal of any modern business. However, managing the reliability of ever more complex applications is a challenge. Developers are releasing new capabilities in fast-moving sprints and the business wants maximum velocity with minimal risk. SRE teams create a structure of continuous improvement that focuses on ensuring the application is reliable above all else.

Introducing Our Newest Integration with ServiceNow

Blameless just released a new integration to ServiceNow’s incident management ticketing solution. If you are a modern DevOps team moving towards SRE practices and you want to speed the time to incident resolution through streamlined, automated workflows, this is worth investigating.

How Retrospective Data Enhances Reliability Insights

When things go wrong, we try to learn for the next time. Every incident should be a learning opportunity to make your system more reliable for the future. Luckily with Blameless Reliability Insights, you can see patterns in incidents at a glance, right out of the box. In fact, the ability to tag incidents makes reliability data even more helpful by allowing you to collect granular details about reliability, especially as they pertain to your unique business needs. ‍

Why SREs Need to Embrace Chaos Engineering

Reliability and chaos might seem like opposite ideas. But, as Netflix learned in 2010, introducing a bit of chaos—and carefully measuring the results of that chaos—can be a great recipe for reliability. Although most software is created in a tightly controlled environment and carefully tested before release, the production environment is harsher and much less controlled.

Top 12 Site Reliability Engineering (SRE) Tools

Ben Treynor Sloss, then VP of Engineering at Google, coined the term “Site Reliability Engineering” in 2003. Site Reliability Engineering, or SRE, aims to build and run scalable and highly available systems. The philosophy behind Site Reliability Engineering is that developers should treat errors as opportunities to learn and improve. SRE teams constantly experiment and try new things to enhance their support systems.