The latest news and information on Site Reliability Engineering and related technologies.

SRE fundamentals 2021: SLIs vs. SLAs vs. SLOs

May 7, 2021 By Adrian Hilton In Google Operations

A big part of ensuring the availability of your applications is establishing and monitoring service-level metrics—something that our Site Reliability Engineering (SRE) team does every day here at Google Cloud. The end goal of our SRE principles is to improve services and in turn the user experience. The concept of SRE starts with the idea that metrics should be closely tied to business objectives. In addition to business-level SLAs, we also use SLOs and SLIs in SRE planning and practice.

Practical Guide to SRE: Automating On-Call

May 6, 2021 By JJ Tang In Rootly

Let's face it: on-call work isn't fun. But it can be better. Even if you have to work on call, it would be nice to have at least some of the work done for you before you drag yourself out of bed at 3 a.m. to respond to an incident.

7 Ways SRE Is Changing IT Ops And How To Prepare For Those Changes

Apr 29, 2021 By Squadcast Community In Squadcast

SRE best practices are disrupting and catalyzing change in the ways organizations approach IT Operations. In this blog we look at 7 ways SRE is bringing about this transition. Site Reliability Engineering is a new practice that has been growing in popularity among many businesses. Also known as SRE, the new activity puts a premium on monitoring, tracking bugs, and creating systems and automations that solve the problem in the long term.

How Kubernetes Can Both Help and Hinder Incident Management Teams

Apr 29, 2021 By Quentin Rousseau In Rootly

Kubernetes makes it easier in certain ways to manage reliability. But incident response teams and SREs must also be prepared to handle the unique reliability challenges that K8s creates.

Creating Chaos to Achieve Reliability

Apr 22, 2021 By JJ Tang In Rootly

How can creating chaos achieve better reliability? Chaos and reliability might seem mutually exclusive, but through the use of Chaos Engineering, SREs can bring about meaningful changes to system resiliency.

Using Coralogix + StackPulse to Automatically Enrich Alerts and Manage Incidents

Apr 20, 2021 By Jonathan Brown In Coralogix

Keeping digital services reliable is more important than ever. When something goes wrong in production, on-call teams face significant pressure to identify and resolve the incident quickly to keep customers happy. But it can be difficult to get the right signals to the right person in a timely fashion.

Should You Be an SRE or a DevOps Engineer?

Apr 15, 2021 By Quentin Rousseau In Rootly

SREs may have better long-term job prospects, but DevOps might be an easier career to pursue.

Creating Custom Slack Commands

Apr 15, 2021 By FireHydrant In FireHydrant

Site Reliability Engineers are expected to know everything that’s happening, all of the time. That’s a lot of things! To help you sift through the noise, we’ve developed a feature that lets you find accurate data about your organization on-demand. You can do this by sending custom-designed commands to FireHydrant directly from your integrated Slack account.

Catchpoint Announces Virtual SRE Community Event on June 10

Apr 13, 2021 By Catchpoint In Catchpoint

'SRE From Anywhere' will be the largest community event for Site Reliability Engineers to learn and share best practices for delivering the best digital performance.

How Would an SRE Conduct a Postmortem on the Suez Canal Incident?

Apr 7, 2021 By JJ Tang In Rootly

The Suez Canal has been big news over the last couple of weeks. We wondered how a Site Reliability Engineer (SRE) might conduct a postmortem on what happened with the Ever Given, and what that might mean if a comparable incident occurred at a modern tech company.
