The latest news and information on Site Reliability Engineering and related technologies.

Smartsheet's SRE Team Takes Center Stage As It Hits The 8M User Mark

Jun 10, 2021 By Anna Jones In Catchpoint

Smartsheet was founded in 2005 with the mission of helping companies simplify and streamline how work is managed. Over three quarters of the Fortune 500 rely on Smartsheet. Through its enterprise platform for dynamic work, Smartsheet aligns people and technology to help businesses move faster, drive innovation, and achieve more.

Alert Stalking

Jun 7, 2021 By Splunk In Splunk

For SREs, alerts are part of the job. But they shouldn’t be when you’re not on call, or when the problem isn’t yours. Watch a day in the life of an SRE, and see how Splunk Observability Cloud helps him put useless alerts—and complexity—to bed.

How Lowe's meets customer demand with Google SRE practices

Jun 7, 2021 By Vivek Balivada In Google Operations

At Lowe’s, we’ve made significant progress in our multiyear technology transformation. To modernize our systems and build new capabilities for our customers and associates, we leverage Google’s SRE framework and Google Cloud, which helps us meet their needs faster and more effectively. With these efforts, we’ve been able to go from one release every two weeks to 20+ releases daily, about 300X more releases per month.

The Incident Review: 4 Times When Typos Brought Down Critical Systems

Jun 3, 2021 By JJ Tang In Rootly

Sometimes, as these 4 incidents highlight, major failure results from a mere typo or configuration oversight.

Google Cloud, Vodafone and Datadog SRE Panel Webinar

Jun 1, 2021 By Datadog In Datadog

Since originating at Google, site reliability engineering (SRE) has enabled countless teams to effectively manage large-scale systems, improve the stability of complex services, and automate operational tasks using software. In this SRE panel, Yuri Grinshteyn (Customer Reliability Engineer, Google) will speak about the core principles of SRE and how the culture is practiced at Google. He will be joined by Llywelyn Griffith-Swain (SRE Manager, Vodafone), who will share Vodafone’s story of adopting SRE, lessons learned, and their best practices for maintaining the cultural shift across teams.

Incident Management vs. Incident Response - What's the Difference?

May 28, 2021 By Quentin Rousseau In Rootly

What are the differences between incident management and incident response? The answer varies widely depending on whom you ask.

The Incident Review: 4 Odd Incidents Caused by Animals

May 21, 2021 By JJ Tang In Rootly

Incidents and outages caused by animals highlight the importance of flexibility and out-of-the-box thinking when it comes to SRE.

SRE Availability Metrics

May 17, 2021 By John Hasinsky In PagerTree

How available is your website, service, or platform? What must you monitor and measure to ensure availability? How do you translate uptime into availability? This chart has numbers that every Site Reliability Engineer (SRE) should know. Below the chart, you will find answers to commonly asked questions about SRE and associated metrics.
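The uptime-to-availability translation mentioned above comes down to simple arithmetic. The following is a minimal sketch, not taken from the PagerTree post: the function names and the 365-day default period are assumptions chosen for illustration.

```python
def availability_percent(uptime_seconds: float, total_seconds: float) -> float:
    """Availability as the percentage of the period the service was up."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= uptime_seconds <= total_seconds:
        raise ValueError("uptime_seconds must be between 0 and total_seconds")
    return 100.0 * uptime_seconds / total_seconds


def allowed_downtime_seconds(availability_pct: float, period_days: float = 365.0) -> float:
    """Downtime permitted over a period for a given availability target.

    period_days defaults to 365 (an assumption; pick the window you report on).
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    # Print the yearly downtime allowance for a few common targets.
    for target in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_seconds(target) / 3600
        print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

For instance, under these assumptions a 99.9% target over 365 days permits roughly 8.76 hours of downtime.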

A Day in the Life: Intelligent Observability at Work with our SRE, Dinesh

May 17, 2021 By Helen Beal In Moogsoft

When I asked Charlie for permission to attend this year’s AICon (virtual, natch) I thought it would be a shoo-in; learning’s part of my OKRs after all. But he never makes things easy and his ‘yes’ came with a caveat that’s typical when dealing with him. This time, he claimed he didn’t have the budget for the ticket (a likely story!) and I’d have to find another way to get one.

Practical Guide to SRE: Using SLOs to Increase Reliability

May 13, 2021 By Quentin Rousseau In Rootly

Service Level Objectives (SLOs) are a key component of any successful Site Reliability Engineering initiative. The questions are: what are SLOs, and how do you determine what yours should be? Once you've done that, how should you use them?
