June 2021

SRE Report 2021: The Highlights

Jun 30, 2021 By Anna Jones In Catchpoint

Our fourth annual SRE Report launched last week. This year, for the first time, I had the good fortune to be involved in writing and editing it, alongside our very own driving force Leo Vasiliou, the brilliant Eveline Oehrlich at DevOps Institute (check out Eveline’s take on the report’s Key Takeaways), and a number of folks at VMware Tanzu.

7 Essential Tools for SREs

Jun 25, 2021 By Quentin Rousseau In Rootly

From chaos engineering to monitoring and beyond, SREs rely on several key types of tools to do their jobs.

"Should SRE Be Broken Up?": SRE from Anywhere Recap

Jun 25, 2021 By Anna Jones In Catchpoint

This year’s SRE from Anywhere (SREFA) brought together hundreds of registrants from around the world to gather virtually, share experiences, and network around all things SRE. We were thrilled to see so many friendly faces!

Resilience in Action E8: Vanessa Yiu on Crafting Enterprise Architecture

Jun 23, 2021 By Blameless Community In Blameless

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Kurt Andersen. Kurt is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know.

SREview Issue #14 June 2021

Jun 22, 2021 By Blameless Community In Blameless

Hoping you're headed towards a fun summer season and some time without masks. Let's avoid a new kind of tan-line! This newsletter shares useful industry content and an exciting Blameless product announcement. Find our fave tweets and events in the SRE and resilience engineering community. We're hiring! Check out the job openings here.

Practical Guide to SRE: Incident Severity Levels

Jun 17, 2021 By Nancy Chauhan In Rootly

Incident severity levels are a measurement of the impact an incident has on the business. Classifying the severity of an issue is critical to decide how quickly and efficiently problems get resolved.

Catchpoint SRE Study Reveals a Global Drop in Toil, Warns of Looming Scalability Ceiling, and Highlights the Need for New Operational Capabilities

Jun 15, 2021 By Catchpoint In Catchpoint

Adoption of AIOps is slow.

SRE For Enterprise

Jun 15, 2021 By Blameless In Blameless

Kurt (Head of Strategy), Nicolas (Product Manager), and Paul (Customer Success Manager) from Blameless talk about SRE for the enterprise. They conclude the webinar with an exciting product announcement! Stay tuned, stay blameless.

Service quality and the rising need for enterprise SRE

Jun 15, 2021 By Valerie O'Connell In ServiceNow

In its DevOps 2021 survey of global IT professionals, Enterprise Management Associates (EMA) found that 95% of organizations with highly successful DevOps initiatives were predominantly decentralized and purposefully becoming more so as fast as possible (see Figure 1). This decentralization of development and DevOps teams is making site reliability engineering (SRE) both critical and difficult to achieve.

Complete Guide to Service Level Objectives (SLOs) That Work

Jun 11, 2021 By Noor-ul-Anam Ruqayya In Blameless

Wondering what Service Level Objectives (SLOs) are? In this article, we will explain service level objectives and how they relate to SLAs, SLIs, and error budgets. A Service Level Objective (SLO) is a reliability target that is measured by a Service Level Indicator (SLI) and sometimes serves as a safeguard for a Service Level Agreement (SLA). SLOs represent customer happiness and guide the development team’s velocity.

Here's what SLIs AREN'T

Jun 10, 2021 By Emily Arnott In Blameless

SLIs, or service level indicators, are powerful metrics of service health. They’re often built up from simpler metrics that are monitored from the system. SLIs transform lower-level machine data into something that captures user happiness. Your organization might already have processes with this same goal. Techniques like real-time telemetry and synthetic data also build metrics that meaningfully represent service health.

Smartsheet's SRE Team Takes Center Stage As It Hits The 8M User Mark

Jun 10, 2021 By Anna Jones In Catchpoint

Smartsheet was founded in 2005 with the mission of helping companies simplify and streamline how work is managed. Over three quarters of the Fortune 500 rely on Smartsheet. Through its enterprise platform for dynamic work, Smartsheet aligns people and technology to help businesses move faster, drive innovation, and achieve more.

Are you an MS Teams shop? We've got you Covered with Blameless Incident Resolution

Jun 7, 2021 By Blameless Community In Blameless

We have an exciting announcement. Blameless is providing early access to our Microsoft Teams integration. SRE and engineering teams can now resolve incidents faster without leaving the comfort of their favorite messaging tool. With the Blameless incident resolution product, Microsoft Teams users can now reduce toil in routine incident response processes through automation, codify processes with checklists, and craft retrospectives with the ‘add to timeline’ command.

How Lowe's meets customer demand with Google SRE practices

Jun 7, 2021 By Vivek Balivada In Google Operations

At Lowe’s, we’ve made significant progress in our multiyear technology transformation. To modernize our systems and build new capabilities for our customers and associates, we leverage Google’s SRE framework and Google Cloud, which helps us meet their needs faster and more effectively. With these efforts, we’ve been able to go from one release every two weeks to 20+ releases daily—about 20X more releases per month.

Alert Stalking

Jun 7, 2021 By Splunk In Splunk

For SREs, alerts are part of the job. But they shouldn’t be when you’re not on call, or when the problem isn’t yours. Watch a day in the life of an SRE, and see how Splunk Observability Cloud helps him put useless alerts—and complexity—to bed.

The Incident Review: 4 Times When Typos Brought Down Critical Systems

Jun 3, 2021 By JJ Tang In Rootly

Sometimes, as these 4 incidents highlight, major failure results from a mere typo or configuration oversight.

Error Budgets Explained (And How to Make One for Your Team)

Jun 2, 2021 By Noor-ul-Anam Ruqayya In Blameless

Wondering what error budgets (EBs) are and how they are useful? We explain what they are, how they are defined, and how they can help your team. An error budget is the amount of acceptable unreliability a service can have before customer happiness is impacted. If a service is well within its budget, the developers can take more risks in their releases. If not, developers need to make safer choices.

Google Cloud, Vodafone and Datadog SRE Panel Webinar

Jun 1, 2021 By Datadog In Datadog

Since originating at Google, site reliability engineering (SRE) has enabled countless teams to effectively manage large-scale systems, improve the stability of complex services, and automate operational tasks using software. In this SRE panel, Yuri Grinshteyn (Customer Reliability Engineer, Google) will speak about the core principles of SRE and how the culture is practiced at Google. He will be joined by Llywelyn Griffith-Swain (SRE Manager, Vodafone), who will share Vodafone’s story of adopting SRE, lessons learned, and their best practices for maintaining the cultural shift across teams.
