May 2021

The 7 SRE Principles [And How to Put Them Into Practice]

May 31, 2021 By Emily Arnott In Blameless

Whether you're just adopting SRE or optimizing your current processes, we can help. We’ll explain the 7 key principles of SRE and how to put them into practice. So, what are the SRE principles? The fundamental SRE principles are: SRE is a method that operates through principles. Instead of prescribing specific solutions, it guides you with best practices. These SRE principles help organizations decide what's best for them. Once you understand the principles, you can apply them in many areas.

Incident Management vs. Incident Response - What's the Difference?

May 28, 2021 By Quentin Rousseau In Rootly

What are the differences between incident management and incident response? The answer varies widely depending on whom you ask.

What do site reliability engineers do?

May 25, 2021 By Emily Arnott In Blameless

Are you considering adopting SRE? We will explain the roles and responsibilities of an SRE team within your organization, and how to start building one. So what does an SRE team do? An SRE team is responsible for building software that improves the resiliency of systems, implementing fixes, responding to incidents, and automating processes whenever possible. Site reliability engineering is a holistic practice that incorporates various types of work.

Blameless Runbook Documentation is Now Generally Available!

May 25, 2021 By Blameless Community In Blameless

At Blameless, our mission is to provide teams with the tools they need to operationalize SRE and embrace a culture of resilience. We help teams automate toil and adopt best practices across integrated incident management, comprehensive retrospectives, service level objectives, reliability insights, and more. We are very excited to announce that Blameless Runbook Documentation is now generally available for all customers.

SRE Culture [How to Build a Better Team]

May 24, 2021 By Emily Arnott In Blameless

If you're just adopting SRE or improving your current environment, we’ll help explain SRE culture and how to create a blameless development process. So what is SRE Culture? SRE Culture is founded on these main tenets.

The Incident Review: 4 Odd Incidents Caused by Animals

May 21, 2021 By JJ Tang In Rootly

Incidents and outages caused by animals highlight the importance of flexibility and out-of-the-box thinking when it comes to SRE.

Resilience in Action Episode 7: Killing Ops with Tony Hansmann

May 19, 2021 By Blameless Community In Blameless

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Kurt Andersen. Kurt is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know.

SREview Issue #13 May 2021

May 18, 2021 By Blameless Community In Blameless

Is it a coincidence that “May” and “yay” rhyme? Probably not. This month has been pretty exciting for us here at Blameless, and we’d love to share why. We also have some of our favorite Tweets, content, and events happening in the SRE and resilience engineering community this month.

SRE in the Wild: Great Development Culture through Error Budgets

May 18, 2021 By Reliably In Reliably

Understanding Error Budgets: The who, what, when and why of pre-incident reporting.

Tags: DevOps, SRE

SRE Availability Metrics

May 17, 2021 By John Hasinsky In PagerTree

How available is your website, service, or platform? What must you monitor and measure to ensure availability? How do you translate uptime into availability? This chart has numbers that every Site Reliability Engineer (SRE) should know. Below the chart, you will find answers to commonly asked questions about SRE and associated metrics.
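The chart itself is not reproduced in this listing, but the arithmetic behind "translating uptime into availability" is simple enough to sketch. The snippet below is a minimal illustration, not the post's own method: the function name `allowed_downtime_seconds` and the 365-day default period are assumptions made for the example.

```python
# Illustrative sketch only: the helper name and default period are assumptions,
# not taken from the PagerTree post.
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime, in seconds, that an availability target permits over a period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    # The fraction of the period during which the service may be unavailable.
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_days * SECONDS_PER_DAY


if __name__ == "__main__":
    # Print the yearly downtime allowance for a few common "nines" targets.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_seconds(target) / 60
        print(f"{target}% over 365 days allows about {minutes:.1f} minutes of downtime")
```

Each additional "nine" cuts the allowance by a factor of ten, which is why the gap between 99.9% and 99.99% matters so much in practice.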

A Day in the Life: Intelligent Observability at Work with our SRE, Dinesh

May 17, 2021 By Helen Beal In Moogsoft

When I asked Charlie for permission to attend this year’s AICon (virtual, natch) I thought it would be a shoo-in; learning’s part of my OKRs after all. But he never makes things easy and his ‘yes’ came with a caveat that’s typical when dealing with him. This time, he claimed he didn’t have the budget for the ticket (a likely story!) and I’d have to find another way to get one.

Don't be a victim of your own success: Using Service Levels to give a Consistent User Experience

May 17, 2021 By Reliably In Reliably

Slides available here: https://www.slideshare.net/russmiles/dont-be-a-victim-of-your-own-success-using-service-levels-to-give-a-consistent-user-experience

Tags: DevOps, SRE

Service Level Objectives and SRE: Service Level Overkill with Mick Roper

May 17, 2021 By Reliably In Reliably

How to calculate the impact on service level objectives, and how to harden them.

Tags: DevOps, SRE

SRE vs. DevOps [Understanding Differences & Similarities]

May 17, 2021 By Emily Arnott In Blameless

Site Reliability Engineering (SRE) and DevOps share a goal of building a bridge between development and operations. We'll explore and compare both approaches. Wondering which is better for your company, SRE or DevOps? Neither SRE nor DevOps is “better,” exactly, since they’re similar yet different in a few key ways: SRE, or site reliability engineering, is a methodology developed by Google engineer Ben Treynor Sloss in 2003.

Make your Onboarding Experience Better with a Murder Mystery Game

May 17, 2021 By Blameless Community In Blameless

Onboarding a new tool can be boring. Or stressful. Or both. When onboarding an incident response tool, it can be difficult to make sure that your team is getting the most from the experience. Do you opt for a run-of-the-mill meeting, or try to learn while in an incident? Neither option is ideal. That’s why Petal’s DevOps Engineer Michael Cole found a new way to get his team using Blameless for their incident response process.

Practical Guide to SRE: Using SLOs to Increase Reliability

May 13, 2021 By Quentin Rousseau In Rootly

Service Level Objectives (SLOs) are a key component of any successful Site Reliability Engineering initiative. The questions are: what are SLOs, and how do you determine what yours should be? Once you've done that, how should you use them?

SRE Leaders Panel: Business Agility is what matters, SRE can help you get there

May 11, 2021 By Blameless Community In Blameless

Blameless recently had the privilege of hosting SRE leaders Garima Bajpai, Founder at Community of Practice - DevOps Canada, and Jason Fraser, Delivery Lead at VMware Tanzu, to discuss the value of crisis during incident response, the best and worst tech transformations they’ve seen, how reliability impacts the flow of value, and more.

SRE fundamentals 2021: SLIs vs. SLAs vs. SLOs

May 7, 2021 By Adrian Hilton In Google Operations

A big part of ensuring the availability of your applications is establishing and monitoring service-level metrics—something that our Site Reliability Engineering (SRE) team does every day here at Google Cloud. The end goal of our SRE principles is to improve services and in turn the user experience. The concept of SRE starts with the idea that metrics should be closely tied to business objectives. In addition to business-level SLAs, we also use SLOs and SLIs in SRE planning and practice.
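To make the relationship between the terms concrete, here is a minimal sketch, under stated assumptions, of how an SLI might be measured and checked against an SLO. It treats the SLI as the fraction of "good" events and the SLO as a target for that fraction; the names `SloReport` and `evaluate_slo` are hypothetical, and SLAs (the contractual layer) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    """Result of comparing a measured SLI against an SLO target (hypothetical type)."""
    sli: float               # measured fraction of good events, 0.0 to 1.0
    slo_met: bool            # True when the SLI is at or above the target
    budget_remaining: float  # share of the error budget left; negative if overspent


def evaluate_slo(good_events: int, total_events: int, slo_target: float) -> SloReport:
    """Compute the SLI from event counts and compare it with the SLO target."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    sli = good_events / total_events
    # The error budget is the number of bad events the target tolerates.
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    budget_remaining = 1.0 - actual_bad / allowed_bad
    return SloReport(sli=sli, slo_met=sli >= slo_target, budget_remaining=budget_remaining)


if __name__ == "__main__":
    # Example: 9,995 good requests out of 10,000 against a 99.9% target.
    print(evaluate_slo(good_events=9_995, total_events=10_000, slo_target=0.999))
```

Counting events rather than time is only one way to define an SLI; a latency or time-based indicator would need a different measurement but could be compared with its target in the same way.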

4 Key Characteristics of Modern Monitoring

May 7, 2021 By Heather Miller In Circonus

Our previous post, “Monitoring for Success: What All SREs Need to Know,” discusses how today’s complex IT environments (virtualization, cloud computing, continuous delivery and integration), coupled with pressure to deploy faster while meeting “always on” customer expectations, have placed greater strain on monitoring teams.

Practical Guide to SRE: Automating On-Call

May 6, 2021 By JJ Tang In Rootly

Let's face it: on-call work isn't fun. But it can be better. Even if you have to work on call, it would be nice to have at least some of the work done for you before you drag yourself out of bed at 3 a.m. to respond to an incident.

How Blameless Integrates with Prometheus

May 3, 2021 By Blameless Community In Blameless

Blameless is excited to announce a new source for monitoring data for your SLIs and SLOs. Prometheus is an open source monitoring and alerting solution which is highly customizable.

How Blameless Integrates with New Relic

May 3, 2021 By Blameless Community In Blameless

Blameless is excited to announce a new source for monitoring data for your SLIs and SLOs. New Relic is an observability platform that helps engineers instrument, analyze, troubleshoot, and optimize their entire software stack.

How Blameless Integrates with Pingdom

May 3, 2021 By Blameless Community In Blameless

Blameless is excited to announce a new source for monitoring data for your SLIs and SLOs. Pingdom is a leading monitoring platform that lets users monitor both applications and infrastructure, synthetically and with real user data.

How Blameless Integrates with Datadog

May 3, 2021 By Blameless Community In Blameless

Blameless is excited to announce a new source for monitoring data for your SLIs and SLOs. Datadog is a monitoring and security platform for cloud applications. It brings together end-to-end traces, metrics, and logs to make applications, infrastructure, and third-party services observable.

Improve your Reliability with Blameless SLOs, Now Generally Available

May 3, 2021 By Blameless Community In Blameless

Blameless is excited to announce that our SLO Manager is now generally available! SLO Manager is a new service added to the Blameless platform. This service helps SRE and engineering teams proactively make data-driven decisions about reliability efforts. According to a survey Blameless conducted, over 80% of organizations use SLOs or will in the next 1-2 years.
