April 2021

Monitoring for Success: What All SREs Need to Know

Apr 29, 2021 By Heather Miller In Circonus

The last ten years have seen a massive change in how IT operations and development enable business success. From virtualization and cloud computing to continuous delivery, continuous integration, and rapid application development, IT has never been more complex or more critical to creating competitive advantage. To support increasingly web-scale IT operations and wide-scale cloud adoption, applications now operate as services.

7 Ways SRE Is Changing IT Ops And How To Prepare For Those Changes

Apr 29, 2021 By Squadcast Community In Squadcast

SRE best practices are disrupting and catalyzing change in the ways organizations approach IT Operations. In this blog post, we look at 7 ways SRE is bringing about this transition. Site Reliability Engineering is a new practice that has been growing in popularity among many businesses. Also known as SRE, the new activity puts a premium on monitoring, tracking bugs, and creating systems and automations that solve problems in the long term.

How Kubernetes Can Both Help and Hinder Incident Management Teams

Apr 29, 2021 By Quentin Rousseau In Rootly

Kubernetes makes it easier in certain ways to manage reliability. But incident response teams and SREs must also be prepared to handle the unique reliability challenges that K8s creates.

SRE Leader Panel: Business Agility is what matters, SRE can help you get there

Apr 29, 2021 By Blameless In Blameless

Ready for another SRE Thought Leader Panel? This one is themed "Business Agility is what matters, SRE can help you get there." We’re chatting about topics like the value of crisis during incident response, the best and worst tech transformations we’ve seen, how reliability impacts the flow of value, and more. This panel is hosted by Chris Hendrix, staff software engineer at Blameless, and features guests.

What is Site Reliability Engineering [Simple Intro to SRE]

Apr 26, 2021 By Emily Arnott In Blameless

Wondering what SRE is all about? We will explain what it is, how it works, why it was developed, and how it can help your organization. So what is SRE (Site Reliability Engineering)? SRE is a methodology that fuses software and operations teams, with the goal of producing reliable, resilient, and scalable systems. Site Reliability Engineering (SRE) was developed by Google engineer Ben Treynor Sloss in 2003. Google’s goal was to increase the reliability of its sites and services.

4 Characteristics of Monitoring Essential to Implementing DevOps

Apr 23, 2021 By Theo Schlossnagle In Circonus

In the new world of rapid releases, continuous change, and increasingly high user expectations, more organizations are embracing DevOps. One of the primary drivers for adopting DevOps is speed — particularly the reduction of risk at speed. As DevOps seeks to reduce risk and deliver insight at an increasingly faster pace, new tools have emerged in the monitoring space. But these tools alone will not deliver us into the low-risk world of DevOps — not without new and updated thinking.

Creating Chaos to Achieve Reliability

Apr 22, 2021 By JJ Tang In Rootly

How can creating chaos achieve better reliability? Chaos and reliability might seem mutually exclusive, but through the use of Chaos Engineering, SREs can bring about meaningful changes to system resiliency.

SREview Issue #12 April 2021

Apr 20, 2021 By Blameless Community In Blameless

Spring is here! We have rain! We have flowers! We have allergies! We also have some of the most exciting Tweets, content, and events happening in the SRE and resilience engineering community this month.

Using Coralogix + StackPulse to Automatically Enrich Alerts and Manage Incidents

Apr 20, 2021 By Jonathan Brown In Coralogix

Keeping digital services reliable is more important than ever. When something goes wrong in production, on-call teams face significant pressure to identify and resolve the incident quickly – in order to keep customers happy. But it can be difficult to get the right signals to the right person in a timely fashion.

Resilience in Action E6: Oversize Coffee Mugs, SLOs, and ML with Todd Underwood

Apr 19, 2021 By Blameless Community In Blameless

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Kurt Andersen. Kurt is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know.

Should You Be an SRE or a DevOps Engineer?

Apr 15, 2021 By Quentin Rousseau In Rootly

SREs may have better long-term job prospects, but DevOps might be an easier career to pursue.

Creating Custom Slack Commands

Apr 15, 2021 By FireHydrant In FireHydrant

Site Reliability Engineers are expected to know everything that’s happening, all of the time. That’s a lot of things! To help you sift through the noise, we’ve developed a feature that lets you find accurate data about your organization on demand. You can do this by sending custom-designed commands to FireHydrant directly from your integrated Slack account.

Catchpoint Announces Virtual SRE Community Event on June 10

Apr 13, 2021 By Catchpoint In Catchpoint

'SRE From Anywhere' will be the largest community event for Site Reliability Engineers to learn and share best practices for delivering the best digital performance.

What are MTTx Metrics Good For? Let's Find Out.

Apr 13, 2021 By Emily Arnott In Blameless

Data helps best-in-class teams make the right decisions. Analyzing your system’s metrics shows you where to invest time and resources. A common type of metric is Mean Time to X, or MTTx. These metrics detail the average time it takes for something to happen. The “x” can represent events or stages in a system’s incident response process. Yet, MTTx metrics rarely tell the whole story of a system’s reliability.

Having On-call Nightmares? Runbooks can Help you Wake Up.

Apr 12, 2021 By Harry Hull In Blameless

You aren't sure how long you've been here, but the view outside the window sure is soothing. Before you can fully take in your surroundings, a siren rips you back into the conscious world. Slowly, you begin to piece together that you exist, and you are on call. The ringing, much louder now, pierces through your skull as you begin to open your bleary eyes. You turn over your pillow, grab your phone, and click through the PagerDuty notification.

How Would an SRE Conduct a Postmortem on the Suez Canal Incident?

Apr 7, 2021 By JJ Tang In Rootly

The Suez Canal has been big news over the last couple of weeks. We wondered how a Site Reliability Engineer (SRE) might conduct a postmortem on what happened with the Ever Given, and what that might mean if a comparable incident occurred at a modern tech company.

SRE Leaders Panel: SRE Adoption as Organizational Transformation

Apr 6, 2021 By Blameless Community In Blameless

Blameless recently had the privilege of hosting SRE leaders Kurt Andersen, SRE Architect at Blameless; Vanessa Yiu, Executive Director, Enterprise Architecture at Goldman Sachs; and Tony Hansmann, Former Global CTO at Pivotal Software, Inc.

How Netflix Uses Fault Injection To Truly Understand Their Resilience

Apr 6, 2021 By Thomas Russell In Coralogix

Distributed systems such as microservices have defined software engineering over the last decade. The majority of advancements have been in increasing resilience, flexibility, and rapidity of deployment at increasingly larger scales. For streaming giant Netflix, the migration to a complex cloud-based microservices architecture would not have been possible without a revolutionary testing method known as fault injection. With tools like Chaos Monkey, Netflix employs a cutting-edge testing toolkit.

So you Want an SRE Tool. Do you Build, Buy, or Open Source?

Apr 5, 2021 By Emily Arnott In Blameless

As your organization’s reliability needs grow, you may consider investing in SRE tools. Tooling can make many processes more efficient, consistent, and repeatable. When you decide to invest in tooling, one of the major decisions is how you’ll source your tools. Will you buy an out-of-the-box tool, build one in-house, or work with an open source project? This is a big decision. Switching methods halfway through adoption is costly and can cause thrash.

Product Update: Upgrade to Exporting your Retrospectives

Apr 2, 2021 By Blameless Community In Blameless

Blameless is excited to announce an enhancement to our Incident Retrospective tool! The Export feature now allows for customizable retrospectives.

How SREs Can React to COVID-19's Impact on Incident Management

Apr 2, 2021 By Quentin Rousseau In Rootly

By adding new complexity to reliability engineering and making physical war rooms a thing of the past, COVID-19 has imposed permanent changes on incident management. Here’s how SREs can respond.
