Latest Posts

Here are the Metrics you Need to Understand Operational Health

Aug 19, 2020 By Blameless Community In Blameless

In recent polls we’ve conducted with engineers and leaders, around 70% of participants said MTTA and MTTR were among their main metrics, 20% cited looking at planned versus unplanned work, and 10% said they currently look at no metrics. While MTTA and MTTR are good starting points, they're no longer enough. As systems grow more complex, it can be difficult to gain insight into your services’ operational health.


Resilience in Action, E5: Tammy Bryant and Eric Roberts The Importance of Glue Work

Aug 14, 2020 By Blameless Community In Blameless

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Blameless Staff SRE Amy Tobey. Amy has been an SRE and DevOps practitioner since before those names existed. She cares deeply about her community of SREs and wants to take what she’s learned over the 20+ years of her career to help others.


Choosing the Right SRE Tools

Aug 13, 2020 By Emily Arnott In Blameless

Implementing SRE practices and culture can be challenging. Fortunately, there are a variety of tools for each aspect of SRE: monitoring, SLOs and error budgeting, incident management, incident retrospectives, alerting, chaos engineering, and more. In this blog, we’ll talk about what to look for in an SRE tool, and how the right tools will help you on your journey to reliability excellence.


Look Upstream to Solve your Team's Reliability Issues

Aug 12, 2020 By Hannah Culver In Blameless

In “Upstream” by Dan Heath, we explore a variety of problems ranging from homelessness, to high school graduation rates, to the state of sidewalks in different neighborhoods within the same city. In each of these examples, Dan discusses how upstream thinking decreased downstream work. Upstream thinking is characterized as proactive, collective action to improve outcomes rather than reaction after an issue has already occurred.


The Importance of Reliability Engineering

Aug 6, 2020 By Emily Arnott In Blameless

If you’ve spent any time in tech circles lately, there are three letters you’ve surely heard: SRE. Site Reliability Engineering is the defining movement in tech today. Giants like Google and Amazon market their ability to provide reliable service, and startups are now investing in reliability as an early priority. But what makes reliability engineering so important?


Improving Postmortems from Chores to Masterclass with Paul Osman

Aug 5, 2020 By Blameless Community In Blameless

At our 2019 Blameless Summit, Paul Osman spoke about how to take postmortems or incident retrospectives to a new level. The following transcript has been lightly edited for clarity. Slides from this talk are available here. Paul Osman: I lead the SRE team at Under Armour. Who here knows about Under Armour as a tech company? Does anybody think about Under Armour as a tech company? Under Armour makes athletic attire, shirts and shoes.


How to Bring Operational Experience to your Development with Github's Lauren Rubin

Aug 4, 2020 By Blameless Community In Blameless

At the 2019 Blameless Summit, Lauren Rubin spoke about how to bring operational expertise to development teams. The following transcript has been lightly edited for clarity. Lauren Rubin: I was going to ask for a show of hands of how many people here are on call right this minute. I am actually on call right this minute. I like to live dangerously. If my phone beeps, the specific noise that means I have been paged, I'm sorry, I am going to look at it.


How to Improve On-Call with Better Practices and Tools

Jul 30, 2020 By Emily Arnott In Blameless

In the era of reliability, where mere minutes of downtime or latency can cost hundreds of thousands of dollars, 24x7 availability and on-call coverage to respond to incidents have become requirements for the vast majority of organizations. But setting up an on-call system that drives effective incident response while minimizing the stress placed on engineers isn’t a trivial task.


Enabling the Stripe and Lyft Platforms Through Modern Safety Science

Jul 29, 2020 By Blameless Community In Blameless

Jacob Scott is an experienced engineer and enthusiastic participant in the resilience engineering community, having spent time caring for the technology systems powering high-growth startups as well as unicorns like Lyft and Stripe. He is deeply passionate about how to apply learnings from modern safety science to real, complex socio-technical systems.


How to Choose Monitoring Tools for DevOps and SRE

Jul 23, 2020 By Emily Arnott In Blameless

When developing for reliability or implementing resilient DevOps practices, the heart of your decision-making is data. Without carefully monitoring key metrics like uptime, network load, and resource usage, you’ll be blind to where to spend development efforts or refine operation practices. Fortunately, a wide variety of monitoring tools are available to help you collect and get visibility into this data.

