July 2020

How to Improve On-Call with Better Practices and Tools

Jul 30, 2020 By Emily Arnott In Blameless

In the era of reliability, where mere minutes of downtime or latency can cost hundreds of thousands of dollars, 24x7 availability and on-call coverage to respond to incidents have become requirements for the vast majority of organizations. But setting up an on-call system that drives effective incident response while minimizing the stress placed on engineers isn’t a trivial task.

Enabling the Stripe and Lyft Platforms Through Modern Safety Science

Jul 29, 2020 By Blameless Community In Blameless

Jacob Scott is an experienced engineer and enthusiastic participant in the resilience engineering community, having spent time caring for the technology systems powering high-growth startups as well as unicorns like Lyft and Stripe. He is deeply passionate about how to apply learnings from modern safety science to real, complex socio-technical systems.

How to Choose Monitoring Tools for DevOps and SRE

Jul 23, 2020 By Emily Arnott In Blameless

When developing for reliability or implementing resilient DevOps practices, the heart of your decision-making is data. Without carefully monitoring key metrics like uptime, network load, and resource usage, you won’t know where to spend development effort or how to refine operational practices. Fortunately, a wide variety of monitoring tools are available to help you collect and get visibility into this data.

DevOps
SRE

Leaders, Here's how to Encourage Full Service Ownership

Jul 22, 2020 By Hannah Culver In Blameless

Service ownership is becoming common practice and its benefits are well-known. These perks include happier customers, aligned teams, and fewer incidents. While this sounds great, it’s often easier said than done, requiring a culture and mindset shift. Leadership will need to encourage and empower teams to adopt the “you build it, you run it” mentality. Here are some ways leaders can help get teams on board.

SREview Issue #3 July 2020

Jul 21, 2020 By Blameless Community In Blameless

Here’s the July issue of SREview! This monthly zine features epic Tweets, content, and events happening in the SRE and resilience engineering community.

How SLOs Help Your Team with Service Ownership

Jul 21, 2020 By Hannah Culver In Blameless

Service ownership is becoming a best practice for teams looking to innovate while maintaining the level of reliability that customers expect. Service ownership means seeing the service through its entire lifecycle. In short, it means you build it, you run it. You’ll be responsible for the service’s security, reliability, performance, and quality. This doesn’t mean you won’t have help from SREs to optimize or automate toil.

Webinar: Modern Metrics to Understand Operational Health

Jul 21, 2020 By Blameless In Blameless

In this webinar, you'll learn which SRE metrics give you better insight into operational health. We walk through common challenges and pain points in understanding operational health, the metrics to measure at each stage of your maturity journey, and a live demo showing solutions in action.

The Essential List of Top SRE Resources

Jul 17, 2020 By Emily Arnott In Blameless

Are you looking to get up to speed on SRE fundamentals with the best SRE books and best DevOps books? Or are you hoping to expand your SRE knowledge into new domains? Either way, we’ve got you covered in our list of essential SRE resources!

5 Tips for Getting Alert Fatigue Under Control

Jul 16, 2020 By Hannah Culver In Blameless

What happens when you receive a notification that something is wrong with your system and you have no clue what it means, or why you’re receiving that alert? Maybe you have to parse through the alert conditions to suss out what the alert indicates, or maybe you need to ping a coworker and ask. Not knowing what to do with an alert also contributes to alert fatigue, because it increases the toil and time required to respond.

Leadership and Innovation with Instacart's VP of Infrastructure

Jul 15, 2020 By Blameless Community In Blameless

Blameless CEO Ashar Rizqi recently had the pleasure of interviewing Dustin Pearce in a virtual executive fireside chat and AMA. Dustin is an experienced leader in scaling hyper-growth, cloud-native companies: he is the VP of Infrastructure at Instacart and previously served as Head of Service Engineering at Slack.

Promoting Continuous Learning with SRE

Jul 14, 2020 By Hannah Culver In Blameless

With the extreme changes we’ve all been through these last several months, it should come as no surprise that our jobs have changed drastically, too. We’re working remotely. We’re dealing with increased resource constraints. Our services are receiving more traffic than usual, and we’re tasked with keeping things up and running. Our work-as-done may not match what we did at the beginning of 2020.

Teamwork and Culture in the Era of Remote Work

Jul 13, 2020 By Hannah Culver In Blameless

With decreased resources, increased stress and cognitive load, and social distancing policies, many teams are under extreme pressure. Without over-communication and special attention paid to organizational culture, teams can become fractured, anxious, or disillusioned.

Using Automation and SLOs to Create Margin in your Systems

Jul 10, 2020 By Hannah Culver In Blameless

With the difficulties we’re facing during this time, it can be hard to keep up with the growing demand for our services. You need to make use of all the tools in your toolbelt in order to conserve your team’s cognitive resources. Two ways you can do this are automating toil out of your processes and prioritizing with SLOs.

Minimizing SPOFs During Summer Slowdown

Jul 9, 2020 By Hannah Culver In Blameless

Between COVID-19 and the typical summer slowdown, offices are emptier than they’ve ever been. With team members taking some much-needed time off, it’s important to know how your team will be affected. Here are some tips to help your teams function during this time of flux.

How to Classify Incidents

Jul 8, 2020 By Emily Arnott In Blameless

Incident classification is a standardized way of organizing incidents into established categories. Incidents can include outages caused by errors in code, hardware failures, resource deficits, or anything else that disrupts normal operations. Each new incident should fit into a category based on the areas of the service affected, and be ranked by severity. Each of these classifications should have an established response procedure associated with it.

Google Cloud OnAir with CEO Ashar Rizqi: Benefits of Cloud Infrastructure

Jul 7, 2020 By Blameless Community In Blameless

CEO Ashar Rizqi had the pleasure of being a guest on Google Cloud OnAir, a Google Cloud Customer Interview Series. Ashar and interviewer Jimmy Sopko discussed how Blameless has extended our runway using Google Cloud and Google Kubernetes Engine and how the team cultivates a culture of site reliability in a changing world.

Blameless' SRE Journey

Jul 6, 2020 By Blameless Community In Blameless

SRE is a practice adopted by best-in-class companies all over the world. As a software reliability platform purpose-built for SREs, Blameless strives to practice what we preach and utilizes SRE best practices daily to cultivate a culture of resilience. However, this wasn’t always the case.

SRE Leaders Panel: Managing Systems Complexity

Jul 2, 2020 By Blameless Community In Blameless

In our previous panel, we spoke about how to overcome imposter syndrome in high-tempo situations, and how culture directly affects the availability of our systems. Building on that discussion, we gathered leading minds in the resilience industry to discuss how SRE can manage systems complexity, and how that's tightly intertwined with business health, especially in the context of current health and social crises.

Getting the Most Out of SRE, SLOs, and Error Budgets with Joseph Bironas at Collective Health

Jul 2, 2020 By Blameless In Blameless

Joseph Bironas shares the often-overlooked but critical insights to answer these questions. Joseph has 14 years of experience in SRE, 12 of them at Google. His insider's insights are uniquely incisive, multi-disciplinary, and empathetic, linking the significance of SRE to both business and engineering.

Incident Management

SLO Adoption at Twitter

Jul 1, 2020 By Blameless Community In Blameless

This is the second article of a two-part series. Click here for part 1 of the interview with Brian, Carrie, JP, and Zac to learn more about Twitter’s SRE journey. Previously, we saw how SRE at Twitter has transformed their engineering practice to drive production readiness at scale. The concepts of service level objectives (SLOs) and error budgets have been key to this transformation, as SLOs shape an organization’s ability to make data-oriented decisions around reliability.
