Latest News

blameless

Leadership and Innovation with Instacart's VP of Infrastructure

Blameless CEO Ashar Rizqi recently had the pleasure of interviewing Dustin Pearce in a virtual executive fireside chat and AMA. Dustin is an experienced leader in scaling hyper-growth, cloud-native companies: he is the VP of Infrastructure at Instacart and previously served as Head of Service Engineering at Slack.

Promoting Continuous Learning with SRE

With the extreme changes we’ve all been through these last several months, it should come as no surprise that our jobs have changed drastically, too. We’re working remotely. We’re dealing with increased resource constraints. Our services are receiving more traffic than usual, and we’re tasked with keeping things up and running. Our work-as-done may not match what we did at the beginning of 2020.

Using Automation and SLOs to Create Margin in your Systems

With the challenges we’re facing during this time, it can be difficult to keep up with the growing demand for our services. You need to use every tool in your toolbelt to conserve your team’s cognitive resources. Two ways to do this are automating toil out of your processes and prioritizing work with SLOs.

How to Classify Incidents

Incident classification is a standardized way of organizing incidents into established categories. Incidents can include outages caused by errors in code, hardware failures, or resource deficits: anything that disrupts normal operations. Each new incident should be assigned a category based on the areas of the service affected and a rank based on its severity. Each classification should have an established response procedure associated with it.
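As a rough illustration of the idea, the sketch below pairs a category with a severity rank and looks up a response procedure for that pair. Everything in it is an assumption made for the example: the `Severity` levels, the impact thresholds, the category names, and the procedure text are hypothetical, not taken from the article.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical ranking: a lower number means a more severe incident.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class Classification:
    category: str       # area of the service affected
    severity: Severity  # rank of the incident's severity


# Hypothetical response procedures keyed by (category, severity).
RESPONSE_PROCEDURES = {
    ("checkout", Severity.SEV1): "Page on-call and open an incident channel",
    ("checkout", Severity.SEV2): "Notify on-call and track in the team queue",
}

# Fallback for classifications without a dedicated procedure (assumed).
DEFAULT_PROCEDURE = "Triage during business hours"


def classify(affected_area: str, users_impacted_pct: float) -> Classification:
    """Assign a category and severity using assumed impact thresholds."""
    if users_impacted_pct >= 50:
        severity = Severity.SEV1
    elif users_impacted_pct >= 5:
        severity = Severity.SEV2
    else:
        severity = Severity.SEV3
    return Classification(category=affected_area, severity=severity)


def response_for(classification: Classification) -> str:
    """Look up the response procedure for a classification."""
    key = (classification.category, classification.severity)
    return RESPONSE_PROCEDURES.get(key, DEFAULT_PROCEDURE)
```

For example, `response_for(classify("checkout", 80.0))` would return the SEV1 procedure for the hypothetical `checkout` category, while an unlisted combination falls back to `DEFAULT_PROCEDURE`. A real scheme would define its own categories, thresholds, and procedures.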

Google Cloud OnAir with CEO Ashar Rizqi: Benefits of Cloud Infrastructure

CEO Ashar Rizqi had the pleasure of being a guest on Google Cloud OnAir, a Google Cloud Customer Interview Series. Ashar and interviewer Jimmy Sopko discussed how Blameless has extended our runway using Google Cloud and Google Kubernetes Engine and how the team cultivates a culture of site reliability in a changing world.

SRE Leaders Panel: Managing Systems Complexity

In our previous panel, we spoke about how to overcome imposter syndrome in high-tempo situations and how culture directly affects the availability of our systems. Building on that discussion, we gathered leading minds in the resilience industry to discuss how SRE can manage systems complexity, and how that is tightly intertwined with business health, especially in the context of the current health and social crises.

SLO Adoption at Twitter

This is the second article of a two-part series. Click here for part 1 of the interview with Brian, Carrie, JP, and Zac to learn more about Twitter’s SRE journey. Previously, we saw how SRE at Twitter has transformed their engineering practice to drive production readiness at scale. The concepts of service level objectives (SLOs) and error budgets have been key to this transformation, as SLOs shape an organization’s ability to make data-oriented decisions around reliability.
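To make the relationship between an SLO and its error budget concrete, here is a minimal sketch of the usual request-based arithmetic: the budget is the share of requests the SLO allows to fail. The function names and the sample numbers are assumptions for illustration only and do not describe Twitter's implementation.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests allowed to fail under the SLO.

    slo_target is a fraction such as 0.999 (99.9% of requests succeed).
    """
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target or zero traffic leaves no budget to spend.
        return 0.0
    return 1.0 - failed_requests / budget
```

With a 99.9% target over 1,000,000 requests, the budget is about 1,000 failed requests; after 250 failures, roughly three quarters of the budget remains. A team might use that remaining fraction as one data point when deciding whether to ship features or prioritize reliability work.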