Latest Posts

How to Classify Incidents

Jul 8, 2020 By Emily Arnott In Blameless

Incident classification is a standardized way of organizing incidents into established categories. Incidents include outages caused by errors in code, hardware failures, and resource deficits: anything that disrupts normal operations. Each new incident should be assigned a category based on the areas of the service affected and a rank based on its severity. Each classification should have an established response procedure associated with it.
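As a rough illustration of the idea, the sketch below pairs an affected area with a severity rank and looks up a response procedure for that pair. The severity levels, area names, and procedures are all hypothetical placeholders, not taken from the post.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical ranking; a lower number means a more severe incident.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class Classification:
    area: str           # affected area of the service, e.g. "checkout"
    severity: Severity  # rank of the incident's severity


# Hypothetical mapping from a classification to its response procedure.
RESPONSE_PROCEDURES = {
    Classification("checkout", Severity.SEV1): "Page on-call and open an incident channel",
    Classification("checkout", Severity.SEV3): "File a ticket for the next business day",
    Classification("search", Severity.SEV2): "Notify the owning team within one hour",
}

# Fallback for classifications without an established procedure.
DEFAULT_PROCEDURE = "Escalate to the incident commander for manual triage"


def response_for(area: str, severity: Severity) -> str:
    """Look up the established procedure for an incident's classification."""
    return RESPONSE_PROCEDURES.get(Classification(area, severity), DEFAULT_PROCEDURE)
```

Calling `response_for("checkout", Severity.SEV1)` returns the paging procedure, while an unlisted pair such as `("billing", Severity.SEV2)` falls back to manual triage.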

Google Cloud OnAir with CEO Ashar Rizqi: Benefits of Cloud Infrastructure

Jul 7, 2020 By Blameless Community In Blameless

CEO Ashar Rizqi had the pleasure of being a guest on Google Cloud OnAir, a Google Cloud Customer Interview Series. Ashar and interviewer Jimmy Sopko discussed how Blameless has extended its runway using Google Cloud and Google Kubernetes Engine, and how the team cultivates a culture of site reliability in a changing world.

Blameless' SRE Journey

Jul 6, 2020 By Blameless Community In Blameless

SRE is a practice adopted by best-in-class companies all over the world. As a software reliability platform purpose-built for SREs, Blameless strives to practice what it preaches and uses SRE best practices daily to cultivate a culture of resilience. However, this wasn’t always the case.

SRE Leaders Panel: Managing Systems Complexity

Jul 2, 2020 By Blameless Community In Blameless

In our previous panel, we spoke about how to overcome imposter syndrome in high-tempo situations and how culture directly affects the availability of our systems. Building on that discussion, we gathered leading minds in the resilience industry to discuss how SRE can manage systems complexity, and how that is tightly intertwined with business health, especially in the context of current health and social crises.

SLO Adoption at Twitter

Jul 1, 2020 By Blameless Community In Blameless

This is the second article of a two-part series. Part 1 of the interview with Brian, Carrie, JP, and Zac covers Twitter’s SRE journey. Previously, we saw how SRE at Twitter has transformed its engineering practice to drive production readiness at scale. The concepts of service level objectives (SLOs) and error budgets have been key to this transformation, as SLOs shape an organization’s ability to make data-oriented decisions around reliability.

Twitter's Reliability Journey

Jun 30, 2020 By Blameless Community In Blameless

Twitter’s SRE team is one of the most advanced in the industry, managing the services that capture the pulse of the world every single day and throughout the moments that connect us all. We had the privilege of interviewing Brian Brophy, Sr. Staff SRE; Carrie Fernandez, Head of Site Reliability Engineering; JP Doherty, Engineering Manager; and Zac Kiehl, Sr. Staff SRE, to learn how SRE is practiced at Twitter.

How SLIs Help You Understand Users' Needs

Jun 29, 2020 By Emily Arnott In Blameless

In our article on SLOs, we discussed the need for service level indicators to be relevant to the users’ experience. By consolidating a number of internal metrics into one indicator that reflects the typical use of the service, we can ensure that meeting our SLO means keeping users happy. A good way to think about this is by looking at the user’s experience or journey.

Reduce Engineering Problems with a Resiliency Mindset

Jun 26, 2020 By Hannah Culver In Blameless

Resiliency isn’t something that just happens; it’s a result of dedication and hard work. To reach your optimal state of resilience, there are some crucial SRE best practices you should adopt to strengthen your processes.

Top Practices for Runbook Automation

Jun 26, 2020 By Emily Arnott In Blameless

Runbooks, also known as playbooks, are documents that walk you through a certain task with specific steps. For example, a runbook for spinning up a new server might ask some questions about the purpose of the server and its estimated load, then lead you to the appropriate instructions and settings. Runbooks ease the cognitive load of these common tasks by clearly outlining the process for each.
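The server example above could be sketched in code roughly as follows. This is only an illustration under assumed inputs: the function name, the purposes, the load threshold, and the step wording are all hypothetical.

```python
def pick_server_runbook(purpose: str, estimated_rps: int) -> list[str]:
    """Return runbook steps for a new server based on two answers.

    purpose and estimated_rps stand in for the runbook's questions about
    the server's purpose and its estimated load (requests per second).
    """
    # Hypothetical threshold for choosing an instance size.
    size = "large" if estimated_rps > 1000 else "small"
    steps = [f"Provision a {size} instance"]

    # Branch to the instructions that match the stated purpose.
    if purpose == "database":
        steps.append("Attach persistent storage and enable backups")
    elif purpose == "web":
        steps.append("Register the instance with the load balancer")
    else:
        steps.append("Follow the general-purpose setup checklist")

    steps.append("Add the server to monitoring")
    return steps
```

Encoding the questions as parameters keeps the branching explicit, which is the same cognitive-load benefit the written runbook provides.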

SRE: A Human Approach to Systems

Jun 25, 2020 By Hannah Culver In Blameless

In the world of technology, the stakes have never been higher. The move to the cloud and microservices to maximize agility has given rise to digital disruptors and unprecedented competitive threats. As distributed systems become increasingly complex, the scale of ‘unknown unknowns’ increases. On top of this, customer expectations are sky-high. The cost of downtime is catastrophic, with customers willing to churn if their needs are not promptly met.
