June 2020

Twitter's Reliability Journey

Twitter’s SRE team is one of the most advanced in the industry, managing the services that capture the pulse of the world every single day and throughout the moments that connect us all. We had the privilege of interviewing Brian Brophy, Sr. Staff SRE; Carrie Fernandez, Head of Site Reliability Engineering; JP Doherty, Engineering Manager; and Zac Kiehl, Sr. Staff SRE, to learn how SRE is practiced at Twitter.

How SLIs Help You Understand Users' Needs

In our article on SLOs, we discussed the need for service level indicators to be relevant to the users’ experience. By consolidating a number of internal metrics into one indicator that reflects the typical use of the service, we can ensure that meeting our SLO means keeping users happy. A good way to think about this is by looking at the user’s experience or journey.
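As a rough illustration of that idea, the sketch below rolls per-step measurements from a user journey into a single indicator: the fraction of journeys in which every step succeeded and stayed under a latency threshold. The names `StepResult`, `journey_is_good`, and `journey_sli`, along with the 500 ms threshold, are assumptions made for this example and are not taken from the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepResult:
    """One request within a user journey (hypothetical shape)."""
    ok: bool            # did the request succeed?
    latency_ms: float   # how long it took


def journey_is_good(steps: List[StepResult], latency_threshold_ms: float = 500.0) -> bool:
    # A journey counts as good only if every step succeeded and was fast enough.
    return all(s.ok and s.latency_ms <= latency_threshold_ms for s in steps)


def journey_sli(journeys: List[List[StepResult]], latency_threshold_ms: float = 500.0) -> float:
    # SLI = good journeys / total journeys.
    # Assumption: with no traffic we report 1.0, since no user was let down.
    if not journeys:
        return 1.0
    good = sum(journey_is_good(j, latency_threshold_ms) for j in journeys)
    return good / len(journeys)


if __name__ == "__main__":
    sample = [
        [StepResult(True, 100), StepResult(True, 200)],   # good
        [StepResult(True, 100), StepResult(False, 50)],   # failed step
        [StepResult(True, 900)],                          # too slow
        [StepResult(True, 10)],                           # good
    ]
    print(journey_sli(sample))
```

Counting whole journeys rather than individual requests is a design choice here: one failed step makes the whole journey bad, which is closer to how a user would judge the experience.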

Top Practices for Runbook Automation

Runbooks, also known as playbooks, are documents that walk you through a certain task with specific steps. For example, a runbook for spinning up a new server might ask some questions about the purpose of the server and its estimated load, then lead you to the appropriate instructions and settings. Runbooks ease the cognitive load of these common tasks by clearly outlining the process for each.

SRE: A Human Approach to Systems

In the world of technology, the stakes have never been higher. The move to the cloud and microservices to maximize agility has given rise to digital disruptors and unprecedented competitive threats. As distributed systems become increasingly complex, the scale of ‘unknown unknowns’ increases. On top of this, customer expectations are sky-high. The cost of downtime can be catastrophic, with customers willing to churn if their needs are not promptly met.

Best Practices for Effective Incident Management

Incident management is a set of processes used by operations teams to respond to latency or downtime, and return a service to its normal state. Incident management practices have long been well-defined through frameworks such as ITIL, but as software systems become more complex, teams increasingly need to adapt their incident management processes accordingly.

Announcing our new integration with GoToMeeting

Communication during incidents is critical. With the rise of remote work, war rooms are no longer the central hub for all incident communication. Instead, we’re adapting to these new challenges and embracing video conferencing and messaging software to stay in lockstep with our teammates and collaborators. With this in mind, we are excited to announce that Blameless is adding a new way for you to communicate even faster and more effectively.

A Journey Through Blameless from Incident to Success

Here at Blameless, every aspect of our product has SLOs (Service Level Objectives) and error budgets in order to help us understand and improve customer experience. Sometimes, these error budgets are at risk, triggering an incident. While incidents are often painful, we treat them as unplanned investments, striving to learn as much as we can from them. We empower all of our engineers to handle an on-call rotation, no matter how difficult the issue.
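To make "error budget at risk" concrete, here is a minimal sketch of the usual arithmetic: an SLO target implies an allowed number of bad events, and the remaining budget is the share of that allowance not yet spent. The function name `error_budget_remaining` and the sample numbers are assumptions for illustration and do not describe how Blameless computes its own budgets.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        # Assumption: with no traffic, none of the budget has been consumed.
        return 1.0
    # Bad events the SLO tolerates over this window.
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


if __name__ == "__main__":
    # Hypothetical window: 99.9% target, one million requests, 250 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"{remaining:.0%} of the error budget remains")
```

A team might alert when the remaining fraction drops below some threshold of its choosing; the threshold itself is a policy decision rather than part of the calculation.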

SRE Leaders Panel: Work as Done vs Work as Imagined

Blameless recently had the privilege of hosting some fantastic leaders in the SRE and resilience community for a panel discussion. Our panelists discussed the effects of imposter syndrome, especially during high-tempo situations, how to use it to our advantage and overcome doubt, and how culture directly affects the availability of our systems. The transcript below has been lightly edited, and if you’re interested in watching the full panel, you can do so here.