Operations | Monitoring | ITSM | DevOps | Cloud

February 2021

QA Engineers, This is How SRE will Transform your Role

When implementing SRE, almost every role within your IT organization will change. One of the biggest transformations will be in your Quality Assurance teams. A common misconception is that SRE “replaces” QA. People believe SLOs and other SRE best practices render the traditional role of QA engineering obsolete, as testing and quality shift left in the SDLC. This leads to QA teams resisting SRE adoption.

Getting Started as an SRE? Here are 3 Things You Need to Know.

We live in the era of reliability. The most important feature for a service is how dependable it is in the eyes of a user. Companies are hiring with this in mind. In a 2019 LinkedIn article, site reliability engineers were listed as the 2nd most promising career in the United States. But how do you get started as an SRE? In this blog post, we’ll look at: SRE is a multifaceted role. You will contribute to an organization's code base, policy, culture, and more.

4 Things you Need to Know about Writing Better Production Readiness Checklists

When we think of reliability tools, we may overlook the humble checklist. While tools like SLOs represent the cutting edge of SRE, checklists have been recommended in many industries such as surgery and aviation for almost a century. But checklists owe this long and widespread adoption to their usefulness. Checklists can also help limit errors when deploying code to production. In this blog post, we’ll cover: Production checklists should be holistic.

4 Tips on Preparing for a [Great] Failure

The most essential lesson of SRE is that failure is inevitable. This shouldn’t be a cause for despair. SRE shows how embracing failure is empowering. By celebrating failure, you can accelerate development and foster a culture of learning. Rather than hoping to prevent failure, SRE prepares you to respond well to it. It can be difficult, if not impossible, to anticipate where failure will occur in complex systems given unknown unknowns.

Communication Tool Down? Here are 3 Ways to Handle it

January 4th, 2021, the communication service Slack suffered a major outage. Teams working remotely found their primary communication method unavailable. The incident lasted over 4 hours, during which some customers had intermittent or delayed service, and others had no service at all. It was a reminder that even the most established tools are susceptible to downtime. This is a core lesson of SRE: that failure is inevitable.

"I'm Just Doing my Job," An SRE Myth

"Sorry, but I'm just doing my job." I heard this recently from a customer service representative. What they were saying made sense (afterall, we don’t have total control over our work environments), but it felt wrong. As a customer, I was left dissatisfied with our interaction. However, the representative assured me that they were simply following protocol. This got me thinking: can established practices and protocols sometimes get in the way of excellent customer experience?