June 2022

Top Five Pitfalls of On-Call Scheduling

Jun 30, 2022 By Squadcast Community In Squadcast

On-call schedules ensure that there's someone available day and night to fix or escalate any issues that arise. Using an on-call schedule helps keep things running smoothly. These on-call workers can be anyone from nurses and doctors required to respond to emergencies to IT and software engineering staff who need to fix service outages or significant bugs. Being on-call can be challenging and stressful. But with the proper practices in place, on-call schedules can fit well into an employee's work-life balance while still meeting the organization's needs.

Why More Incidents Are Better

Jun 30, 2022 By Andre King In Rootly

Ask most SREs how many incidents they’d have to respond to in a perfect world, and their answer would probably be “zero.” After all, making software and infrastructure so reliable that incidents never occur is the dream that SREs are theoretically chasing. Reducing actual incidents by as much as possible is a noble goal. However, it’s important to recognize that incidents aren’t an SRE’s number one enemy.

Are you doing SRE wrong? 4 questions to ask

Jun 29, 2022 By Auri Poso In Aiven

SRE requires teamwork and planning. Be like Aiven, get it right.

Development Velocity (And How To Balance Reliability)

Jun 29, 2022 By Noor-ul-Anam Ruqayya In Blameless

Wondering about development velocity? We explain what development velocity is, how to measure it, and how to balance the need for fast development and reliable products.

How Does Chaos Engineering Work?

Jun 28, 2022 By Aimee Pearcy In Reliably

Chaos testing is a way to test the integrity of a system. Its purpose is to simulate failures that could crash a production system in a controlled environment. This helps to identify failures before they cause unplanned downtime that disrupts the user experience. Unlike standard testing, which tests a system response against a predefined result, chaos testing does not have a predefined result. Rather, the entire purpose of the experiment is to find out new information about the system.

Lightstep Notebooks helps speed troubleshooting for SREs and developers

Jun 27, 2022 By Ben Sigelman In ServiceNow

Digital business is an imperative for 21st-century companies. Increasingly, organizations are directing investments toward technologies that deliver outcomes fast and enable more resilient digital business models. In this landscape, incidents such as software bugs, power outages, or downed networks have major consequences that affect both revenue and customer loyalty.

How To Prepare for a Site Reliability Engineer (SRE) Interview

Jun 27, 2022 By Stephen Watts In Splunk

Site reliability engineering continues to gain traction in software development and IT. SRE is at the crossroads of software development and IT operations. In Ben Treynor’s words, SRE is “what happens when you ask a software engineer to design an operations function.” Site reliability engineering is a way for developers to actively build services and functions to improve the resilience of people, processes and technical systems.

Eliminating Toil In SRE

Jun 27, 2022 By Mbaoma Mary In Reliably

Toil is a term coined by Google which describes the repetitive and tedious tasks associated with running a production service. Toil tends to be manual and devoid of any long-term value. Toil is not just ‘work I do not like to do’. Each time an engineer engages with a production system, it represents time devoted to toil. These types of tasks get worse as your service grows even more extensive. Site Reliability Engineers (SRE) should spend less time on toil.

Distributed Caching on Cloud

Jun 27, 2022 By Rajiv Srivastava In Squadcast

Distributed caching is an important aspect of cloud-based applications, whether they run in on-premises, public, or hybrid cloud environments. It facilitates incremental scaling, allowing the cache to grow as the data grows. In this blog we will explore distributed caching on cloud and why it is useful for environments with high data volume and load.

Webinar Recap: How to Avoid Being On Call With Under-Instrumented Tools

Jun 24, 2022 By Jessica Kerr In Honeycomb

“It’s too expensive!” “Do we really need another tool?” “Our APM works just fine.” With strapped tech budgets and an abundance of tooling, it can be hard to justify a new expense—or something new for engineers to learn. Especially when they feel their current tool does the job adequately. But, does it?

Product Roundup: New Blameless Features in June 2022

Jun 22, 2022 By Phoebe Wang In Blameless

Summer means things are heating up. And things are definitely heating up at Blameless! We’ve been hard at work delivering new features and capabilities to our customers, so today I wanted to share a quick summary of all the latest. Here are 4 exciting product updates that enhance the way teams manage incidents and deliver reliable products to their customers.

Making Reliability A Critical Service Of An Organization

Jun 21, 2022 By Aimee Pearcy In Reliably

As systems continue to become more complex, reliability is becoming an increasingly important requirement. Organizations are quickly realizing that making reliability a critical part of their service means that other organizations will be less likely to cut costs on them. As a result, the field of site reliability engineering (SRE) has grown rapidly over the past few years.

How to Establish Service Level Objectives In Software Engineering

Jun 19, 2022 By Samadrita Ghosh In Reliably

SLOs or Service Level Objectives are the foundation of Site Reliability Engineering (SRE). To correctly understand SLOs, the first step is to understand Service Level Indicators or SLIs. SLIs are metrics that measure the vitals of the service. These vitals are chosen based on two conditions. First, they are the features that the user is primarily concerned about. Second, they allow the engineering team to get an overview of the system’s health.

Continuous Validation: What Is It And Why Is It Important?

Jun 18, 2022 By Catrin Haberfield In Reliably

By investing in a CI/CD pipeline, it’s entirely possible to automate a large part of the software development life cycle – letting businesses deliver high-quality, high-efficiency outputs with a faster time to market. But there are multiple elements to the CI/CD process, including the all-seeing eye that is continuous validation. So what exactly is continuous validation, and why should software developers bother to engage with it?

Continuous Documentation In A CI/CD World

Jun 18, 2022 By Aimee Pearcy In Reliably

Continuous documentation is the process of creating and maintaining code documentation incrementally throughout a project in a way that seamlessly incorporates it into the development workflow. It is a key part of improving reliability within an organization. It’s not just new features that need to be documented – anything useful from bug fixes, to how to get started using the code should be documented. It should also be updated frequently to ensure that it stays relevant.

How To Build High-Performing Engineering Teams

Jun 18, 2022 By Charity Majors In Reliably

There is a distinctive gap opening up between the top engineers and the rest. The elite engineers represent the top few percent of engineering teams and are making incredible gains year on year in velocity, reliability, and human compatibility, whilst the bottom 50% are losing ground. The loss has nothing to do with engineering ability.

Demo Day | Discover Developer-First Reliability

Jun 16, 2022 By Reliably In Reliably

Join us for the very first edition of Demo Days. This month we meet the team at Reliably, who will give a live product demo showing how their platform can help you operate with greater predictability and less anxiety. See Reliably in action and discover developer-first reliability as one of their experts guides us through the product and its features.

DevOps
SRE

The value of blameless culture - from IC to C-Suite

Jun 16, 2022 By Tyler McGoffin In CircleCI

At CircleCI, CI has a second meaning: Continuous Improvement. We continuously seek out feedback not only to improve our code but to improve our processes and get better at our jobs along the way. This Continuous Improvement starts with one important company value: a blameless culture. Our blameless culture extends into every part of how we operate.

Retrospective Template (What They Are & How To Use One)

Jun 16, 2022 By Myra Nizami In Blameless

Wondering about retrospective templates? We give a complete explanation of what a retrospective should include and best practices for using one.

Site Reliability Engineering (SRE) Survey Now Open for 2022 - Calling All Reliability Practitioners and Leaders

Jun 14, 2022 By Catchpoint In Catchpoint

Now in its fifth year, the SRE Survey, sponsored by Catchpoint in partnership with Blameless, aims to uncover new trends and challenges for teams focused on advancing the reliability of digital products.

Squadcast Product Demo | Incident Management | On-call | SRE | Status Page | SLO Tracker | Runbooks

Jun 13, 2022 By Squadcast In Squadcast

This video explains why Squadcast is a feature-rich solution for SRE, DevOps, and Engineering teams in general. With the ability to help teams quickly mobilize response teams during critical incidents, easily manage on-call schedules, and track SLOs for better SRE, Squadcast is a multi-purpose platform with numerous capabilities. This short video covers everything the product is capable of.

OKR Culture: How To Build Service reliability With DevOps Teams & OKRs

Jun 12, 2022 By Mbaoma Mary In Reliably

OKR stands for Objectives and Key Results. OKRs are a framework for establishing and monitoring goals and outcomes. They also facilitate discussions regarding the alignment of an employee’s job with the company’s objectives. Many companies such as Google use OKRs to improve engineering team culture and productivity. OKRs require a strong, open, and creative workplace culture to take root. OKRs offer you focus, alignment, commitment, and goal tracking.

SecDevOps: Understanding Shift Left Security

Jun 12, 2022 By Mika Boström In Reliably

No buzzwords were harmed in the making of this post. Let’s take one of the most overloaded terms, DevOps, and mix it with the haziest of topics, security. What do you get, apart from confusion? SecDevOps. Or maybe it’s DevSecOps. If you’re not sure what either means, you’re not alone. Even the industry at large can’t decide what they should call it. And so they - we - came up with a new term altogether.

Setting up Route 53 Health Checks

Jun 10, 2022 By Vishal Padghan In Squadcast

We live in an age where the internet and digital data drive modern day markets, which results in huge amounts of data being generated and consumed. Hence, it has become very important for online platforms to manage this traffic and serve their customers more efficiently. In this blog we will explore the Amazon Route 53 service and see how it addresses domain name system routing and health check problems.

Incident vs. Problem [Understanding the Differences]

Jun 9, 2022 By Myra Nizami In Blameless

Curious about incidents vs. problems? We explain the differences and how to handle each one.

What Do You Monitor In A Distributed System?

Jun 9, 2022 By Mbaoma Mary In Reliably

Distributed systems are responsible for many different tasks and processes that need to be monitored and managed. In this article, we will explore what you should monitor in a distributed system, including network communication, resources, and performance.

10 Ways You Can Improve System Reliability

Jun 8, 2022 By Aimee Pearcy In Reliably

System reliability is the probability that a system performs as it is expected to under a set of specified conditions throughout a specified period. Organizations use reliability engineering to help to make products more reliable in a cost-effective way. The key objectives of reliability engineering are to reduce the frequency of failures, identify the causes of failures and correct them, figure out ways of coping with failures when they do occur, and estimate the likely reliability of new designs.

Calling all Reliability Practitioners: Participate in the 2022 SRE Survey

Jun 8, 2022 By Kurt Andersen In Catchpoint

For the past four years, Catchpoint and various partners have been running a yearly SRE Survey. This year, Blameless is excited to partner with Catchpoint for the fifth annual survey. We want to hear from you if you are in a DevOps or SRE role or even if you work on reliability with some other title or role. There are tremendous, valuable learnings when we listen closely to practitioners.

Calling all Reliability Practitioners: Participate in the SRE Survey 2022

Jun 8, 2022 By Kurt Andersen In Blameless

Incident Priority Matrix (Understanding Impact and Urgency)

Jun 7, 2022 By Noor-ul-Anam Ruqayya In Blameless

Curious about the incident priority matrix? We discuss how to determine the impact and urgency of an incident, and how to create a matrix that helps prioritize incidents.

Squadcast + OSNexus QuantaStor Integration: Making Incident Management & Alerting more effective

Jun 2, 2022 By Vishal Padghan In Squadcast

Storage systems are an integral part of IT infrastructure. Given that modern markets are highly competitive and demanding, businesses strive for 24/7 availability. This in turn sets higher expectations for storage systems to be operational all the time. But just like other IT components, even storage systems are prone to incidents. Hence, it is important to have an efficient communication process, to manage alerts during system failures/disasters.

Software Engineers vs Site Reliability Engineering Explained

Jun 2, 2022 By Myra Nizami In Blameless

We discuss what software engineers and site reliability engineering are and explain their differences and their importance in the software development process.

5 Reliability Insights That Immediately Transform Your SRE

Jun 1, 2022 By Emily Arnott In Blameless

As infrastructure engineers, there’s so much you can learn from studying past incidents. Luckily, Blameless Reliability Insights helps you find patterns that better equip you to deal with incidents to come. If you’ve never used it before and you’re curious what it looks like, you can watch a video demo here! These statistical insights give you the power to learn everything you can when something goes wrong.
