SRE

The latest news and information on Site Reliability Engineering and related technologies.

SRE Roles and Responsibilities Defined

Jul 6, 2022 By Myra Nizami In Blameless

SRE is a practice that creates a bridge between operations and development. We discuss the roles and responsibilities of a site reliability engineer.

Improving Team Health With Reliably

Jul 2, 2022 By Sylvain Hellegouarch In Reliably

Improving team health within DevOps is vital for success in any engineering team. In this article, we’ll look at some of the ways that you can improve team health with Reliably so you can keep your developers happier, healthier and free from burnout.

Top Five Pitfalls of On-Call Scheduling

Jun 30, 2022 By Squadcast Community In Squadcast

On-call schedules ensure that someone is available day and night to fix or escalate any issues that arise, which helps keep things running smoothly. On-call workers can be anyone from nurses and doctors required to respond to emergencies to IT and software engineering staff who need to fix service outages or significant bugs. Being on call can be challenging and stressful. But with the proper practices in place, on-call schedules can fit well into an employee's work-life balance while still meeting the organization's needs.

Why More Incidents Are Better

Jun 30, 2022 By Andre King In Rootly

Ask most SREs how many incidents they’d have to respond to in a perfect world, and their answer would probably be “zero.” After all, making software and infrastructure so reliable that incidents never occur is the dream that SREs are theoretically chasing. Reducing actual incidents by as much as possible is a noble goal. However, it’s important to recognize that incidents aren’t an SRE’s number one enemy.

Are you doing SRE wrong? 4 questions to ask

Jun 29, 2022 By Auri Poso In Aiven

SRE requires teamwork and planning. Be like Aiven and get it right.

Development Velocity (And How To Balance Reliability)

Jun 29, 2022 By Noor-ul-Anam Ruqayya In Blameless

Wondering about development velocity? We explain what development velocity is, how to measure it, and how to balance the need for fast development and reliable products.

How Does Chaos Engineering Work?

Jun 28, 2022 By Aimee Pearcy In Reliably

Chaos testing is a way to test the integrity of a system. Its purpose is to simulate failures that could crash a production system in a controlled environment. This helps to identify failures before they cause unplanned downtime that disrupts the user experience. Unlike standard testing, which tests a system response against a predefined result, chaos testing does not have a predefined result. Rather, the entire purpose of the experiment is to find out new information about the system.

Distributed Caching on Cloud

Jun 27, 2022 By Rajiv Srivastava In Squadcast

Distributed caching is an important aspect of cloud-based applications, whether they run in on-premises, public, or hybrid cloud environments. It facilitates incremental scaling, allowing the cache to grow as the data grows. In this post we will explore distributed caching on cloud and why it is useful for environments with high data volume and load.

Lightstep Notebooks helps speed troubleshooting for SREs and developers

Jun 27, 2022 By Ben Sigelman In ServiceNow

Digital business is an imperative for 21st-century companies. Increasingly, organizations are directing investments toward technologies that deliver outcomes fast and enable more resilient digital business models. In this landscape, incidents such as software bugs, power outages, or downed networks have major consequences that affect both revenue and customer loyalty.

How To Prepare for a Site Reliability Engineer (SRE) Interview

Jun 27, 2022 By Stephen Watts In Splunk

Site reliability engineering continues to gain traction in software development and IT. SRE is at the crossroads of software development and IT operations. In Ben Treynor’s words, SRE is “what happens when you ask a software engineer to design an operations function.” Site reliability engineering is a way for developers to actively build services and functions to improve the resilience of people, processes and technical systems.
