The latest News and Information on Service Reliability Engineering and related technologies.

Tales from the Toil: Taking the pulse of SRE

Aug 9, 2022 By Sam Fell In Sumo Logic

Site Reliability Engineering (SRE) is a growing practice essential for enterprises to ensure service delivery, reliability, and access for users. Many companies only choose to invest in SRE when they have a raging operational fire on their hands. As a result, SREs often start out as firefighters, desperately trying to keep the service online for one more day.

How to Become a Site Reliability Engineer: Job Description, Roles & Responsibilities

Aug 8, 2022 By Emiliano Pardo Saguier In InvGate

Site Reliability Engineering (SRE) is still going strong in the world of software development. As a bridge between development and operations, it’s a necessary part of any organization that wants to work like a well-oiled machine. Simply put, SRE tries to fix a widespread problem in organizations: siloing. But not much is known about the job requirements for becoming a site reliability engineer.

Anti-patterns in Incident Response that you should unlearn

Aug 2, 2022 By Vishal Padghan In Squadcast

It is important to invest time and effort in understanding why a system performs the way it does and how we can improve it. Companies continue with practices that yield successful results, but ignoring anti-patterns can be far worse than choosing rigid processes. In this blog, we will explore anti-patterns in incident response and why you should unlearn them.

Analytics in Squadcast | Incident Management | On-call | SRE | Squadcast

Aug 1, 2022 By Squadcast In Squadcast

Analyzing incident data plays a key role in doing SRE better. Squadcast's Analytics Dashboard helps you analyze the performance of your organization or team over a given time period. It also gives you more insight into past outages that affected your systems.

Classifying Severity Levels for Your Organization

Jul 29, 2022 By Nir Sharma In Squadcast

Major outages are bound to occur in even the most well-maintained infrastructure and systems. Being able to quickly classify the severity level allows your on-call team to respond more effectively. Imagine a scenario where your on-call team is getting critical alerts every 15 minutes, user complaints are piling up on social media, and, since your platform is inoperative, revenue losses are mounting every minute. How do you go about getting your application back on track? This is where understanding incident severity and priority can be invaluable. In this blog, we look at severity levels and how they can improve your incident response process.

Site Reliability Engineering (SRE) explained

Jul 29, 2022 By Emiliano Pardo Saguier In InvGate

Google has introduced so many innovations that it’d be impossible to list them all. And we’re not just talking about the obvious things like search engine algorithms or nearly ubiquitous programs and apps (Google Maps, Docs, Gmail), or even self-driving cars. Today, we’re going to talk about one such innovation: Site Reliability Engineering. In a nutshell, SRE is a practical framework for software development that improves on even giants like DevOps. Wait, what?

Managing the Looker ecosystem at scale with SRE and DevOps practices

Jul 29, 2022 By Saurabh Bangad In Google Operations

Many organizations struggle to create data-driven cultures where each employee is empowered to make decisions based on data. This is especially true for enterprises with a variety of systems and tools in use across different teams. If you are a leader, manager, or executive focused on how your team can leverage Google's SRE practices or wider DevOps practices, you are definitely in the right place!

Setting up Runbooks in Squadcast | SRE Best Practices | Squadcast

Jul 29, 2022 By Squadcast In Squadcast

A Runbook is a compilation of routine procedures and operations that are documented for reference while working on a critical incident. It is sometimes also referred to as a Playbook. In this video, learn to create, attach, and reference Runbooks and to mark progress toward incident resolution with them.

Introduction to reliability management

Jul 29, 2022 By Sumo Logic In Sumo Logic

Ensuring your digital customer experiences are exceptional is a goal of any modern business. However, managing the reliability of ever more complex applications is a challenge. Developers are releasing new capabilities in fast-moving sprints and the business wants maximum velocity with minimal risk. SRE teams create a structure of continuous improvement that focuses on ensuring the application is reliable above all else.

Why SREs Need to Embrace Chaos Engineering

Jul 20, 2022 By xMatters In xMatters

Reliability and chaos might seem like opposite ideas. But, as Netflix learned in 2010, introducing a bit of chaos—and carefully measuring the results of that chaos—can be a great recipe for reliability. Although most software is created in a tightly controlled environment and carefully tested before release, the production environment is harsher and much less controlled.
