The latest News and Information on Service Reliability Engineering and related technologies.

Sponsored Post

Site Reliability Engineering: Definition, Principles & How It Differs From DevOps

Site crashes and outages can cost hundreds of thousands in lost revenue and inconvenience users. Site Reliability Engineering helps build highly reliable and scalable systems, which is particularly important for companies whose customers depend on their software to perform critical operations. Hiring a Site Reliability Engineer is the best way to ensure a software system stays up and running at all times. Not only will they help manage infrastructure and applications, but they'll also be able to advise on how to scale a business as it grows, keeping downtime and incidents to a minimum!

Uptime + Squadcast Integration: Routing Alerts Made Easy

Uptime is a site monitoring solution used to reach various endpoints and notify users via push notifications when downtime is detected. It collects and stores downtime and response time data, which is then made available to users as reports. If you use Uptime for your monitoring needs, you can now integrate it with Squadcast to route detailed alerts from Uptime to the right users in Squadcast. The steps below will help you set up the Uptime and Squadcast integration.

geeks+gurus: Rise of SRE - Survey Insights

Site Reliability Engineering (SRE) continues to rise in adoption. Teams that leverage SRE “good” practices are benefitting, individuals are excited about their jobs, and IT and the business are collaborating more efficiently. Sound interesting? We hope so, as there are a few key insights you should know. Join us to learn more about the exciting journey of SRE. We have partnered with DevOps Institute (DOI) to conduct their inaugural 2022 Global SRE Pulse Survey, and we are excited to share the pulse on SRE.

Comparing DBA, DBRE, and SRE Roles

As I navigate further into my career, I’m finding the scope of my role has shifted over the years. I thought I’d take some time to help relay the differences I’ve seen between traditional database administrators (DBAs), database reliability engineers (DBREs), and site reliability engineers (SREs). Before I start, I want to get a disclaimer out of the way: some of the comparisons here reflect only what I’ve seen and may not match what you’ve experienced.

Tales from the Toil: Taking the pulse of SRE

Site Reliability Engineering (SRE) is a growing practice essential for enterprises to ensure service delivery, reliability, and access for users. Many companies only choose to invest in SRE when they have a raging operational fire on their hands. As a result, SREs often start out as firefighters, desperately trying to keep the service online for one more day.

How to Become a Site Reliability Engineer: Job Description, Roles & Responsibilities

Site Reliability Engineering (SRE) is still going strong in the world of software development. As a bridge between development and operations, it’s a necessary part of any organization that wants to work like a well-oiled machine. Simply put, SRE tries to fix a widespread problem in organizations: siloing. But not much is known about the job requirements for becoming a site reliability engineer.

Anti-patterns in Incident Response that you should unlearn

It is important to invest time and effort in understanding why a system performs the way it does and how we can improve it. Companies continue with practices that yield successful results, but ignoring anti-patterns can be far worse than choosing rigid processes. In this blog we will explore anti-patterns in incident response and why you should unlearn them.

Analytics in Squadcast | Incident Management | On-call | SRE | Squadcast

Analyzing incident data plays a key role in doing SRE well. Squadcast's Analytics Dashboard helps you analyze the performance of your organization or team over a given time period. It also gives you more insight into past outages that affected your systems.

Sponsored Post

Classifying Severity Levels for Your Organization

Major outages are bound to occur in even the most well-maintained infrastructure and systems. Being able to quickly classify the severity level of an incident allows your on-call team to respond more effectively. Imagine a scenario where your on-call team is getting critical alerts every 15 minutes, user complaints are piling up on social media, and, since your platform is inoperative, revenue losses are mounting every minute. How do you go about getting your application back on track? This is where understanding incident severity and priority can be invaluable. In this blog we look at severity levels and how they can improve your incident response process.

Site Reliability Engineering (SRE) explained

Google has introduced so many innovations that it’d be impossible to list them all. And we’re not just talking about the obvious things like search engine algorithms or nearly ubiquitous programs and apps (Google Maps, Docs, Gmail), or even self-driving cars. Today, we’re going to talk about one such innovation: Site Reliability Engineering. In a nutshell, SRE is a practical framework for software development that improves on even giants like DevOps. Wait, what?