March 2023

The Guide to SRE Principles

Mar 31, 2023 By Squadcast Community In Squadcast

Site reliability engineering (SRE) is a discipline in which automated software systems are built to manage the development operations (DevOps) of a product or service. In other words, SRE automates the functions of an operations team via software systems. The main purpose of SRE is to encourage the deployment and proper maintenance of large-scale systems.

Komodor + Squadcast Integration: Simplifying Kubernetes Monitoring & Incident Response

Mar 30, 2023 By Abhishek Sony In Squadcast

Kubernetes (K8s) is a powerful tool for container orchestration, but it presents unique challenges when it comes to monitoring and incident response. Managing K8s requires 360º visibility into your environment and proactive health monitoring, along with the right incident management and suppression capabilities. In this article, we'll explore the benefits of integrating Squadcast with Komodor, two powerful tools that can help you overcome these challenges.

The neglected tech arctic winter - Internal SaaS expenses

Mar 30, 2023 By Nishant Modak In Last9

The current tech winter has a number of glaring stories. Cyclical as they may be, one truth stands out more than the rest: the money spent on internal software tools to support tech infrastructure is bloated. And there’s nothing cyclical about this infrastructure spending.

Recap of SRECon Americas 2023

Mar 29, 2023 By Sebastian Vietz In Last9

A Recap of SRECon 2023 Americas by guest author Sebastian Vietz.

Announcing our improved Slack integration

Mar 28, 2023 By Vishal Padghan In Squadcast

Slack is one of the most widely used messaging apps, providing collaboration and chat solutions to businesses. We at Squadcast understand that most of your work happens over Slack, so we have made a number of UI and functional improvements to our Slack integration. This blog gives you an overview of the latest improvements, which we hope will help with collaboration and Incident Management.

Bring Order to On-call Chaos With Splunk Incident Intelligence

Mar 27, 2023 By Annette Sheppard In Splunk

In today’s turbulent times, companies big and small are being pushed to do more with less. Budgets are getting tighter and companies are being pressured to serve customers who demand 24/7 availability from their applications and services. To meet these demands and remain competitive, enterprises are adopting cloud-first strategies and developing applications with microservice architectures.

Five Trends from SREcon Americas 2023

Mar 27, 2023 By Gavin Cahill In Gremlin

Last week, over five hundred SREs gathered in Santa Clara to share the latest research, tips, tricks, best practices, and more for site reliability engineering. They were joined by some of the biggest names in the reliability space. And, yes, Gremlin was there to answer any and all questions about chaos engineering and proactive reliability. After three days of great conversations and insightful talks, let’s take a look at some of the themes we heard weaving through SREcon.

The Evolution of Incident Management from On-Call to SRE

Mar 24, 2023 By Vardhan NS In Squadcast

Incident Management has evolved considerably over the last couple of decades. Traditionally limited to just an on-call team and an alerting system, it now includes automated Incident Response combined with a complex set of SRE workflows.

What is SRE? Explained in 6 minutes

Mar 24, 2023 By Sematext In Sematext

In this video, we will introduce you to SRE (Site Reliability Engineering) and explain what it is, why it's important, and how you can get started with it. We'll cover everything you need to know to get started with SRE, from the basics of SRE concepts and principles to the tools and skills you'll need to succeed.

Get data-driven executive communication out of the box with Reliability Insights

Mar 23, 2023 By Alex Greer In Blameless

Blameless’s comprehensive incident management platform is built to ease the burden of keeping your services up and running. Whether you are in the middle of an incident or trying to better track your response performance, you need access to your incident data on demand. Blameless’s Reliability Insights unifies your Incident, Resource, Task, and IAM data in a single customizable and queryable analytics tool.

A Kubernetes Observability Tool to Support SRE Best Practices

Mar 22, 2023 By Lisa Wells In StackState

Kubernetes can be tough to troubleshoot and remediate fast, especially when you have many interdependent services. This blog, part 3 of 3 in the “8 SRE Best Practices to Help Developers Troubleshoot Kubernetes” series, describes the Kubernetes observability foundation StackState has built to support SRE best practices and enable rapid remediation of issues.

What is MTBI?

Mar 20, 2023 By Last9 In Last9

Everything you need to know about Mean Time Between Incidents (MTBI) and how it can help Site Reliability Engineers.

How Cortex can help SRE teams amplify their reach

Mar 16, 2023 By Cortex In Cortex

Site reliability engineers can amplify their reach and influence across teams with Cortex in their toolbox. With the ability to define clear standards and drive progress, Cortex enables everyone at an organization to adopt an SRE mindset. Make sure to visit us at SREcon to learn more about how Cortex can serve as a single source of truth for your SRE team!

How We Define SRE Work, as a Team

Mar 16, 2023 By Fred Hebert In Honeycomb

Last year, I wrote “How We Define SRE Work.” This article described how I came up with the charter for the SRE team, which we bootstrapped right around then. It’s been a while. The SRE team is now four engineers and a manager. We are involved in all sorts of things across the organization, across all sorts of spheres. We are embedded in teams and we handle training, vendor management, capacity planning, cluster updates, tooling, and so on.

8 SRE Best Practices to Help Developers Troubleshoot Kubernetes

Mar 15, 2023 By Lisa Wells In StackState

Maintaining reliable Kubernetes systems is not easy, especially for people who are not Kubernetes experts. This blog, part 2 of 3 in the “8 SRE Best Practices to Help Developers Troubleshoot Kubernetes” series, explains 8 simple best practices SREs can follow to help developers and other SREs build knowledge and effectively troubleshoot issues in applications running on Kubernetes.

What is SOC 2 Compliance? | A Guide to SOC 2 Certification

Mar 15, 2023 By Emily Arnott In Blameless

We’re excited to announce that Blameless is officially SOC 2 compliant! This is part of our larger efforts to assure all the users of Blameless and visitors to our site that we’re meeting and exceeding all of your privacy and security needs. Learn more by visiting our security page! When choosing a service, it’s important to have trust in the provider – especially for something as important as your incident management.

Squadcast + Auvik Integration: Routing alert made easy

Mar 14, 2023 By Vishal Padghan In Squadcast

Auvik is cloud-based network management software that gives you instant insight into the networks you manage and automates complex and time-consuming network tasks. If you use Auvik for network management, you can integrate it with Squadcast, an end-to-end incident response tool, to route detailed alerts from Auvik to the right users in Squadcast. This blog is a step-by-step guide that will help you set up the Squadcast-Auvik integration.

Protect PII and add geolocation data: Monitoring legacy systems with Grafana

Mar 14, 2023 By Mattias Segerdahl In Grafana

Legacy systems often present a challenge when you try to integrate them with modern monitoring tools, especially when they generate log files that contain personally identifiable information (PII) and IP addresses. Thankfully, Grafana Cloud, which is built to work with modern observability tools and data sources, makes it easy to monitor your legacy environments too.

Adopting SRE: Standardizing your SLO design process

Mar 11, 2023 By Derek Remund In Google Operations

Designing SLOs is a key SRE competency which requires careful consideration and a consistent approach to implementation.

Datadog On Reliability Engineering

Mar 7, 2023 By Datadog In Datadog

There are many different ways to implement Site Reliability Engineering (SRE). From team structures to roles and responsibilities to planning and prioritization flows, there’s no golden path for how to organize things. As Datadog has shifted from a startup to a quickly-growing public company, we’ve seen our own SRE practice evolve. With over 22,000 customers sending trillions of data points each day, keeping Datadog reliable is critical to our business.

What Does IT Maturity Even Mean?

Mar 7, 2023 By Aaron Lober In Blameless

Seriously… What are people trying to say by “Your approach to IT Operations needs to mature”? Fair question. Billions of dollars are spent every year on software solutions to help IT organizations operate more efficiently. How could it be that with all that investment, we’re still not netting enough efficiency gains? The truth is, our technology landscape has evolved, our operational models have evolved, we have evolved.

SLA vs SLO vs SLI - What's the difference

Mar 7, 2023 By Last9 In Last9

What's the difference between SLAs, SLOs, and SLIs? Understanding these nuances is critical for DevOps folks. Here's a simple ready reckoner on what each of these means.
