SRE

The latest news and information on Site Reliability Engineering and related technologies.

Get data-driven executive communication out of the box with Reliability Insights

Blameless’s comprehensive incident management platform is built to ease the burden of keeping your services up and running. Whether you are in the middle of an incident or trying to better track your response performance, you need access to your incident data on demand. Blameless’s Reliability Insights unifies your Incident, Resource, Task, and IAM data in a single customizable and queryable analytics tool.

A Kubernetes Observability Tool to Support SRE Best Practices

Kubernetes can be tough to troubleshoot and remediate fast, especially when you have many interdependent services. This blog, part 3 of 3 in the “8 SRE Best Practices to Help Developers Troubleshoot Kubernetes” series, describes the Kubernetes observability foundation StackState has built to support SRE best practices and enable rapid remediation of issues.

How Cortex can help SRE teams amplify their reach

Site reliability engineers can amplify their reach and influence across teams with Cortex in their toolbox. With the ability to define clear standards and drive progress, Cortex enables everyone at an organization to adopt an SRE mindset. Make sure to visit us at SREcon to learn more about how Cortex can serve as a single source of truth for your SRE team!

How We Define SRE Work, as a Team

Last year, I wrote “How We Define SRE Work.” This article described how I came up with the charter for the SRE team, which we bootstrapped right around then. It’s been a while. The SRE team is now four engineers and a manager. We are involved in all sorts of things across the organization, across all sorts of spheres. We are embedded in teams and we handle training, vendor management, capacity planning, cluster updates, tooling, and so on.

8 SRE Best Practices to Help Developers Troubleshoot Kubernetes

Maintaining reliable Kubernetes systems is not easy, especially for people who are not Kubernetes experts. This blog, part 2 of 3 in the “8 SRE Best Practices to Help Developers Troubleshoot Kubernetes” series, explains 8 simple best practices SREs can follow to help developers and other SREs build knowledge and effectively troubleshoot issues in applications running on Kubernetes.

What is SOC 2 Compliance? | A Guide to SOC 2 Certification

We’re excited to announce that Blameless is officially SOC 2 compliant! This is part of our larger efforts to assure all the users of Blameless and visitors to our site that we’re meeting and exceeding all of your privacy and security needs. Learn more by visiting our security page! When choosing a service, it’s important to have trust in the provider – especially for something as important as your incident management.

Squadcast + Auvik Integration: Routing alerts made easy

Auvik is cloud-based network management software that gives you instant insight into the networks you manage and automates complex and time-consuming network tasks. If you use Auvik for network management, you can integrate it with Squadcast, an end-to-end incident response tool, to route detailed alerts from Auvik to the right users in Squadcast. This blog is a step-by-step guide that will help you set up the Squadcast-Auvik integration.

Protect PII and add geolocation data: Monitoring legacy systems with Grafana

Legacy systems often present a challenge when you try to integrate them with modern monitoring tools, especially when they generate log files that contain personally identifiable information (PII) and IP addresses. Thankfully, Grafana Cloud, which is built to work with modern observability tools and data sources, makes it easy to monitor your legacy environments too.