
SRE

The latest news and information on Site Reliability Engineering and related technologies.

[SRE: From Theory to Practice] What's difficult about problem detection?

In this episode of FTTP, Kurt Andersen and Matt Davis are joined by Joanna Mazgaj and Laura Nolan to talk about the implications of and considerations for problem detection. Watch the full episode and hear them share personal stories about the types of challenges you might face. Ultimately, how do we explain and address the socio-technical concepts behind problem detection?

[SRE: From Theory to Practice] What's difficult about incident command?

Welcome back to our miniseries of fireside chats with SRE experts about the realities of their day-to-day work. Episode 2 gets intimate: what’s difficult about incident command? We invited Alyson van Hardenberg, Engineering Manager at Honeycomb.io, and Varun Pal, Staff SRE at Procore, to chat with Jake Englund and Matt Davis from the Blameless team. Watch the full conversation, where they cover everything from methodologies and technical expertise to the human and social aspects of reliability engineering.

Using Tagging and Routing Rules in Squadcast | Incident Classification | Event Tagging | Squadcast

Event Tagging is a rule-based auto-tagging system that lets you define custom tags based on incident payloads. The tags are assigned automatically to incidents when they are triggered. This video explains how to create Tagging rules for efficient Incident Classification.
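The idea of rule-based tagging can be illustrated with a minimal Python sketch. This is not Squadcast's rule syntax or API: the `TaggingRule` class, the `apply_rules` function, and the payload fields `severity` and `message` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical payload shape: a plain dictionary of incident fields.
Payload = Dict[str, Any]


@dataclass(frozen=True)
class TaggingRule:
    """A condition on the payload plus the tags to assign when it matches."""
    condition: Callable[[Payload], bool]
    tags: Dict[str, str]


def apply_rules(payload: Payload, rules: List[TaggingRule]) -> Dict[str, str]:
    """Return the tags from every rule whose condition matches the payload.

    Rules are evaluated in order; a later rule overwrites an earlier one
    if both assign the same tag key.
    """
    assigned: Dict[str, str] = {}
    for rule in rules:
        if rule.condition(payload):
            assigned.update(rule.tags)
    return assigned


# Illustrative rules (assumed field names, not a real integration payload).
RULES = [
    TaggingRule(lambda p: p.get("severity") == "critical", {"priority": "P1"}),
    TaggingRule(
        lambda p: "database" in p.get("message", "").lower(),
        {"component": "database"},
    ),
]

example = {"severity": "critical", "message": "Database connection pool exhausted"}
tags = apply_rules(example, RULES)
print(tags)
```

Keeping each rule as a condition and tag set makes the classification step a single pass over the rules at trigger time, which is the behavior the description above attributes to auto-tagging.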

Adding Incident Watchers in Squadcast | Incident Notifications and Updates | Squadcast

This video covers Squadcast's Incident Watchers feature. In Squadcast, any user or stakeholder can subscribe to an incident as a Watcher and observe it without being an active responder. Watchers can choose to receive notifications for every update to the incident, or customize their watch options and receive notifications only for the updates they select.

SRE Vs. DevOps: A Simple Breakdown Of The Differences

You know this already. Regardless of your size, you must keep up with technological developments in your industry — and, increasingly, in other industries, even those that seem unrelated. Embracing disruption can enable you to increase your market share, revenue, and profit margins. Delegating some development and operations responsibilities to Site Reliability Engineering (SRE) experts allows developers to innovate and create new solutions faster.

SRE Principles for Edge Management and Improving Resiliency Using the Best of Kubernetes

This post was co-written by Kirti Apte and Gabry (Maria Gabriella) Brodi. Over the last couple of years, customers have been adopting Kubernetes and microservice-based application deployment models for various technology and business reasons. Customers are now increasingly looking to the next set of use cases, which include applications spanning multiple clouds as well as edge clouds.

Announcing: Blameless + OpsGenie Integration

In the opening moments of an engineering incident, the most important aspect of a response plan is speed. Getting out of the gate quickly by leveraging automation to assemble the team can save precious moments during a critical engineering incident and make the difference between happy and unhappy customers downstream. This is why we’re excited to announce the integration of Blameless with OpsGenie.

Webinar Recap: How Observability Impacts SRE, Development, and Security Teams

In today’s fast-paced and constantly evolving digital landscape, observability has become a critical component of effective software development. Companies increasingly rely on machine and telemetry data to fix customer problems, refine software and applications, and enhance security. However, while more data has given teams more insights, the value derived from that data isn’t keeping pace with its growth. So how can these teams derive more value from telemetry data?

Analytics in Squadcast | Visualize Team and Organization Level Analytics | MTTA MTTR | Squadcast

Analyzing incident data plays a key role in doing SRE well. Squadcast's Analytics Dashboard helps you analyze the performance of your Organization or Team over a given time period. It also gives you more insight into past outages that affected your systems.
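The MTTA and MTTR metrics named in the title can be computed from incident timestamps. The sketch below is a minimal illustration, assuming MTTA means mean time to acknowledge and MTTR means mean time to resolve, both measured from the moment an incident is triggered. The `Incident` class and its field names are hypothetical and do not reflect how Squadcast stores or computes these values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with the three timestamps needed."""
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mtta(incidents: List[Incident]) -> timedelta:
    """Mean time from trigger to acknowledgement (raises on an empty list)."""
    seconds = [
        (i.acknowledged_at - i.triggered_at).total_seconds() for i in incidents
    ]
    return timedelta(seconds=mean(seconds))


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time from trigger to resolution (raises on an empty list)."""
    seconds = [
        (i.resolved_at - i.triggered_at).total_seconds() for i in incidents
    ]
    return timedelta(seconds=mean(seconds))


# Made-up sample data for illustration only.
incidents = [
    Incident(
        triggered_at=datetime(2023, 1, 1, 10, 0),
        acknowledged_at=datetime(2023, 1, 1, 10, 4),
        resolved_at=datetime(2023, 1, 1, 10, 30),
    ),
    Incident(
        triggered_at=datetime(2023, 1, 1, 12, 0),
        acknowledged_at=datetime(2023, 1, 1, 12, 6),
        resolved_at=datetime(2023, 1, 1, 13, 30),
    ),
]

print(mtta(incidents))  # 0:05:00
print(mttr(incidents))  # 1:00:00
```

Restricting the input list to incidents triggered within a chosen window gives the per-period view described above.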