August 2023

Reimagining Retrospectives

Aug 31, 2023 By Blameless In Blameless

The Blameless retrospective is one of the most often discussed yet most rarely executed components of the SRE practice. Getting real value from the retrospective process takes time, focus and the right approach. This webinar features Ken Gavranovic and Lee Atchison, author of Architecting For Scale, who discuss the blueprint for high-performing engineering teams to maximize the value of retrospectives.

View Video

Blameless

Why Resilience Engineering Needs To Be A C-Level Strategy & How To Get There

Aug 30, 2023 By Ryan Green In Reliably

The consequences of downtime and data breaches can be devastating to organizations, leading to substantial financial losses and irreparable damage to a business’s reputation. If last week's outage at the Bank of England, which lost trillions of pounds per day to downtime, is anything to go by, resilience shouldn’t be just an afterthought for organizations.

Read Post

Reliably

Latest Developments in Site Reliability Engineering, 2023

Aug 30, 2023 By Halle Katz In OnPage

Gartner recently published its Hype Cycle for Site Reliability Engineering, 2023 report (July 2023). OnPage was inspired by this report to share its prediction about the future of site reliability engineering. In this blog, OnPage will review evolutionary tools that can improve site reliability engineering practices.

Read Post

OnPage

Why the Blameless Mission Matters Today

Aug 30, 2023 By Emily Arnott In Blameless

Blameless was founded over 5 years ago, in a world that looked very different than the world today. We were the first mover in the incident management space, setting the standards for what these tools should achieve. These days, concerns about reliability, incidents, and toil have hit the mainstream. Why have we seen the tech world enter an era where reliability is priority #1? Why do we believe that the Blameless mission matters more today than ever before?

Read Post

Blameless

Levitate - Last9's managed TSDB is now available on the AWS Marketplace

Aug 29, 2023 By Prathamesh Sonpatki In Last9

Levitate - Last9's managed TSDB is available on AWS Marketplace.

Read Post

Last9

A Practical Guide to Incident Communication

Aug 28, 2023 By Emily Arnott In Blameless

Even the best software fails sometimes. How quickly those failures get addressed, and how your teammates and customers feel about you after the fact, comes down to how well you communicate with them. Users, customer success managers, Ops team members, IT, security, engineering leadership, even the executive team. Each has a vested interest in resolving engineering incidents quickly. All need to be updated with the right information at the right time.

Read Post

Blameless

Continuous Deployment vs. Delivery | Differences Explained

Aug 28, 2023 By Noor-ul-Anam Ruqayya In Blameless

Curious about continuous deployment vs delivery? We explain what each is, what happens in each step, and their importance in the DevOps lifecycle.

Read Post

Blameless

What is MTTR? The Different Meanings Explained

Aug 28, 2023 By Noor-ul-Anam Ruqayya In Blameless

Curious about MTTR? We explain what the mean time to recovery is, why it matters to your development team, and how to reduce it.

Read Post

Blameless

Incident Management KPIs | Choosing Metrics that Matter

Aug 28, 2023 By Noor-ul-Anam Ruqayya In Blameless

Wondering about incident management KPIs? We explain what incident management metrics are, how to track them, and what to do with the information.

Read Post

Blameless

How to use Key-Based Deduplication in Squadcast | Deduplication Rules | Squadcast

Aug 28, 2023 By Squadcast In Squadcast

Key Based Deduplication is an efficient way to avoid duplicate entries when processing incoming Events alongside existing Incidents. It generates a Deduplication Key using a user-defined template specific to events from an Alert Source. This key helps identify and group duplicates. This video explains how Key Based Deduplication works and how to set it up effectively.

View Video

Squadcast

Helm Dry Run: Guide & Best Practices

Aug 27, 2023 By Squadcast Community In Squadcast

Kubernetes, the de facto standard for container orchestration, supports two deployment options: imperative and declarative. Because they are more conducive to automation, declarative deployments are typically considered better than imperative ones. The issue with the declarative approach is that YAML manifest files are static.

Read Post

Squadcast

Managing On-Call Rotations: Navigating Incident Management from Chaos to Calm

Aug 25, 2023 By Chitra Bisht In Squadcast

Navigating On-Call rotations can often feel like taming a storm of alerts and constant disruptions, leaving teams overwhelmed and stressed. Teams therefore need to streamline On-Call rotations and leverage the right software to restore order and peace. In this guide, you'll explore practical tips, best practices, and smart strategies to transform your Incident Management process. Let's get to a more efficient On-Call experience.

Read Post

Squadcast

PromQL Macros in Levitate

Aug 25, 2023 By Prathamesh Sonpatki In Last9

Define PromQL Macros to standardize complex PromQL queries in Levitate.

Read Post

Last9

GCP Managed Service For Prometheus vs. Levitate

Aug 24, 2023 By Prathamesh Sonpatki In Last9

A detailed comparison of Levitate and Google Managed Prometheus - Cost, Scale and Ease of Use.

Read Post

Last9

A case for Observability outside engineering teams

Aug 23, 2023 By Aniket Rao In Last9

Observability is being built by engineers for engineers. In reality, o11y is for all.

Read Post

Last9

We Need to Talk About the Hero Pattern Among SREs

Aug 22, 2023 By Hans Chung In Rootly

Let’s be honest. When you see an alert pop up on your phone, you aren’t thinking “according to section 12 of our most recent SRE handbook used at training 6 months ago I need to keep in mind who should be Incident Commander and who should be Ops Lead”. You’re an engineer at heart.

Read Post

Rootly

The Iceberg of Engineering Incident Costs

Aug 22, 2023 By Aaron Lober In Blameless

I've long been fascinated with the metaphor of an iceberg to describe a problem whose true magnitude is obscured beneath the surface. If you’re not familiar with this phenomenon, when ice freezes it decreases in density. This allows the solid ice to float, partially, atop the water with only a small fraction of it exposed. In fact, icebergs hold nearly 90% of their mass hidden below the water.

Read Post

Blameless

Understanding the Rasmussen model for failures

Aug 18, 2023 By Nishant Modak In Last9

What does the Rasmussen model teach us about Site Reliability Engineering?

Read Post

Last9

Checking your observability and communication platforms with Reliably

Aug 17, 2023 By Reliably In Reliably

#reliably #chaosengineering #honeycomb #slack #resilience
In this video, we will use a chaos engineering experiment, that we expect to fail, to verify our open tracing and communication platforms are correctly set up. Using the Honeycomb and Slack integrations provided by Reliably, we will send traces and messages and observe if they are triggered as expected.

View Video

Reliably

10 Observability Tools in 2023: Features, Market Share and Choose the Right One for You

Aug 17, 2023 By Anjali Udasi In Zenduty

Understanding what's happening within your systems is a necessity. Have you ever wondered how experts keep an eye on systems to make sure everything's running smoothly? That's where observability tools come in! Observability tools are like helpers that give you a peek inside your tech. In this blog, we will talk about observability tools and how they can be used in different situations so it's easier for you to choose the right one for your organization.

Read Post

Zenduty

Impact of Kubernetes cluster maintenance on application availability

Aug 17, 2023 By Reliably In Reliably

#kubernetes #eks #chaosengineering
In this video, we will be exploring an interesting scenario that might happen in real life. Let's imagine we have an application running in a Kubernetes cluster inside EKS. If for any reason, two of our three nodes are cordoned and can't be scheduled anymore, what would happen to our users should the last node be cordoned as well? And what if we need to reschedule something?

View Video

Reliably

An introduction to Reliably

Aug 16, 2023 By Reliably In Reliably

#reliably #chaosengineering #resilience
In this video, we'll show how you can use starters and the Reliably cloud to get started very quickly.

View Video

Reliably

DevOps
SRE

Running an experiment in GitHub with Reliably

Aug 16, 2023 By Reliably In Reliably

#reliably #chaosengineering #github #githubactions #resilience
Reliably lets you run experiments not only from the Reliably cloud but from your own environment. This video will focus on running a chaos engineering experiment in GitHub.

View Video

Reliably

Running Reliably experiments from a Kubernetes cluster

Aug 16, 2023 By Reliably In Reliably

#reliably #chaosengineering #resilience #kubernetes #k8s
Reliably lets you run experiments not only from the Reliably cloud but from your own environment. This video will focus on running a chaos engineering experiment in a Kubernetes cluster.

View Video

Reliably

But It's Not Our Fault! When Third-party Incidents Affect Your Service

Aug 14, 2023 By Ashley Sawatsky In Rootly

Very few SaaS products exist completely independently. Between cloud service providers, payment processors, content delivery networks, and more, chances are you rely on external systems to keep your product working. When these systems fail, it can leave you feeling pretty helpless. In some cases you might have fallback options, but oftentimes all you can do is wait for recovery and clean up the fallout.

Read Post

Rootly

How we tame High Cardinality by Sharding a stream

Aug 14, 2023 By Piyush Verma In Last9

Using 'Sharding' to tame High Cardinality data for Levitate - Our Time Series Data Warehouse.

Read Post

Last9

Azure Monitoring Agent: Key Features & Benefits

Aug 13, 2023 By Squadcast Community In Squadcast

In today's rapidly evolving digital landscape, businesses increasingly rely on cloud computing and infrastructure to support their operations. As organizations migrate their workloads to the cloud, robust monitoring and management tools are paramount to ensure optimal performance, security, and efficiency. In response to this demand, Microsoft Azure has introduced the Azure Monitoring Agent (AMA), a powerful and versatile solution designed to enhance the monitoring capabilities of Azure resources.

Read Post

Squadcast

Splashing into Data Lakes: The Reservoir of Observability

Aug 11, 2023 By JJ Jeffries, Head of Marketing In ObservIQ

If you’re a systems engineer, SRE, or just someone with a love for tech buzzwords, you’ve likely heard about “data lakes”. Before we dive deep into this concept, let’s debunk the illusion: there aren’t any floaties or actual lakes involved! Instead, imagine a vast reservoir where you store loads and loads of raw data in its natural format. Now, pair this with the idea of observability and telemetry pipelines, and we have ourselves an engaging topic.

Read Post

ObservIQ

How To Write Incident Postmortems

Aug 10, 2023 By Anjali Udasi In Zenduty

Writing a public postmortem regarding an outage is essential to maintaining transparency and accountability when things go wrong in a service or system. The purpose of writing a postmortem is to analyze and document an incident or event that has occurred, usually with a focus on identifying its root causes, understanding what went wrong, and outlining steps to prevent similar issues from happening in the future.

Read Post

Zenduty

Rootly Raises $12 Million from Renegade Partners, Google Gradient Ventures, & XYZ Ventures

Aug 10, 2023 By JJ Tang In Rootly

We are excited to announce that we have raised a $12M round of financing led by Renegade Partners with participation from Google Gradient Ventures (Google’s AI-focused venture fund) and XYZ Ventures. This brings our total funding to date to $15.2M ($20M CAD) alongside our other existing investors Y Combinator and 8VC.

Read Post

Rootly

SRE in Transition: From Startup to Enterprise

Aug 9, 2023 By Datadog In Datadog

"Startups are defined by “ship or die”. As a result, SRE teams at a startup should be focused on enabling product engineers to ship features as quickly as possible. As your startup transitions from “we’ll run out of money in the next 18 months” to “we have more than 1000 engineers”, how should the SRE organization evolve and provide the best value through that transition (including booting one up if you don’t have one)? I will discuss specific ways the organization needs to evolve to meet this challenge, how the SRE org can advocate for and support this change (both in direct actions and in “influence”), and how the overhang of startup technical and cultural debt can make this shift more challenging (but also more necessary).

View Video

Datadog

Tools and Trends in Site Reliability Engineering according to Gartner's 2023 Hype Cycle

Aug 9, 2023 By Halle Katz In OnPage

Gartner recently published its Hype Cycle for Site Reliability Engineering, 2023, report. This blog reviews the future of site reliability engineering based on Gartner’s Hype Cycle. Additionally, the OnPage team is pleased that Gartner mentioned OnPage as a sample vendor in the Automated Incident Response category.

Read Post

OnPage

Thanos vs. VictoriaMetrics

Aug 9, 2023 By Last9 In Last9

A deep dive comparison between Thanos and VictoriaMetrics: Performance and Differences.

Read Post

Last9

Observability vs. Telemetry vs. Monitoring

Aug 9, 2023 By Last9 In Last9

Observability vs Telemetry vs Monitoring - What they are, differences and what lies in future.

Read Post

Last9

Virtual Roundtable - Site Reliability Engineers

Aug 8, 2023 By Cortex In Cortex

View Video

Cortex

DevOps
SRE

Evolution of Site Reliability - Incidentally Reliable with Manoj Sebastian

Aug 4, 2023 By Zenduty In Zenduty

Catch Manoj Sebastian (ex-Flipkart, Amazon, Atlassian, Intuit, Yahoo) talk about the evolution of SRE through 20 years, incident response and post-incident culture at Big Tech, and the future of reliability with AI ramping up at full speed. The freshest podcast for Site Reliability Engineers, hosted by Vishwa and Shubham from Zenduty.

View Video

Zenduty

Unveiling Squadcast's Enhanced Status Pages

Aug 3, 2023 By Sanjog Sandhu In Squadcast

Meet Kevin and Mai (again): Navigating the Troublesome Waters of Platform Downtime. Kevin is a Site Reliability Engineer (SRE), constantly on the lookout for potential downtime that could impact their platform, kryptobro.com. Mai is his adept partner, ever-ready to troubleshoot. In their journey, the previous version of Squadcast Status Pages served as a helpful tool, but they soon found room for improvements.

Read Post

Squadcast

SRE Redefines IT Operations as Architect of Sustainable Systems

Aug 3, 2023 By Ari Stowe In Resolve

Site Reliability Engineering (SRE) is a term that’s getting attention and gaining momentum, and for good reason. SRE takes features of software engineering and applies them to various problems in infrastructure and operations. Organizations look to build SRE teams with a couple of goals in mind, including increasing scalability and developing solid software systems.

Read Post

Resolve

Kubernetes Incident Management Best Practices

Aug 3, 2023 By Rajesh Tilwani In Rootly

Creating just any infrastructure on Kubernetes is not enough. There are so many basic configurations you could apply and create the infrastructure for your application for the time being and it might work just fine. The incident responses won’t always remain 100% reliable. You will run into newer potholes, and that’s okay.

Read Post

Rootly

Understanding Blameless Postmortems

Aug 2, 2023 By Anjali Udasi In Zenduty

Progress in organizations is often accompanied by unforeseen challenges and mishaps. Traditionally, these setbacks resulted in pointing fingers, hindering progress, and creating a negative work atmosphere. However, a "Blameless Postmortems" approach transforms how organizations respond to failure. In this blog, we will delve into the importance of cultivating a blameless postmortem culture when faced with setbacks.

Read Post

Zenduty

Introducing Squadcast's Key Based Deduplication

Aug 1, 2023 By Vishal Padghan In Squadcast

We are excited to share another feature update with all our valued customers! We have recently gone live with our Key Based Deduplication feature, enabling you to define dedup keys using customizable templates for configured alert sources. With this feature, you can automatically group similar incidents and effectively deduplicate alerts.

Read Post

Squadcast
