Operations | Monitoring | ITSM | DevOps | Cloud

The latest news and information on Service Reliability Engineering and related technologies.

Metrics Monitoring: The Only Guide You'll Need

When major tech companies maintain high availability while others struggle with frequent outages, the difference often comes down to one thing: effective metrics monitoring. This guide will walk you through everything you need to know about metrics monitoring, from fundamental concepts to advanced strategies.

Distributed Network Monitoring: Guide to Getting Started & Troubleshooting

When systems span clouds, containers, and regions, knowing what’s happening under the hood is more than a nice-to-have—it’s critical. Traditional monitoring tools often fall short in these complex setups. That’s where distributed network monitoring steps in. This guide cuts through the noise to offer a clear, practical approach to keeping tabs on distributed systems—without drowning in dashboards or alert fatigue.

A Comprehensive Guide to Monitoring Disk I/O on Linux

In a Linux environment, understanding how your storage devices perform can mean the difference between a system that flies and one that crawls. Whether you're troubleshooting performance issues or fine-tuning your server setup, getting familiar with Linux disk I/O statistics is an essential skill for any tech professional. This guide breaks down everything you need to know about Linux disk I/O stats - from basic concepts to practical monitoring techniques that you can implement today.

How to Use MySQL Performance Analyzer

If you're dealing with slow MySQL queries and wondering why your database performance is lagging, you're not alone. MySQL performance analyzers are key tools for pinpointing bottlenecks, optimizing queries, and ensuring your databases stay efficient and responsive. Let’s explore how these tools can help you keep things running smoothly.

Apache Cassandra Monitoring: Tools, Challenges & Best Practices

When your distributed database architecture scales to handle massive workloads, keeping tabs on everything becomes critical and complex. With its masterless architecture and linear scalability, Apache Cassandra powers mission-critical applications across industries—but without proper monitoring, you might as well be flying blind through a storm.

The New Rootly Ringtones: How Research-based On-Call Sounds

We set out to create a ringtone that wasn't just loud, but the sound of a modern pager: something that wakes you up without triggering a full-blown adrenaline spike. In this video, go behind the scenes with sound engineer Gorjão as he crafts a research-based on-call sound.

GDPR Log Management: A Practical Guide for Engineers

GDPR compliance for logs can be tricky—especially when you're trying to maintain system visibility and protect user data at the same time. For SREs and IT teams, it’s a balancing act between staying on the right side of privacy laws and not losing the context you need to troubleshoot. This guide walks through practical ways to handle personal data in logs, set up retention rules that make sense, and stay compliant without creating unnecessary friction.

Why Reliability Starts with the Network, even in the AI era, with Marino Wijay

In this episode, we explore how networking has shaped reliability as we know it. Marino Wijay, cloud networking expert and Staff Solutions Architect at Kong, shares how his journey began not as an SRE, but with cables, routers, and switches. Marino explains how the fabric holding systems together evolved through virtualization and software-defined networking, which is now a key element of resilient applications.