
Reliability is when customers aren't impacted

Ultimately, a system is reliable when customers and engineers can count on it. Full transcript: When I get to hear stories like, "Hey, we just had our holiday sales event kick off and everything went smoothly and I didn't have to wake up in the middle of the night," that is really the true definition of reliability: the people who are constantly hands-on-keyboard, in charge of making sure that people like myself and like you aren't impacted when we're going to, for example, buy a new pair of sneakers, or get some sort of limited edition release that's coming out, right?

Zero-downtime deployment with Flagsmith and CircleCI

As developers, we continually strive to improve our software. This often means rolling out new software features at a rapid pace. However, deploying new features to production is not without risk. From no real production testing to limited rollback options, traditional deployment can quickly become frustrating. The worst issues, though, usually stem from one thing: buggy features making their way into the hands of users.

Using Claude to power up your onboarding

I joined incident.io about ten weeks ago, having been in my previous role for four and a half years. Being a new starter was an unusual feeling for me, and there's been a huge amount to learn; but by lunch on my second day (!) I had started shipping value to our customers. A large part of hitting the ground running has been having a colleague alongside me, who I can pester with questions, who doesn’t get offended when I write in all capitals, and often praises me for being absolutely right!

Incident Management Takes a Giant Leap with Next-Gen ServiceNow Integration

In the fast-paced world of digital operations, the gap between detecting an issue and resolving it can mean the difference between a blip in service and a full-scale customer impact. That’s why organizations worldwide rely on ServiceNow for IT service management and xMatters for intelligent incident response automation.

Real-Time Analytics Made Simple with Kafka and Iceberg

The Aiven Platform is more than a collection of open source services for streaming, storing and analyzing data. The platform ensures that all services run reliably and securely in the clouds of your choice, are observable, and can easily be integrated with each other and with external third-party tools.

AI for Grafana onboarding: Get your teams started quicker with Grafana Assistant

Grafana puts a powerful set of observability capabilities right at your fingertips, but onboarding entire teams to the sophisticated platform is often a nontrivial exercise—one that can slow adoption and prevent organizations from getting immediate value. We want to make the process as frictionless as possible, which is why we’re excited to tell you that Grafana Assistant is now available in public preview to all Grafana Cloud users.

AI in observability at Grafana Labs: Making observability easy and accessible for everyone

Did you know that observability has been around for more than six decades? It all goes back to a Hungarian-American inventor named Rudolf Kálmán who thought about how external outputs could measure the internal state of a machine. Kálmán wrote about monitoring single-input single-output systems, but our demands are very different today. We need to observe monoliths, microservices, clusters, pods, regions, and much more.

Network Switch Monitoring: How to Monitor Switch Performance with SNMP

If you’ve spent any time managing networks, you know switches are the backbone that keeps everything connected, but it’s easy to take them for granted until something breaks. Monitoring network switches isn’t just “nice to have”; it’s critical if you want to avoid those sudden outages that bring everything to a halt.

Cortex MCP setup

Learn how to set up the Cortex MCP in under 5 minutes. The MCP integrates directly into your IDE, giving instant access to Cortex data without leaving your coding environment. It reduces context switching by enabling natural questions about services and teams, and streamlines workflows with real-time data from Cortex, Jira, GitHub, and more.