The Definitive AWS Outage Report 2025: Reliability Analytics and Cascade Impact

Amazon Web Services remains one of the most popular cloud providers, with 200+ services in 39 regions across the world. Like all providers, it has its share of outages. In 2025, IncidentHub detected 38 AWS outages. The one on October 20th had the most widespread impact, affecting hundreds of SaaS providers simultaneously. Payments were disrupted, students lost access to classrooms, developer tooling degraded, and some IT teams experienced alerting gaps.

The rise of agentic AI in production: Can observability systems run themselves?

Sometimes the biggest shifts in technology aren’t about collecting more data; they’re about who (or what) gets to act on it. In this episode of “Grafana’s Big Tent” podcast, host Tom Wilkie, Grafana Labs CTO, is joined by Spiros Xanthos, Founder & CEO of Resolve AI, Manoj Acharya, VP of Engineering for Observability at Grafana Labs, and Cyril Tovena, Principal Engineer on the Grafana Assistant team, to discuss agentic AI in observability.

The Grafana Labs operating system: Introducing our Guiding Principles

Matt Toback is the VP of Culture at Grafana Labs. We published our original company values back in December 2020. We were a young company, growing fast, and fully remote. Our values at the time were aspirational, and painted a picture of the kind of company we wanted to be. Those values did real work and they mattered. You could hear them used in everyday conversations, and they helped get us to where we are today. But growth has a way of revealing gaps.

This Month in Datadog - February 2026

On the first episode of This Month in Datadog in 2026, Jeremy covers how you can protect agentic AI applications with AI Guard, stay up to date and collaborate during incidents with five Incident Management releases, and ship software with confidence using Feature Flags. Later in the episode, Kevin spotlights Datadog Data Observability, which enables you to detect data quality and pipeline issues early.

Why Evidence-Backed RCA in Edwin AI Starts With Logs

A step-by-step look at how Edwin AI uses native LogicMonitor logs, topology, and context to turn root cause analysis from alert-driven inference into evidence-backed investigation. Most root cause analysis today starts with alerts and ends with explanations that sound reasonable but can’t be verified. An alert is fed into a language model, and the output looks like an answer. It often isn’t.

8 Years of Building Obkio: From Network Monitoring to Observability & Network Diagnostics

In 2016, Obkio was just an idea, but it was an idea born from a real problem. Before writing a single line of code, we conducted a market audit to understand why Network Performance Monitoring solutions weren't more mature. We interviewed banks, manufacturing companies, and service providers, and the answer was unanimous: the NPM tools on the market were too complex, and most businesses simply didn't have the internal resources to dedicate full-time to managing them.

Your Data is Whispering and Needs a Human to Listen

If you have ever owned, operated, or supported a piece of technology, you have probably built a dashboard. Maybe it started as a quick chart to answer a simple question, then quietly grew into something more important. Dashboards are often created by the people who know the systems best, the ones who can wire together data sources and click all the right buttons. But those same builders are rarely trained in how humans actually interpret data.

Top tips: Think it's a recommendation? It might be an ad

Top tips is a weekly column where we highlight what’s trending in the tech world and list ways to explore these trends. This week, we'll be looking at ways we can spot ads disguised as recommendations in today's influencer era. These days, it's getting harder for me to distinguish between an ad and a recommendation.