The Definitive AWS Outage Report 2025: Reliability Analytics and Cascade Impact

Amazon Web Services remains one of the most popular cloud providers, with 200+ services in 39 regions across the world. Like all providers, it has its share of outages. In 2025, IncidentHub detected 38 AWS outages, of which the one on October 20th had the most widespread impact, affecting hundreds of SaaS providers simultaneously. Payments were disrupted, students lost access to classrooms, developer tooling degraded, and some IT teams experienced alerting gaps.

Why measuring things openly is the first step toward a stronger engineering culture

Most engineering leaders know they should be measuring more. What holds many of them back is a quieter concern about whether the organization is actually ready to see the numbers. This tension, however, did not keep Ganesh Datta, our co-founder and CTO, and Randy Shoup, SVP of Engineering at Thrive Market, from diving down this rabbit hole on the Braintrust podcast.

How to Create an AI Chatbot for Your Website?

Chatbots are starting to look fairly promising for businesses of all kinds. Customers today are keen to get things resolved faster than ever, and every startup out there is tempted to adopt one. But before jumping on the bandwagon, you need to think about what type of chatbot to invest in. The decisive question is which model of conversational AI best aligns with the needs of your organization.

From RCA to Autonomous Ops: The Future of AI in Observability | Big Tent S3E7

SREs are famously skeptical of AI — so how do you convince them to trust agents in production? In this episode of Grafana’s Big Tent, Tom Wilkie talks with Spiros Xanthos (Resolve AI), Manoj Acharya (Grafana Labs), and Cyril Tovena (Grafana Assistant team) about agent-first observability. They unpack knowledge graphs, LLM reasoning, autonomous debugging, pricing models, and the “Claude Code moment” for observability. Is autonomous production ops closer than we think?

The rise of agentic AI in production: Can observability systems run themselves?

Sometimes the biggest shifts in technology aren’t about collecting more data — they’re about who (or what) gets to act on it. In this episode of “Grafana’s Big Tent” podcast, host Tom Wilkie, Grafana Labs CTO, is joined by Spiros Xanthos, Founder & CEO of Resolve AI, Manoj Acharya, VP of Engineering for Observability at Grafana Labs, and Cyril Tovena, Principal Engineer on the Grafana Assistant team, to discuss agentic AI in observability.

The Grafana Labs operating system: Introducing our Guiding Principles

Matt Toback is the VP of Culture at Grafana Labs. We published our original company values back in December 2020. We were a young company, growing fast, and fully remote. Our values at the time were aspirational, and painted a picture of the kind of company we wanted to be. Those values did real work and they mattered. You could hear them used in everyday conversations, and they helped get us to where we are today. But growth has a way of revealing gaps.

This Month in Datadog - February 2026

On the first episode of This Month in Datadog in 2026, Jeremy covers how you can protect agentic AI applications with AI Guard, stay up to date and collaborate during incidents with five Incident Management releases, and ship software with confidence using Feature Flags. Later in the episode, Kevin spotlights Datadog Data Observability, which enables you to detect data quality and pipeline issues early.