Latest Blogs

Announcing the Datadog Terraform provider v4.0.0

Mar 20, 2026 By David Iparraguirre In Datadog

Datadog supports managing its configuration as code through the Datadog Terraform provider. As platform engineering practices evolve, we are focused on making this provider more reliable and trustworthy at enterprise scale.

Instrument zero-code observability for LLMs and agents on Kubernetes

Mar 20, 2026 By Ishan Jain In Grafana

Building AI services with large language models and agentic frameworks often means running complex microservices on Kubernetes. Observability is vital, but instrumenting every pod in a distributed system can quickly become a maintenance nightmare. OpenLIT Operator solves this problem by automatically injecting OpenTelemetry instrumentation into your AI workloads—no code changes or image rebuilds required.

Monitor Model Context Protocol (MCP) servers with OpenLIT and Grafana Cloud

Mar 20, 2026 By Ishan Jain In Grafana

Large language models don’t work in a vacuum. They often rely on Model Context Protocol (MCP) servers to fetch additional context from external tools or data sources. MCP provides a standard way for AI agents to talk to tool servers, but this extra layer introduces complexity. Without visibility, an MCP server becomes a black box: you send a request and hope a tool answers. When something breaks, it’s hard to tell whether the agent, the server, or the downstream API failed.

Observe your AI agents: End-to-end tracing with OpenLIT and Grafana Cloud

Mar 20, 2026 By Ishan Jain In Grafana

In another post in this series, we discussed how to instrument large language model (LLM) calls. That is a good starting point, but generative AI workloads increasingly rely on agents: systems that plan, call tools, reason, and act autonomously. Their non-deterministic behavior makes incidents harder to diagnose, in part because the same prompt can trigger different tool sequences and costs.

How to monitor LLMs in production with Grafana Cloud, OpenLIT, and OpenTelemetry

Mar 20, 2026 By Ishan Jain In Grafana

Moving a large language model (LLM) application from a demo to a production‑scale service raises very different questions than the ones you ask when playing with an API key in a notebook. In production, you have to answer: How much is each model costing us? Are we keeping latency within our service‑level objectives? Are we accidentally returning hallucinations or toxic content? Is the system vulnerable to prompt‑injection attacks?

Best On-Call Management Software for Teams that Need Faster Response Time

Mar 20, 2026 By Ritika Bramhe In OnPage

Teams running modern infrastructure can’t afford slow incident response. On-call management software ensures the right person is alerted instantly, incidents are escalated intelligently, and downtime is minimized. This guide breaks down the best on-call management software for 2026, helping teams choose the right platform based on their specific use case, response requirements, and operational complexity.

Introducing MicroCloud Cluster Manager

Mar 20, 2026 By Miona Aleksic In Canonical

Today, we’re excited to introduce the beta release of MicroCloud Cluster Manager, a new way to discover, organize, and operate your MicroCloud environments from a single, unified interface. MicroCloud is an open source cloud platform that makes it simple to create lightweight, resilient clusters anywhere. As teams scale from one cluster to many, visibility and coordination quickly become essential. Cluster Manager is built to solve exactly that.

Seer fixes Seer: How Seer pointed us toward a bug and helped fix an outage

Mar 20, 2026 By Kush Dubey In Sentry

Seer is our AI agent that takes bugs and uses all of the context Sentry has to find the root cause and suggest a fix. We use it all the time to help us improve Sentry. Seer fixes Sentry. More recently, Seer has been helping us fix itself — Seer fixing Seer. An upstream outage triggered a bit of an avalanche, revealing a bug that had been hiding away for months. When it came time to fix it, Seer pointed us exactly where we needed to look.

Best Incident Management Tools & ITSM Practices to Reduce MTTR in 2026

Mar 20, 2026 By AlertOps In AlertOps

Here’s a scenario most IT teams know too well: a single error message lights up the monitoring dashboard at 2 a.m. Within seconds, calls are coming in from customers. Within minutes, the revenue meter is running. If your team is still figuring out who owns the incident while that meter ticks, you’ve already lost precious time. According to 2024 EMA Research, unplanned IT downtime now costs organizations an average of $14,056 per minute, rising to $23,750 per minute for large enterprises.

Error Monitoring for Elixir: Now in Scout APM

Mar 20, 2026 By Lance Erickson In Scout

Elixir’s “let it crash” philosophy is one of the best ideas in modern software design. Supervisors restart failed processes, the system self-heals, and life goes on. It’s like having a really good immune system. The problem is that a really good immune system can also hide chronic conditions. A GenServer crashing and restarting is working as designed.
