Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Easily connect any AI assistant (Claude, Codex, ...) to your Oh Dear data

Oh Dear keeps a watchful eye on your websites: uptime, performance, SSL certificates, broken links, DNS, cron jobs. If something can quietly break, we're already checking it for you. Today we're connecting that data to a new place: your AI assistant. We just shipped an MCP integration. If you use Claude, Cursor, or any other client that speaks the Model Context Protocol, you can now ask questions like "any broken links on my site?" or "when does my certificate expire?" in plain language.

Making Semantic Conventions Work for You With OpenTelemetry Weaver

Your dataset has hundreds of attributes. Some are self-explanatory: http.response.status_code, server.address. Others are not: meta.refinery.reason, dataset.slug, sli.latency_target_ms. If you don't know what an attribute means, you can't write a good query. And if an AI agent doesn't know what it means, it guesses.
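As a rough illustration of the gap described above, here is a minimal Python sketch that checks which attributes in a dataset have no description on record. It does not use Weaver or its registry format; the `REGISTRY` dictionary and the `undocumented` helper are hypothetical names made up for this example, and the two descriptions are loose paraphrases of the OpenTelemetry semantic conventions.

```python
from typing import Dict, Iterable, List

# Hypothetical in-memory registry: attribute name -> short description.
# Only the two self-explanatory attributes are documented here; the
# descriptions are paraphrased, not quoted from any specification.
REGISTRY: Dict[str, str] = {
    "http.response.status_code": "HTTP response status code",
    "server.address": "Server domain name, IP address, or Unix domain socket name",
}


def undocumented(attributes: Iterable[str],
                 registry: Dict[str, str] = REGISTRY) -> List[str]:
    """Return the attribute names that have no description, sorted by name."""
    return sorted(
        name for name in set(attributes)
        if not registry.get(name, "").strip()
    )


# Attribute names taken from the teaser above.
dataset_attributes = [
    "http.response.status_code",
    "server.address",
    "meta.refinery.reason",
    "dataset.slug",
    "sli.latency_target_ms",
]

missing = undocumented(dataset_attributes)
print(missing)  # ['dataset.slug', 'meta.refinery.reason', 'sli.latency_target_ms']
```

The names that come back are the ones a person or an AI agent would have to guess at when writing a query.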

What is an Enterprise Knowledge Graph? Definition, Benefits, and Use Cases

Are your AI systems giving answers your teams cannot trust? Most enterprises deploy LLMs expecting reliable outputs, but the results often feel inconsistent or incomplete. The problem is the missing structure behind the data. Enterprise data is usually fragmented across multiple systems, teams, and tools. Your AI does not understand how customers, products, policies, and operations connect. Without that context, it fills gaps with assumptions, which leads to unreliable results.

What is AI Agent Orchestration? Concept + How It Works

Have you tried using AI at work and felt it works well for small tasks, but not beyond that? It can handle simple things like creating a summary, writing a draft, or answering a question. This works because the task is clear. But most tasks are not that simple. They involve multiple steps. One step depends on another. Data comes from different systems, and some decisions need checks before moving ahead. This is where a single AI system starts to struggle.
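To make the idea of dependent steps and checks concrete, here is a small Python sketch of one possible way to order and run such a task. It is not taken from the article: the step names, the handlers, and the convention that a handler returns `None` to signal a failed check are all assumptions made for illustration. The ordering relies on the standard library's `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Optional, Set

# Hypothetical workflow: each step maps to the set of steps it depends on.
STEPS: Dict[str, Set[str]] = {
    "fetch_data": set(),
    "summarize": {"fetch_data"},
    "review_check": {"summarize"},
    "send_report": {"review_check"},
}

Handler = Callable[[Dict[str, str]], Optional[str]]


def run(steps: Dict[str, Set[str]], handlers: Dict[str, Handler]) -> Dict[str, str]:
    """Run steps in dependency order; stop as soon as a handler returns None."""
    results: Dict[str, str] = {}
    for name in TopologicalSorter(steps).static_order():
        # Each handler only sees the outputs of the steps it depends on.
        inputs = {dep: results[dep] for dep in steps[name]}
        outcome = handlers[name](inputs)
        if outcome is None:  # assumed convention: None means the check failed
            break
        results[name] = outcome
    return results


# Stand-in handlers; a real system would call models or other services here.
handlers: Dict[str, Handler] = {
    "fetch_data": lambda inputs: "raw rows",
    "summarize": lambda inputs: f"summary of {inputs['fetch_data']}",
    "review_check": lambda inputs: (
        inputs["summarize"] if len(inputs["summarize"]) < 100 else None
    ),
    "send_report": lambda inputs: f"sent: {inputs['review_check']}",
}

results = run(STEPS, handlers)
print(results["send_report"])  # sent: summary of raw rows
```

If the review check rejects the summary, the loop stops and the report is never sent, which is the kind of gate a single prompt cannot enforce on its own.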

Managing OpenTelemetry at Scale: Why OTel Pipelines Need a Control Plane

OpenTelemetry made telemetry possible everywhere – turning observability pipelines into distributed production infrastructure. Distributed infrastructure requires a control plane for inventory, governance, and safe change. At 500 collectors across hybrid environments, operational overhead becomes a production risk. The moment telemetry pipelines become distributed infrastructure, they inherit the operational problems of one.

Geo Maps: See Where Your Infrastructure Lives

When your infrastructure is spread across regions, data centers, branch offices, or edge locations, knowing where a node is physically located matters more than people usually admit. During an incident, “the node in the Singapore POP” communicates faster than a hostname. When you’re planning capacity, seeing geographic clustering tells you something that a flat list of nodes doesn’t.

Avantra 26: A Breath of Fresh Multi-Tenant AIR

There’s a crackle and spark in the air at Avantra lately, and I’m so pleased to be writing this bit on what we’ve accomplished with the Avantra 26 release. Automated root cause analysis, multi-tenant management support for Cloud ALM, enhanced security operations and financial operations monitoring for BTP – it’s all there, and more. It’s an exciting and innovative release for Avantra!

AWS outage takes down more than 150 cloud services

On May 7th and 8th, 2026, Amazon Web Services (AWS) experienced an outage affecting Amazon Elastic Compute Cloud (EC2) in the dreaded US East 1 region. The original AWS region, located in Northern Virginia and known as us-east-1 or just “US East”, has been the subject of some of the internet’s most high-profile and destructive outages and remains Amazon’s least reliable region.

Operational Intelligence and the Hidden Structure in System Logs

Most IT teams do not suffer from a lack of data. They suffer from the amount of effort required to make sense of it. Every network device, application, cloud service, and infrastructure component generates a constant stream of machine output. Logs capture state changes, failures, retries, warnings, and thousands of other small signals about how systems behave. The problem is that raw logs are hard to use at operational speed.
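One common way to surface structure in raw logs is to mask the variable parts of each line and count the templates that remain. The Python sketch below shows that idea under simple assumptions: only numbers and dotted numeric tokens such as IP addresses are treated as variable, and the sample lines are invented for illustration. It is not the method described in the article.

```python
import re
from collections import Counter
from typing import Iterable

# Matches integers and dotted numeric tokens such as IPv4 addresses.
_NUMBER = re.compile(r"\b\d+(?:\.\d+)*\b")


def template_of(line: str) -> str:
    """Replace every numeric token with a placeholder to expose the template."""
    return _NUMBER.sub("<*>", line)


def count_templates(lines: Iterable[str]) -> Counter:
    """Count how many raw lines collapse into each template."""
    return Counter(template_of(line) for line in lines)


# Invented sample lines, not taken from any real system.
sample = [
    "retry 1 of 3 for host 10.0.0.5",
    "retry 2 of 3 for host 10.0.0.5",
    "link down on port 7",
    "retry 1 of 3 for host 10.0.0.9",
]

counts = count_templates(sample)
for template, n in counts.most_common():
    print(n, template)
# 3 retry <*> of <*> for host <*>
# 1 link down on port <*>
```

Four raw lines collapse into two templates, which is a far smaller set to reason about during an incident.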