
Monitor schema health with engine.schema_fields: Structure, Drift, and Volatility

If you’ve worked with an observability pipeline, you’ve probably experienced schema problems: a field disappears, a type shifts from string to number, or a new label quietly appears. The causes are everywhere. Different teams adopt different naming conventions. A dependency upgrade changes the shape of a library’s log output. Over time, these small, reasonable decisions compound into schema sprawl: dashboards break, alerts misfire, and teams scramble to find out what happened.
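The three failure modes named above (a field disappearing, a type shifting, a new label appearing) can all be caught by diffing schema snapshots. The article's `engine.schema_fields` presumably does this for you; as a generic illustration only (not that API), a minimal sketch of the idea in Python:

```python
def diff_schemas(baseline: dict, current: dict) -> dict:
    """Compare two {field_name: type_name} snapshots and report drift."""
    added = sorted(set(current) - set(baseline))      # new labels that quietly appeared
    removed = sorted(set(baseline) - set(current))    # fields that disappeared
    retyped = sorted(                                 # e.g. string -> number shifts
        f for f in set(baseline) & set(current)
        if baseline[f] != current[f]
    )
    return {"added": added, "removed": removed, "retyped": retyped}

# Hypothetical snapshots for illustration.
baseline = {"status": "string", "latency_ms": "number", "host": "string"}
current = {"status": "number", "latency_ms": "number", "region": "string"}

report = diff_schemas(baseline, current)
print(report)
```

Running snapshots like these through scheduled diffs is what turns "a field quietly changed" from a dashboard outage into an alert you see first.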

Let's Tune Our AWS Aurora PostgreSQL Database

In case you don't know the back story: so that I can play with radios and call it "work," I've created a PostgreSQL database running on AWS Aurora. The database is fed by API calls to aprs.fi through Lambda functions on AWS. Some of the DDL and code is mine; some is from Claude. Neither of us paid much attention to indexing when we were putting things together.

What Your EKS Flow Logs Aren't Telling You

If you’re running workloads on Amazon EKS, there’s a good chance you already have some form of network observability in place. VPC Flow Logs have been a staple of AWS networking for years, and AWS has since introduced Container Network Observability, a newer set of capabilities built on Amazon CloudWatch Network Flow Monitor that adds pod-level visibility and a service map directly in the EKS console.

How A Finance Director Found $30K/Month In AI Savings In 10 Minutes

A real workflow showing how Claude + CloudZero MCP turns plain-English questions into actionable cost intelligence — no dashboards, no tickets, no waiting.

As Director of Finance and Accounting at a software company, my job can be described simply: understand what we’re spending, who’s responsible, and whether we can get more efficient. But as anyone who’s had to wrangle AI costs knows, doing so for AI is anything but simple.

Komodor Introduces Extensible, Autonomous Multi-Agent Architecture for AI-Driven Site Reliability Engineering

Out-of-the-box and bring-your-own AI agents that encode operational knowledge boost troubleshooting speed and accuracy across cloud-native infrastructure.

TEL AVIV and SAN FRANCISCO, March 18, 2026 — Komodor, the autonomous AI SRE company for cloud-native infrastructure, today announced a new extensibility framework that transforms its Klaudia AI technology into a universal multi-agent platform for troubleshooting and optimizing the performance of complex cloud-native infrastructures and applications.

Open standards in 2026: The backbone of modern observability

Open source software and open standards are now an essential part of how organizations maintain their systems. That's not to say they haven't always been important, but the fourth annual Observability Survey, brought to you by Grafana Labs, shows just how deeply the shift to open has taken hold, with 77% of respondents saying open source and open standards are important to their observability strategy.

AI in observability in 2026: Huge potential, lingering concerns

The role of AI in observability is evolving rapidly, but the data from our fourth annual Observability Survey makes one thing abundantly clear: the potential is real, and so are the reservations. Practitioners overwhelmingly see value in using AI to help surface anomalies, forecast and spot trends, assist with root cause analysis, and get new users up to speed more quickly.

How Catalog changes the game for long-term maintenance

Every incident platform needs to know who owns what. Which team owns which service. Which backlog to send follow-ups to. Which escalation path to page when something breaks. The problem is that most platforms encode this ownership logic separately in every configuration: alert routing, workflows, ITSM ticket syncing, and more. Each one maintains its own copy of the same information, in its own format.
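The fix a catalog offers is a single source of truth: one record per service that every feature reads from. The actual Catalog data model is not shown in the excerpt, so the following Python sketch, with entirely hypothetical service and team names, only illustrates the shape of the idea:

```python
from dataclasses import dataclass

@dataclass
class ServiceEntry:
    team: str        # owning team
    backlog: str     # where follow-up tickets are filed
    escalation: str  # who gets paged when the service breaks

# One entry per service. Alert routing, workflows, and ITSM ticket
# syncing all read this record instead of keeping their own copies.
CATALOG = {
    "payments-api": ServiceEntry("team-payments", "PAY", "payments-oncall"),
    "checkout-web": ServiceEntry("team-storefront", "STORE", "storefront-oncall"),
}

def page_target(service: str) -> str:
    """Escalation path lookup used by alert routing."""
    return CATALOG[service].escalation

def ticket_backlog(service: str) -> str:
    """Backlog lookup used by ITSM ticket syncing."""
    return CATALOG[service].backlog

print(page_target("payments-api"))
print(ticket_backlog("checkout-web"))
```

The payoff for long-term maintenance is that an ownership change is one edit to one record, not a hunt through every routing rule and workflow that duplicated it.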

How to design cloud environments for AI-powered threat analysis

Cloud environments generate high volumes of security signals every day. For each one, you have to determine whether it’s benign, a clear false positive, or something worth investigating. The challenge is that these calls must be made continuously, often without knowing whether any single event is part of a larger attack. Spending too much time investigating benign activity reduces the ability to detect threats elsewhere, and missing a legitimate threat has clear consequences.
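The three-way call described above (benign, false positive, worth investigating) is at heart a classification step. As a deliberately naive sketch, with made-up signal attributes and thresholds that a real pipeline would replace with richer context:

```python
def triage(signal: dict) -> str:
    """Score a security signal and route it to one of three outcomes."""
    score = 0
    if signal.get("source_known_bad"):        # hit on threat intel
        score += 3
    if signal.get("privilege_escalation"):    # suspicious behavior
        score += 2
    if signal.get("matches_allowlist"):       # known-benign pattern
        score -= 3
    if score >= 3:
        return "investigate"
    if score <= 0:
        return "benign"
    return "review-queue"

print(triage({"source_known_bad": True}))
print(triage({"matches_allowlist": True}))
print(triage({"privilege_escalation": True}))
```

A scheme this simple cannot see whether events belong to a larger attack, which is exactly the gap the article argues AI-powered analysis (with the right environment design) is meant to close.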