Introducing the StatusGator MCP Server

Your AI agents can now monitor, triage, and respond to cloud outages autonomously. The way enterprises manage cloud infrastructure incidents is changing. AI agents are no longer just chatbots answering questions — they’re becoming first responders in your incident management pipeline. Today, we’re launching the StatusGator MCP Server, giving AI agents direct, structured access to the full power of StatusGator’s cloud status monitoring platform.

AI Needs Better Inputs: Why Observability Is Becoming the Foundation of Enterprise AI Maturity

Organizations across industries are accelerating their investments in AI for operations, yet the path to meaningful impact is proving far more complex than early expectations suggested. Analysts at Gartner, Forrester, Deloitte, and McKinsey continue to highlight the same structural barrier. AI cannot produce accurate predictions or safe automation when the operational data feeding it is fragmented, incomplete, or inconsistent.

Fear, Identity & Flaky Tests: AI in Reliability w/ Dana Lawson (CTO, Netlify)

The self-healing systems that SREs have dreamed about for a decade aren't a distant promise anymore — they're already being built, and the biggest barrier left is cultural. Dana Lawson, CTO at Netlify, has spent over 25 years in the trenches of developer infrastructure, from sysadmin roots to running the platform that powers 5% of the internet.

KubeCon Europe 2026: AI Is Shipping Code Faster Than Orgs Can Govern It

KubeCon + CloudNativeCon Europe 2026 recently brought the cloud native community to Amsterdam. We were there all week bouncing between the booth, a Braintrust event with engineering leaders from across the community, and more hallway conversations than we can count. One talking point dominated the week: AI is shipping code faster than most engineering orgs can govern it. It also became clear that we weren't the only ones talking about this challenge.

Harness Ships Five Capabilities to Power Confident Releases at AI Speed | Harness Blog

The pace of AI-assisted development has outgrown how most teams actually ship. Harness is closing that gap. Engineering teams are generating more shippable code than ever before — and today, Harness is shipping five new capabilities designed to help teams release confidently. AI coding assistants lowered the barrier to writing software, and the volume of changes moving through delivery pipelines has grown accordingly. But the release process itself hasn't kept pace.

Pull Request Velocity as a Proxy for AI Usage for Software Development

While AI usage has been growing steadily for the last several years, LLMs noticeably improved around the end of 2025. Specifically, they became more viable for software development. We are seeing the results: feature and product delivery has picked up. One way to visualize this is by looking at the number of pull requests for your organization or software development teams. This chart shows the number of GitHub pull requests created by a team. Can you spot when AI usage increased?
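As a rough illustration of the pull request count described above, here is a minimal sketch that buckets PR creation dates by month and prints a text bar chart. The function name `prs_per_month` and the sample dates are hypothetical, invented for this example; they are not taken from the article's chart or from any real team.

```python
from collections import Counter
from datetime import date


def prs_per_month(created_dates):
    """Count pull requests per calendar month, keyed as 'YYYY-MM'."""
    counts = Counter(d.strftime("%Y-%m") for d in created_dates)
    # Sort by month so the output reads chronologically.
    return dict(sorted(counts.items()))


# Hypothetical creation dates, invented for illustration only.
sample = [
    date(2025, 10, 3), date(2025, 10, 21),
    date(2025, 11, 5), date(2025, 11, 18), date(2025, 11, 27),
    date(2025, 12, 2), date(2025, 12, 9), date(2025, 12, 15), date(2025, 12, 30),
]

# One '#' per pull request gives a quick visual of the trend.
for month, count in prs_per_month(sample).items():
    print(month, "#" * count)
```

In practice the dates would come from an export of your own repository history; the bucketing step stays the same.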

Accelerate Your OpenTelemetry Migrations With Honeycomb's Agent Skills

Since releasing our hosted MCP server last year, we've been thrilled to see customers not just adopt it but build Honeycomb deeply into their agentic development and observability workflows. Users have embraced it, leveraging Honeycomb to stay in conversation with their code and understand how it runs in production.

Mastering AI Prompts: How to Get the Best Out of SQL Prompt AI | The Tony and Tonie show Ep41

How to get the most value from SQL Prompt AI in day-to-day work, whether you're writing new queries or improving existing code. A little prompt-writing knowledge goes a long way with SQL Prompt AI. Tony and Tonie discuss how to build reusable prompts that give the tool the context it needs to return useful results first time.

AI Coding Agents Break What Works

Your AI coding agent just made every test pass. Ship it, right? Not so fast. A growing class of AI-generated bugs doesn’t come from writing bad code. It comes from the AI changing working code to accommodate its own mistakes. This isn’t a theoretical risk. It’s happening now, in production codebases, and it’s harder to catch than any bug the AI might introduce from scratch.

The SaaS Paradox: Why Companies Must Spend More On AI To Survive

At SaaS Metrics Palooza 2025, CloudZero CEO Phil Pergola delivered a keynote on the software industry’s most pressing question: can SaaS survive the AI revolution, or will AI rewrite the SaaS playbook outright? Phil’s answer wasn’t doom and gloom, but he didn’t sugarcoat the challenges. “Churn rates are up,” he told moderator Ray Rike of Benchmarkit on Oct. 9, 2025. “The payback from a customer acquisition cost perspective is taking longer.

Claude Livecaster Is Now Open Source, Plus a Two-Voice Broadcast Mode | CircleCI Loop Lab

Claude Livecaster is now public on CircleCI Research. In this update, Ryan Hamilton walks through the newly open-sourced repo, seven built-in simulation scenarios, and a new two-voice broadcast format featuring an anchor and a field correspondent narrating the action together. The demo scenario: Pipeline Wars, six CI pipelines racing across three providers, with Claude providing live color commentary on every Docker build failure, OOM kill, and production rollout.

We Made Claude Narrate an AI Model Race Like a Sports Commentator | Loop Lab

What if you didn't have to stare at logs while your AI agent worked? In this Loop Lab experiment, Ryan Hamilton built Claude Livecaster, a tool that gives Claude a live voice to narrate long-running agentic processes like a sports commentator. The demo: six AI models (GPT, Gemini, and Claude variants) race through a CI/CD benchmark, and Claude calls the whole thing play-by-play. Rate limit hits, comeback stories, photo finishes, all of it, out loud.

Top 7 AI/ML Development Companies for Enterprise Solutions in 2026

By 2026, most enterprises have moved beyond the proof-of-concept stage of AI. A demo may be easy to deliver, but deploying an autonomous agent in a production environment introduces challenges around data sanitization, system integration, and inference cost management.

The Modern Incident Management Playbook: From Alert Fatigue to AI-Driven Orchestration

A complete guide to modern incident management and how it’s transforming into a strategic business function. By Kamalesh Srikanth, Product Strategy Leader at AlertOps. If you’ve worked in IT, infrastructure, or operations for any length of time, you’ve lived through the chaos of a critical incident. Systems down, alerts blaring, Slack pinging, emails piling up, and somewhere in that noise, your team is trying to figure out what actually broke and how to fix it fast.

Enhancing our API for better agentic consumption

AI coding agents like Claude Code and Codex are becoming a real part of developer workflows. They don't just write code, they call APIs, interpret responses, and take action based on what they find. That means the quality of your API responses directly affects how useful an agent can be. We've shipped a series of improvements to the Oh Dear API with this in mind. Every change helps humans too, but we specifically optimized for how agents consume and reason about data.

Transform ticket hell into smooth operations #ITSM #AI

Infraon ITSM uses advanced AI capabilities to manage operational noise, significantly boosting business efficiency. It features a robust ticketing system and SLA management for prompt resolutions, alongside self-service portals and a comprehensive knowledge base to enhance the service desk experience.

How Much Does It Cost To Keep Up With The AI Joneses?

I’ve been an engineering leader for over a decade, and I’ve spent most of those years in private Slack groups with other engineering leaders, comparing strategies and kvetching about Kubernetes. Of the hundreds of threads I’ve taken part in, the one that got the most engagement the fastest was a recent one around AI adoption. “Where are you on this continuum?”, it read. “A. You don’t really care how people use AI; B. You push people to use AI; or C.

Winning in the AI Era: How Top Teams are Driving Their Velocity Gains with Alloy & Chime

While most teams struggle with the complexity of AI-generated code, Alloy and Chime have built internal cultures and processes that enable them to scale their development while maintaining quality. Join CircleCI’s CTO, Rob Zuber, in conversation with Maciej Makowski, Senior Software Developer at Chime, and Sunny Singh, Senior Software Engineer at Alloy, as they explore the dynamics that set their teams apart. They'll talk through the culture and delivery practices that actually moved the needle.

Observability and Security for the AI Era

Datadog has always been driven by a broader vision of helping teams understand and operate complex systems. In this session, you’ll hear from Yrieix Garnier, VP of Product, and Hugo Kaczmarek, Senior Director of Product, as they share the latest updates across the Datadog product suite and discuss how that vision continues to shape the platform’s evolution and support the next generation of AI-driven applications.

AI Cost Management: How To Track, Allocate And Optimize AI Spend

AI cost management is the practice of tracking, allocating, and optimizing the cloud infrastructure costs tied to building, running, and scaling AI workloads. It differs from traditional cloud cost optimization because AI infrastructure behaves differently at every layer of the stack. The biggest problem isn’t overspending. It’s that most organizations can’t see where their AI spending is going.

Investors Balance Growth Potential and Structural Risks in Apple Ecosystem

The smartphones, smart devices, and ecosystem services market remains under pressure due to technological limitations and ongoing structural changes at companies such as Apple. Despite a 4% decline in smartphone sales in China during the first two months of 2026, the company managed to increase iPhone sales by 23%, driven by seasonal discounts and subsidies on the base iPhone 17 model.

How to Translate YouTube Videos: Tools and Best Practices

Most creators don't think about translation until they open their analytics one day and see traffic coming in from Brazil, Germany, or Japan. And they just sit there staring at it like, wait, people actually want to watch this? In a different language? That's usually the moment it all clicks. The good news is that tools built to translate YouTube video content have gotten genuinely good. Not "impressive for a computer" good. Actually good. Dubbed audio that sounds natural, lip sync that holds up, and a workflow that doesn't require a team or a big budget to pull off.
Sponsored Post

AlmaIQ brings unparalleled level of efficiency and effectiveness for IT teams using Collective IQ

AlmaIQ, the intelligent self-service agent for employees, just received an incredible boost that expands its role to uniquely help IT teams. Interacting with users through Microsoft Teams, AlmaIQ answers questions about devices and internal processes in natural language. Whereas that intelligence simplified employees' lives on the job, it now enables IT teams to interact with Collective IQ at the level of departments, groups, and collections of devices to spot patterns and trends. The overall result: vastly more productive operations and satisfied employees.

Women's Day Panel: Navigating the Future of Engineering in the Age of AI

How is AI reshaping engineering—and what does it mean for the future of work? At our first GTA Boston Hub event of the year, we brought together engineering leaders from Boston Consulting Group and Athenahealth to dive into one of the most pressing topics today: the rise of generative AI. In this panel, we explore: Key takeaway: This isn’t “human vs AI”—it’s human augmented by AI. The real advantage lies in how we adapt, collaborate, and lead in this new era.

Groq vs. GPUs: The future of AI inference in 2026

Back in 2016, Jonathan Ross founded Groq, the AI chip startup, which went on to enter a non-exclusive licensing agreement with NVIDIA for Groq’s inference technology (as part of a $20 billion deal). The name ‘Groq’ is commonly confused with X (formerly Twitter)’s Grok, which was launched in 2023 as a Gen AI chatbot. As demand for real-time AI continues to grow, inference has become one of the most important and expensive parts of the machine learning lifecycle.

Why This Fortune 500 Chose Agentic AI Over Traditional AIOps

What does real enterprise-ready Agentic AI look like in production? In this video, we break down how a Fortune 500 enterprise used Fabrix.ai’s Agentic AI platform to detect, diagnose, and resolve a critical application issue in just 5 minutes—without moving their data or replacing existing tools. If you're exploring Agentic AI, AIOps, or enterprise automation, this is a must-watch.

Getting Scout Data Into Your AI Workflow

If you’ve spent any time in developer tooling lately, you’ve probably noticed a pattern: every product is rushing to add a chatbot, an AI summary, or some kind of “magic” button. We get it — it’s tempting. But at Scout, we’ve been deliberately taking a different approach. Instead of building AI into our product first, we’ve focused on making Scout’s data accessible to the AI tools you’re already using.

QA, AI, and the return of the adversarial mindset

The best QA engineers are always asking themselves (and others around them) what might break. When engineering teams shifted to agile delivery, that mindset largely moved out of dedicated roles and into the background. Automated testing took over the repetitive work, developers owned quality end-to-end, and velocity improved. What didn't carry over was the habit of looking at a feature and asking how a real user, an edge case, or unexpected load might expose it.

#054 - From Shiny Objects to FinOps: Taming Cloud Costs in the AI Era with Josh Schlanger (CloudX...

In this episode of the Kubernetes for Humans podcast, we are joined by infrastructure and FinOps expert Josh Schlanger. Drawing on over 15 years of experience across Martech, e-commerce, and health tech, Josh shares why solving core business problems should always take priority over chasing new, "shiny object" technologies.

Jensen Huang's warning: lead the AI transition - or finance it

The wrong people got the most attention from Jensen Huang’s comments last week. Huang told the All-In Podcast that he’d be “deeply alarmed” if a $500,000 engineer consumed less than $250,000 in AI tokens annually. Within 48 hours, the discourse collapsed into a compensation debate.

AI Deployment in Production: Orchestrate LLMs, RAG, Agents | Harness Blog

For the past few years, the narrative around Artificial Intelligence has been dominated by what I like to call the "magic box" illusion. We assumed that deploying AI simply meant passing a user’s question through an API key to a Large Language Model (LLM) and waiting for a brilliant answer.

LiteLLM Compromise: Securing AI Pipelines from PyPI Supply Chain Attacks | Harness Blog

On March 24, 2026, the AI open-source ecosystem was impacted by a critical supply chain attack involving the widely used Python package LiteLLM. Attackers compromised the LiteLLM PyPI distribution pipeline and published malicious versions (notably in the 1.82.7-1.82.8 range), embedding a multi-stage payload designed to steal credentials and execute remote code.
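To make the version range above concrete, here is a minimal sketch that checks whether a plain numeric version string falls between 1.82.7 and 1.82.8 inclusive. The helper names (`parse_version`, `in_reported_range`, `installed_version`) are hypothetical, and the bounds are only the range this excerpt mentions ("notably"), not a complete list of affected releases; an official advisory would be the authority.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison.

    Assumption: purely numeric versions; pre-release suffixes are not handled.
    """
    return tuple(int(part) for part in text.split("."))


# Bounds taken from the range quoted in the excerpt; treat as illustrative.
REPORTED_LOW = parse_version("1.82.7")
REPORTED_HIGH = parse_version("1.82.8")


def in_reported_range(version_text):
    """Return True if the version lies inside the quoted range, inclusive."""
    return REPORTED_LOW <= parse_version(version_text) <= REPORTED_HIGH


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


found = installed_version("litellm")
if found is None:
    print("litellm is not installed in this environment")
else:
    print("litellm", found, "is installed; compare it against the official advisory")
```

Comparing integer tuples rather than raw strings matters here: as strings, "1.82.10" would sort before "1.82.8".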

Datadog achieves ISO 42001 certification for responsible AI

As AI-powered products and services become central to how organizations operate, the need for responsible AI governance has never been greater. Customers, partners, and regulators are seeking assurance that AI systems are built, managed, and monitored responsibly and effectively. Datadog is committed to the responsible use of AI, both in how we build our products and in how we help customers observe their AI workloads.

Introducing Bits AI Dev Agent for Code Security

As organizations adopt AI-assisted development and increase their release velocity, they are not only generating more code but also finding more vulnerabilities from static analysis. The traditional remediation workflow of manually triaging issues, creating tickets, and opening individual pull requests (PRs) cannot keep pace. Fixing tens of thousands of vulnerabilities one by one is not a viable remediation strategy.

How to Reduce MTTR with AI

The quick download: AI reduces MTTR by helping teams detect issues sooner, pinpoint root causes faster, and resolve incidents with less manual effort. IT downtime costs organizations an average of $9,000 per minute. AI-powered observability can cut incident resolution time by up to 70%. Here’s what it takes to get there. Every minute an incident goes unresolved, the meter is running.
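The two figures quoted above can be combined into a quick back-of-the-envelope calculation. The sketch below assumes a hypothetical 60-minute incident and treats the 70% figure as a best case ("up to"), so the result is an upper bound on savings rather than a prediction.

```python
COST_PER_MINUTE_USD = 9_000    # average downtime cost quoted above
MAX_REDUCTION_PERCENT = 70     # "up to 70%", so a best-case figure


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Cost of an incident that lasts the given number of minutes."""
    return minutes * cost_per_minute


def best_case_minutes(minutes, reduction_percent=MAX_REDUCTION_PERCENT):
    """Incident duration if resolution time shrinks by the full percentage."""
    return minutes * (100 - reduction_percent) / 100


# Hypothetical incident length, chosen only for illustration.
baseline_minutes = 60
improved_minutes = best_case_minutes(baseline_minutes)

print("baseline cost (USD):", downtime_cost(baseline_minutes))
print("best-case cost (USD):", downtime_cost(improved_minutes))
print("best-case saving (USD):",
      downtime_cost(baseline_minutes) - downtime_cost(improved_minutes))
```

Under these assumptions a one-hour incident drops from $540,000 to $162,000, which is why even partial reductions in resolution time are worth measuring.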

Checkly and the Agentic Software Layer

On November 24th, the Opus 4.5 release turned the entire tech industry around. This was the moment when agents became capable. Capable enough to write solid staff-level code. Capable enough to reason about alerts, investigate root causes much faster than most engineers, and set up the reliability layer faster. For me, this feels like an iPhone moment on steroids; the adoption of AI is accelerating much faster than any adoption curve I’ve seen over the past few decades.

The Role of Automation in Modern Financial Planning

Look, the financial sector's evolving at breakneck speed. If you're clinging to manual processes, you've probably noticed the pressure mounting. Today's financial planning landscape bears little resemblance to what existed even five years ago. Clients demand immediate responses, markets pivot without warning, and honestly, spreadsheet mistakes just aren't acceptable anymore.

What Are AI Inference Costs? [And How To Manage Them]

If you’re building or running AI-powered features in production, you need a clear understanding of inference costs. Get it right, and you can turn your AI investments into profitable growth. As Larry Advey, Director of Cloud Platform and FinOps at CloudZero and a member of the FinOps Foundation Technical Advisory Council, puts it: “AI investments will only continue to grow.

NVIDIA DGX vs. NVIDIA HGX: What is the difference?

While GPUs remain among NVIDIA's flagship products, they also offer a range of other compute products beyond the dedicated graphics cards for which they are known. If you are unfamiliar with the words DGX or HGX, this blog is for you. Throughout this blog, we will cover what these terms mean in practice and when you should be using them.

ROI of AI: How CIOs Measure Real Business Impact

Since its advent, Artificial Intelligence (AI) has become the buzzword for modern-day businesses. Its tremendous benefits have lured enterprises into investing heavily with a view to getting ahead of their competitors. Yet many CIOs are still figuring out how to get an ROI from AI that resonates with their businesses. While many initial programs and proofs of concept show promise, in the long run they fail to deliver on it.

Securing the Future: Scaling AI, Sovereignty, and Resilience in ANZ ITOps

Enterprises in Australia and New Zealand are accelerating AI adoption, driven by strong digital trust frameworks. To remain competitive and compliant, the IT Operations (ITOps) landscape must evolve to manage hybrid complexity and persistent cyber risks. Join us for an exclusive, in-depth webinar as IDC and SolarWinds explore the strategic investments and unique challenges shaping future-proof ITOps across the ANZ region.

How Harness AI Helps Scale Platform-Wide Support | Harness Blog

Key takeaway: Harness AI helped deflect 95% of the platform support tickets for a major financial institution. These days, success is often measured by what doesn’t happen: when things go right, the software delivery platform is invisible. But what happens when an organization’s delivery velocity increases multifold? Can the platform still stay out of the way?

An Oh Dear skill for use in Claude Code or Codex

AI coding agents are getting good at calling tools. Claude Code, Codex, and others can run shell commands, parse JSON, and reason about the results. But they need to know what tools are available and how to use them. That's what skills are for. A skill is a small package of documentation that teaches an AI agent how to use a specific tool. We've built one for Oh Dear.

Smarter Alerts, Faster Root Cause, & Proactive IT Ops with SolarWinds AI Observability

Discover how AI is transforming IT operations with SolarWinds Observability. In this video, we showcase powerful new AI-driven features designed to help you detect issues faster, reduce alert noise, and stay ahead of performance problems across your entire stack. From applications and databases to networks, cloud infrastructure, and end-user experience, SolarWinds AI delivers deep insights where it matters most.

When Code Becomes Cheap: The New Reliability Constraint in Software Engineering

For most of the history of software engineering, the primary constraint was production. Code was expensive, skilled engineers were scarce, and shipping features required concentrated human effort. Velocity was limited by how fast people could reason, implement, test, and deploy. That constraint shaped everything from team size, architecture, and release cadence through to how we thought about technical debt. When production is expensive, you optimise for output. You remove friction from shipping.

CloudZero Brings Cloud Cost Intelligence to 13 AI Coding Tools - Cursor, Copilot, and More

Earlier this month, we announced the CloudZero Claude Code Plugin and the CloudZero AI Hub — the first step toward putting your cloud cost data directly inside the AI tools your team already uses. The feedback from customers was clear. They said engineers and FinOps teams wanted more tools and more ways to get answers from CloudZero without switching context. Today, we’re delivering more.

7 Techniques Supporting Consistent Quality Across Web Graphics

Digital media moves fast. Maintaining a visually appealing site requires a well-defined plan. High-quality graphics build trust with your users. They keep them engaged longer. When images look pixelated or messy, your professional image suffers. You need a set of rules to keep every visual element looking its best. These techniques help you manage assets without losing speed or clarity. Focusing on a few key areas makes a big difference in how your audience sees your work. Let's explore how to maintain sharp and professional web graphics.

5 Ways ShyftOff Simplifies Contact Center Operations and Improves Customer Experience

Contact centers are at the heart of customer perception regarding a certain brand. For instance, if the experience is positive, the customer feels that he or she is being well cared for. However, it is not an easy task to manage agents, balance the volume of calls, and ensure that the service is of high quality. Many organizations face difficulties in scheduling, performance measurement, and making sure that each customer is served in an efficient manner. ShyftOff is here to help organizations deal with these complexities in an intelligent manner that will improve the customer experience.

Emerging Cyber Threats Every Organization Should Know

Cyber threats in 2026 are evolving faster than most organizations can comfortably manage. Attackers are using automation, artificial intelligence, and scalable attack models to target businesses of every size. What used to be handled in isolation by IT teams is now a boardroom concern. A single breach can disrupt operations, damage trust, and create long-term financial consequences. Leaders are starting to recognize that cybersecurity is not just about tools but about strategy, governance, and accountability across the organization.

Meet Your Virtual Responder: PagerDuty's SRE Agent for AI-Driven Reliability

Modern SRE teams face an overwhelming challenge: too many signals, too little time. Incidents are faster, systems are more complex, and reliability targets only get stricter. What if you had a teammate who could jump in instantly—context-aware, tireless, and armed with your runbooks, metrics, and alert data? Introducing PagerDuty’s SRE Agent, the next evolution in AI-driven operations.

How a Runtime Aware AI SRE Agent Transforms System Reliability

A runtime-aware AI SRE extends existing AI SRE approaches by moving beyond telemetry correlation into runtime-validated reliability. While the majority of AI SRE tools accelerate incident triage using logs, metrics, and traces, they cannot confirm execution behavior if critical runtime signals were never captured. By generating on-demand evidence inside running services, AI SREs can eliminate slow redeploy cycles, ensuring your distributed systems remain resilient under real-world traffic conditions.

AI, Anxiety & 400 Open Windows: GEOFF WRIGHT RETURNS

Geoff Wright returns to unpack the messy reality of work in the AI era. From having 400 windows open and feeling less productive, to explaining why AI should fuel curiosity rather than replace human judgment, Geoff brings his usual mix of optimism, humor, and hard-earned perspective. The conversation explores prompt engineering, digital overwhelm, enterprise adoption, and why “being human first” matters more than ever. It is a wide-ranging, thoughtful discussion on anxiety, complexity, and the promise of AI, with a surprisingly funny detour into why the robots might eventually just leave Earth for Pluto.

Multi-Agent AI SRE Has Landed and It's Built for Your Most Complex Stacks

Once upon a time, a monolith running on a handful of servers meant that incident management, even at 2:17 AM, was something a single generalist could handle. One person with enough context across the stack could reasonably diagnose whether the database was choking, a config had changed, or a server was running hot. They’d fix it and go back to sleep.

Stop Vibe Coding Everything: The Case for Spec-Driven Dev

Spec-driven development with AI coding agents could change how you build software. In this GitKon 2025 talk, Erik Hanchett, Senior Developer Advocate at AWS, breaks down why AI coding assistants perform dramatically better when they start with structured specifications instead of raw prompts. If you've been vibe coding your way through complex features and wondering why your AI keeps going off the rails, this is the video for you.

AI in DevOps: How MCP and Puppet Are Changing Infrastructure Automation

AI adoption in DevOps is accelerating, but trust, accuracy, and real-world usability still matter. In this conversation, Jason St-Cyr sits down with Jessica Gao, Product Manager at Puppet, to unpack how AI is actually being used in infrastructure and operations teams today, and what’s changed over the last 12–18 months. They dive into why enterprises are moving past generic code generation tools and toward domain-specific, MCP-powered AI that integrates directly into existing workflows.

Nano Banana 2 API in Production: Real Use Cases and Why APIPASS Makes It Accessible

The first question is not which model in Google's Nano Banana family looks better on a benchmark, but which one you should actually ship with. Nano Banana Pro has always had the luxury edge: higher reasoning, maximal photorealism, studio-grade fidelity. Nano Banana 2, based on Gemini 3.1 Flash Image, came with an entirely different promise: Pro-level world knowledge and output quality on Flash-speed infrastructure at penny-pinching prices.

FastAPI Testing: Mock LLM APIs for Free

Testing a FastAPI app that calls OpenAI, Anthropic, or Gemini gets expensive fast. The problem is not just the API bill in production. It is all the repeated traffic in development: prompt tweaks, CI runs, regression checks, and the load tests you keep putting off because every run burns tokens. Hand-written mocks do not help much once the app is doing multi-step LLM work.

Birol Yildiz on Autonomous Incident Response and the Future of AI SRE | Harness Blog

At SREday NYC 2026, the ShipTalk podcast welcomed Birol Yildiz, Co-founder and CEO of ilert, for a conversation about the next evolution of incident response. In the episode, ShipTalk host Dewan Ahmed, Principal Developer Advocate at Harness, spoke with Birol about how artificial intelligence is transforming reliability engineering—from simply assisting engineers during incidents to autonomously diagnosing and resolving outages.

Observability Lessons From OpenAI

Writing code is moving from the good old IDE into the realm of autonomous AI agents. One example of this is OpenAI, which has been developing internally with 0 lines of manually written code. You can read about their workflow in their engineering blog: Harness engineering: leveraging Codex in an agent-first world. For me, the main takeaway of OpenAI’s article is how AI has rewritten the constraints equation.

70% to 90% of AI Projects FAIL. Here's Why.

Why are so many modern AI initiatives falling short of their ROI? In this episode of iOPEX, Malcolm Lett (Technical Lead) breaks down the critical mistakes companies make when implementing AI and how to choose the right tools for real success. Most organizations treat Generative AI as a "one-size-fits-all" solution, but it’s only one piece of the puzzle. Malcolm explores the four essential domains you need to balance to build a winning strategy.

How Vibe Coding A Self-Help App Made Me An AI Believer

For longer than I’m proud of, I was an AI skeptic. Then, over the holidays, I vibe coded an app whose sole purpose was to make me a better person. The app is a motivator. It’s programmed to send me timely reminders along certain themes, like reading every day, making healthy eating choices, and giving myself plenty of time to plan for anniversaries and birthdays.

NVIDIA's Jensen Huang just described your next big cost problem

On March 18, Jensen Huang took the stage at NVIDIA’s GTC conference in San Jose for a keynote that ran well over two hours — covering everything from CUDA’s 20-year history to humanoid robots that may one day wander Disneyland. But buried inside the spectacle was a remarkably clear-eyed articulation of the economic forces now bearing down on every enterprise that builds on cloud infrastructure.

Annotate traces to improve LLM quality with Datadog LLM Observability

LLM applications rarely crash. They degrade quietly. Once these applications are shipped to production, subtle quality failures become harder to catch with traditional signals. Tone shifts, hallucinated details, off-topic responses, and incomplete reasoning can emerge while latency and token usage look stable.

Why AI Driven Automation Can't Wait

Operators today are navigating unprecedented complexity—rising costs, accelerating customer expectations, and increasingly dynamic networks. In this recent video interview, my colleague Kevin Wade and I explore why AI‑driven automation has shifted from a “nice‑to‑have” technology to a core business requirement for telecom operators and beyond.

How OpenRouter and Grafana Cloud bring observability to LLM-powered applications

Chris Watts is Head of Enterprise Engineering at OpenRouter, building infrastructure for AI applications. Previously at Amazon and a startup founder. As large language models become core infrastructure for more and more applications, teams are discovering a familiar challenge in a new context: you can't improve what you can't see.

Introducing Calico Load Balancer and Seamless VM-to-Kubernetes Migration

SAN JOSE, Calif., March 23, 2026 — Tigera, the creator and maintainer of Project Calico, today announced a major expansion of its Unified Network Security Platform for Kubernetes, aimed at helping enterprises consolidate infrastructure and accelerate the migration of legacy workloads to cloud-native platforms.

The Hidden AI Bill: Why Non-Prod LLM Costs Spiral

Most teams know they are spending money on AI in production. Far fewer realize how much they are spending outside production. It’s easy to get lost as you evaluate which model has the best responses, is fast enough, and cheap enough to run in production. That is because the AI bill usually shows up as a giant blob. It is easy to see the total.

FinOps Leaders Who Will Win The AI Era Are Already Experimenting

Engineering teams are shipping faster than ever. AI coding tools like Claude Code and OpenAI’s Codex have quietly removed some of the biggest friction points in the development cycle — and the result is that FinOps teams are being asked to keep up with a pace most practitioners haven’t fully reckoned with yet. That acceleration has a cost consequence. More shipping means more services, more experiments, more infrastructure spun up without review cycles.

Instrument zero-code observability for LLMs and agents on Kubernetes

Building AI services with large language models and agentic frameworks often means running complex microservices on Kubernetes. Observability is vital, but instrumenting every pod in a distributed system can quickly become a maintenance nightmare. OpenLIT Operator solves this problem by automatically injecting OpenTelemetry instrumentation into your AI workloads—no code changes or image rebuilds required.

Monitor Model Context Protocol (MCP) servers with OpenLIT and Grafana Cloud

Large language models don’t work in a vacuum. They often rely on Model Context Protocol (MCP) servers to fetch additional context from external tools or data sources. MCP provides a standard way for AI agents to talk to tool servers, but this extra layer introduces complexity. Without visibility, an MCP server becomes a black box: you send a request and hope a tool answers. When something breaks, it’s hard to tell if the agent, the server or the downstream API failed.

Observe your AI agents: End-to-end tracing with OpenLIT and Grafana Cloud

In another post in this series, we discussed how to instrument large language model (LLM) calls. This can be a good starting point, but generative AI workloads increasingly rely on agents, which are systems that plan, call tools, reason, and act autonomously. Their non‑deterministic behavior makes incidents harder to diagnose, in part because the same prompt can trigger different tool sequences and costs.

How to monitor LLMs in production with Grafana Cloud, OpenLIT, and OpenTelemetry

Moving a large language model (LLM) application from a demo to a production‑scale service raises very different questions than the ones you ask when playing with an API key in a notebook. In production, you have to answer: How much is each model costing us? Are we keeping latency within our service‑level objectives? Are we accidentally returning hallucinations or toxic content? Is the system vulnerable to prompt‑injection attacks?

Seer fixes Seer: How Seer pointed us toward a bug and helped fix an outage

Seer is our AI agent that takes bugs and uses all of the context Sentry has to find the root cause and suggest a fix. We use it all the time to help us improve Sentry. Seer fixes Sentry. More recently, Seer has been helping us fix itself — Seer fixing Seer. An upstream outage triggered a bit of an avalanche, revealing a bug that had been hiding away for months. When it came time to fix it, Seer pointed us exactly where we needed to look.

Harness AI for Argo CD

Managing GitOps at scale shouldn’t feel like an endless game of "Whac-A-Mole." In this 3-minute demo, we show how Harness AI moves beyond simple syncs to provide agentic troubleshooting and automated orchestration for your entire GitOps estate. Watch as we use the Harness DevOps Agent to: Identify Common Failure Patterns: Instead of clicking through individual clusters, we ask the AI to analyze 4 out-of-sync applications simultaneously.

What's New in Turbo360 - AI agents for Azure cost optimization, Azure cost pulse summary report...

Turbo360 brings a suite of enhancements added to elevate your Azure management experience. Hit play to hear what's in store for this month. 00:00:00 - Intro 00:00:13 - Cost Pulse Summary Report 00:00:49 - Configuring Cost Pulse Summary 00:01:17 - New AI Agents (4 New Agents) 00:01:54 - Accessing AI Agents 00:02:18 - Related Resources Feature 00:02:40 - Budget Planner 00:02:59 - Setting Up Budget Planner Permissions 00:03:11 - Multi-Subscription Onboarding 00:03:43 - AI Agents Role-Based Access 00:04:10 - New RA-GRS Optimization Recommendation 00:04:30 - Summary & Call to Action.

Our key takeaways from NVIDIA GTC 2026

Every year, NVIDIA GTC offers a glimpse into the future of computing. But this year felt different. The conversations from the past few days point to something bigger than faster GPUs or larger models. The industry is shifting its mindset entirely. GTC 2026 made it clear that the goalposts for AI haven't just moved, they’ve been uprooted. We’re past the point of talking about "faster chips." Everything points to a total shift in the industry's DNA.

Agentic AI at Scale: Building the Kubex Agentic AI Platform

In the modern cloud infrastructure landscape, we don’t have a data problem; we have an actionable interpretation gap. Engineering teams are often drowning in metrics that describe a crisis without providing a clear path to remediation. Traditional FinOps, SRE, and DevOps work has become a reactive loop of dashboard-watching and manual firefighting.

How to Catch AI Code Mistakes Before They Reach Production

AI can write code fast, but it makes mistakes humans often don't. In this session, Ole Lensmar, CTO of Testkube, breaks down the real quality risks of AI-generated code and how engineering teams can build guardrails before those bugs hit production. What you'll learn: common mistakes LLMs make (and which ones are unique to AI). Whether you're a developer leaning on AI to ship faster or a QA lead trying to keep up with the pace of AI-generated code, this talk gives you a practical framework for staying ahead of quality issues.

Claude Code is running bash commands on your infrastructure. Here's how to watch it.

I’ve been staring at Claude Code telemetry for the past few weeks, and I keep noticing the same thing: most teams drop it into their environment, say “it’s amazing,” and have absolutely no idea what it’s actually doing at the system level. That’s fine for a personal dev tool. It’s not fine when you’ve rolled it out to 50 engineers.

Architecting MCP for AI Agents: Lessons from Our Redesign | Harness Blog

Key Takeaways: The Harness MCP server is an MCP-compatible interface that lets AI agents discover, query, and act on Harness resources across CI/CD, GitOps, Feature Flags, Cloud Cost Management, Security Testing, Resilience Testing, Internal Developer Portal, and more. The first wave of MCP servers followed a natural pattern: take every API endpoint, wrap it in a tool definition, and expose it to the LLM.

Claude Code + Lightrun MCP: Your AI Agent Now Has Live Runtime Vision

Claude Code, Anthropic’s coding agent, now integrates with Lightrun through MCP. AI code assistants have been flying blind. Google's 2025 DORA report found that this is causing an almost 10% increase in code instability. Even with up to 1M tokens of context available in Claude, this powerful agent cannot see how the code it writes actually behaves inside a live system under real traffic, real dependencies, and a load of 10,000 requests per second.

AI Assistant for Calico: Troubleshooting at the Speed of Thought

Despite the wealth of data available, distilling a coherent narrative from a Kubernetes cluster remains a challenge for modern infrastructure teams. Even with powerful visualization tools like the Policy Board, Service Graph, and specialized dashboards, users often find themselves spending significant time piecing together context across different screens.

What Engineers Want from AI in Observability... According to the 2026 Observability Survey Report

The results show strong interest in AI for forecasting, root cause analysis, onboarding, and generating dashboards, alerts, and queries. But when it comes to autonomous action, practitioners are more cautious — and 95% say AI needs to show its work to earn trust.

The Hidden Failure Points in Your AI Strategy

New models, new agents, new capabilities. It seems like every week there’s a new must-have AI function. It’s no surprise that leaders are feeling pressure to move quickly. At a PagerDuty on Tour event, a customer joked that they couldn’t fathom having a five-year AI strategy; it makes way more sense to have a five-minute one. There’s truth in that comment.

Buy vs Build in the Age of AI (Part 3)

In Part 1, we looked at how AI has reduced the cost of building monitoring tools. Then in Part 2, we explored the operational and economic burden of owning them. Now we need to talk about something deeper. Because the real shift isn’t just economic; it’s structural. AI isn’t just helping engineers write code faster. It’s accelerating the entire software ecosystem, including how monitoring tools are built, maintained, and trusted.

The Art of Prompting in AI Test Automation | Harness Blog

E2E Testing Has a New Bottleneck, and It's Not the Code. End-to-end (E2E) testing has always been the hardest part of a QA strategy. You're simulating real users, navigating real flows, validating real outcomes across browsers, environments, and data states that never hold still. Traditional test automation tackled this with scripts: rigid, deterministic sequences tied to element selectors and hard-coded values. They worked until the UI changed. Or the data changed.

What are test hooks in AI-native development?

Summary: A test hook connects a test or lint command to an event in your AI coding agent’s workflow. When the event fires, the agent runs the command automatically. If it fails, the agent’s action is blocked. You can wire your existing test commands into your agent’s lifecycle hooks to get deterministic local validation before code ever reaches CI. AI coding agents write code at a pace where stopping to manually run tests breaks your flow.
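The gating behaviour described in this summary can be sketched in a few lines of Python. This is a minimal illustration, not any particular agent's hook API: the event names (`after_edit`, `before_commit`), the `HOOKS` mapping, and the `HookResult` type are all hypothetical, and the commands are stand-ins for a real test or lint command.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class HookResult:
    allowed: bool   # True when the agent's action may proceed
    exit_code: int  # exit status of the test or lint command
    output: str     # combined stdout/stderr, usable as feedback to the agent


def run_test_hook(command: list[str], timeout_s: float = 120.0) -> HookResult:
    """Run one test/lint command and block the action if it fails."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A hung command is treated as a failure so the action stays blocked.
        return HookResult(allowed=False, exit_code=-1, output="hook timed out")
    return HookResult(
        allowed=proc.returncode == 0,
        exit_code=proc.returncode,
        output=proc.stdout + proc.stderr,
    )


# Hypothetical mapping of agent lifecycle events to commands.
# Real setups would point at an existing test runner or linter instead.
HOOKS = {
    "after_edit": [sys.executable, "-c", "print('lint ok')"],
    "before_commit": [
        sys.executable, "-c", "import sys; print('1 test failed'); sys.exit(1)"
    ],
}


def on_event(event: str) -> HookResult:
    """Fire the hook for an event; events without a hook are always allowed."""
    command = HOOKS.get(event)
    if command is None:
        return HookResult(allowed=True, exit_code=0, output="")
    return run_test_hook(command)
```

In this sketch the only signal is the command's exit status, which is what makes the check deterministic: the same failing test blocks the action every time, regardless of what the model generated.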

AppSignal's MCP Server: Connect AI Agents to Your Monitoring Data

Your AI coding assistant already knows your codebase. Now it can know your production environment too. AppSignal's MCP server gives AI agents and AI code editors direct access to your monitoring data — errors, performance metrics, and more — so they can help you debug, investigate and resolve issues without switching context. And with our new public endpoint, getting started is simpler than ever.

The silent infrastructure tax: why AI agents will break your legacy cloud

For the first time in a decade, humans are the minority on the open web. In 2025, automated traffic officially crossed the Rubicon to account for 51% of all web activity, while generative AI-driven referrals to retail sites surged by a staggering 693% year-over-year. As we move through 2026, these are no longer just "bot" statistics to be handled by a WAF. They represent a fundamental shift in user behavior. The fastest-growing segment of your audience is now agentic.

AI in observability in 2026: Huge potential, lingering concerns

The role of AI in observability is evolving rapidly, but the data from our fourth annual Observability Survey makes one thing abundantly clear: the potential is real, and so are the reservations. Practitioners overwhelmingly see value in using AI to help surface anomalies, forecast and spot trends, assist with root cause analysis, and get new users up to speed quicker.

Komodor Introduces Extensible, Autonomous Multi-Agent Architecture for AI-Driven Site Reliability Engineering

Out-of-the-box and bring-your-own AI agents that encode operational knowledge boost troubleshooting speed and accuracy across cloud native infrastructure. TEL AVIV and SAN FRANCISCO, March 18, 2026 - Komodor, the autonomous AI SRE company for cloud-native infrastructure, today announced a new extensibility framework that transforms its Klaudia AI technology into a universal multi-agent platform for troubleshooting and optimizing performance of complex cloud native infrastructures and applications.

How A Finance Director Found $30K/Month In AI Savings In 10 Minutes

A real workflow showing how Claude + CloudZero MCP turns plain-English questions into actionable cost intelligence: no dashboards, no tickets, no waiting. As Director of Finance and Accounting at a software company, my job can be described simply: Understand what we’re spending, who’s responsible, and whether we can get more efficient. But as anyone who’s had to wrangle AI costs knows, doing so for AI is anything but simple.

Engineers Want AI in Observability - With One Catch: 4th Annual Observability Survey by Grafana Labs

Actually useful AI is welcome in observability. AI for the sake of AI is not. In this overview of Grafana Labs’ 4th annual Observability Survey, Marc Chipouras shares what 1,300+ respondents from 76 countries told us about the current state of observability and what comes next. This year’s survey explores four major themes. The results show strong interest in AI for forecasting, root cause analysis, onboarding, and generating dashboards, alerts, and queries. But when it comes to autonomous action, practitioners are more cautious, and 95% say AI needs to show its work to earn trust.

Flow State in an AI Workplace - Digital Friction 1:1 with Mike Lovewell

Tom welcomes Mike Lovewell to explore how digital friction continues to shape the modern workplace. From early days of low awareness to today’s complex, AI-influenced environments, Mike shares how friction has evolved in scale rather than cause. They discuss the growing importance of flow state, the measurable business impact of small disruptions, and why adoption—not just technology—is the key to success. AI emerges as both a solution and a new source of friction, depending on trust and usability.

How agentic AI for ITOps overcomes observability tool gaps

As enterprise ITOps teams monitor increasingly complex, cloud-based, containerized systems, traditional observability practices are struggling to keep up. As IT infrastructure complexity increases, the typical response is to layer on more monitoring, logging, and instrumentation.

How Local-First AI Agents Are Reshaping IT Operations Automation

IT operations teams have spent the last decade embracing automation - from auto-scaling rules and CI/CD pipelines to AIOps platforms that correlate alerts across sprawling infrastructure. Yet a fundamental tension remains unresolved: the most powerful AI automation tools require you to route sensitive operational data through external cloud services you do not control.

My Room Still Looked Wrong - Until I Tried an AI Home Design Generator

I didn't expect much when I first tried an AI Home Design Generator and an AI Image to Image Generator. At that point, I wasn't trying to redesign anything seriously. I just knew my room looked... off. Not terrible, just never quite right. Every time I took a photo, something felt wrong - the layout, the lighting, maybe both.

From Data Chaos to Results: The New Data Strategy for the Agentic Era

The world is generating data at a pace that defies the human ability to draw insights and comprehend. By 2028, we’ll reach almost 400 zettabytes of global data—with over 55% of it coming from machines talking to machines. For enterprises, this isn’t just a storage problem; it’s an existential challenge.

Knowledge Graphs: The Backbone of AI-First Software Delivery | Harness Blog

Key Takeaways: AI can generate code in seconds. It still can’t ship software safely. That gap isn’t about model quality or prompt engineering. It’s about context, and most software organizations don’t have a system that accurately reflects how pipelines, services, environments, policies, and teams actually relate to each other. Without that context, AI doesn’t automate delivery. It amplifies risk.

Securing AI and Securing With AI: AI Security from Code to Runtime With Harness | Harness Blog

AI is changing both what you build and how you build it - at the same time. Today, Harness is announcing two new products to secure both: AI Security, a new product to discover, test, and protect AI running in your applications, and Secure AI Coding, a new capability of Harness SAST that secures the code your AI tools are writing.

5 AI And Cloud Cost Problems That Are Now Everyone's Problem

Not long ago, cloud cost was an engineering problem. FinOps teams owned it, finance leaned in occasionally, and everyone else stayed out of it. Now, that’s changed. AI changed who has skin in the game. CFOs get asked about it in board meetings. CEOs field questions on earnings calls. The audience for cloud cost management has exploded — and that means the conversation CloudZero is built to enable isn’t only a technical one, it’s a business one.

Fair Source Software in the AI age

Have you noticed AI recently? Yeah, us too. Generative AI is wreaking havoc on the software status quo, and that includes licensing, and that generates … opinions. Sentry has a long history of having opinions about software licensing. We started life as an unlicensed side project in 2008, then went through BSD, to BSL, to writing our own license, FSL.

The hidden reliability risks in your agentic AI workflows

Artificial intelligence recently took a major leap from “saying” to “doing.” Instead of simple back-and-forth chats, we’re now allowing automated AI processes to take action on our behalf—from responding to emails to building and deploying complete applications. This shift from “assistant” to “actor” can make applications more capable, but it also creates additional failure modes.

Announcing the 2026 State of AI-First Operations Report

For years, our annual State of Digital Operations report has been the industry benchmark for understanding how organizations manage incidents, build resilience, and evolve their operational practices. Each year, we survey hundreds of business and operations leaders worldwide to capture the challenges, priorities, and emerging practices shaping digital operations.

The next wave of AI: Balancing innovation with sovereignty

This blog is based on the webinar, “AI panel: The next wave of AI technology”. The pace of AI innovation is reshaping research, business, and everyday life. However, as breakthroughs in Large Language Models (LLMs) and high-performance computing accelerate, they bring new technical challenges around scale, efficiency, and reliability.

Re-Inventing Network Operations: Are AI Extensions the Right Path?

For decades, telecom network operations have depended on traditional OSS tools – complex, services-heavy platforms that take years to modernize and even longer to deliver measurable business impact. This year at MWC, the leading OSS vendors showcased a variety of new AI extensions for their portfolios and marketed them as the fastest path to autonomous network operations. They are not.

Event Intelligence for Agentic IT Operations

Modern IT teams are experimenting with AI agents. But individual agents working in isolation are not enough. To truly achieve Agentic IT Operations, organisations need a platform that coordinates, governs, and contextualises AI-driven actions across the entire IT landscape. That’s where Interlink Software comes in.

AI Merge Conflict Resolution + Commit Messages in GitKraken Desktop

AI-assisted merge conflict resolution is changing how developers handle Git workflows. Watch GitKraken Ambassador Kevin Bost demonstrate AI-powered features that eliminate merge conflict dread, clean up messy commit history, and generate contextual commit messages in seconds.

Incident Response Reimagined: Accelerating Resolution with AI Agents

Learn how PagerDuty is leveraging Agentic AI to transform the incident lifecycle from reactive firefighting to proactive prevention. Manuel Reis, Software Developer at PagerDuty, demonstrates how new tools like the SRE Agent and Scribe Agent assist engineers during high-pressure outages by autonomously triaging alerts, querying logs in tools like Grafana, and transcribing context directly into incident channels.

Prompt, Deploy, Pray Is Dead: Validating AI Code with Proxymock

Recent outages tied to AI-assisted code changes have pushed companies into a corner. After several incidents with massive “blast radius” impacts, organizations like Amazon introduced stricter controls—mandating that senior engineers manually review all AI-generated code before it hits production. That response makes sense on paper, but it exposes a fatal flaw in the modern development pipeline.

EV Fleets Don't Fail on the Road. They Fail in the Workflow. Agentic AI Fixes That.

You spent the last decade obsessing over connectivity. You bought into the hype that ‘data is the new oil.’ You fitted your entire fleet with sensors and built massive dashboards to track everything from battery cell temperature to tire pressure. The mission was simple: Capture every metric. Congratulations, you succeeded. You are now drowning in terabytes of data. But here is the hard truth: Data without action is just expensive noise.

Test your AI model training reliability, too

Training is at the heart of every LLM model, but it’s still an application running on an infrastructure, which means it can fail. Our GPU test helps you test your training GPUs so you don’t lose that valuable work. TRANSCRIPT: One of the things we built recently was the GPU Gremlin. So if you are training a bunch of models and you're doing a bunch of GPU testing. You know, we want to give you the tools to be able to go test that, to understand how training the model could fail.

Digital Adoption + AI: The Secret Route to Zero Tickets

Generative AI has the potential to transform workplace productivity – but do organizations know how to deliver on that promise? New research shows that employees who use generative AI tools engage with them up to ten times per day, spending over three hours per week interacting with AI at work. And yet within the same organizations, large groups of employees have never meaningfully engaged with these tools at all.

MCP and A2A: What They Are and Why They Matter for Autonomous IT

MCP and A2A are the two protocols that make agentic AI governable at enterprise scale. One controls how agents use tools, and the other controls how agents work together. AI in the enterprise is no longer confined to chat windows. It’s operating inside incident queues and automation pipelines. Increasingly, teams are using AI agents to take action: detecting incidents, executing remediations, updating tickets, coordinating across systems.

From signals to savings: Optimizing cloud costs with Grafana Assistant and MCP servers

In today's cloud-native environments, managing resource waste and optimizing costs can feel like a constant battle. Operators, along with their fearless FinOps teams, spend countless hours hunting down unused resources, deciphering complex telemetry data, and manually implementing code or configuration changes to try to reduce cloud costs. But what if you could automate the entire process, from identifying waste to implementing the fix, all based on actual production telemetry?

5 Ways You Can Improve Your Shipping Operations

No business can be truly successful if it has not optimised its shipping operations. In fact, without optimisation, this facet of your organisation can cost you valuable resources such as time and money. With that in mind, check out our suggestions below on how you can improve the shipping operations in your organisation.

How Long Does Deep Research Take? We Timed 5 Tasks With & Without AI

How long does deep research take? That's a million-dollar question if you've ever lost a weekend to digging through sources for a report. You already know the pain of hours of searching, reading, and synthesizing, only to wonder if you missed something crucial. We gathered experiment data comparing traditional research methods against modern AI tools across five common professional tasks. The exact time savings we measured might surprise you, and they reveal how AI is quietly redefining what it means to be a deep researcher.

How to Reduce MTTR with AI-Powered Runtime Diagnosis

Reducing Mean Time to Resolution (MTTR) in production systems requires understanding failure behavior in real time. While AI code agents have significantly accelerated software development and deployment, incident resolution has remained constrained by incomplete pre-captured telemetry. AI SRE tools improve signal correlation, but MTTR reduction requires runtime-verified diagnosis that confirms execution behavior directly in production systems.

Evaluating Observability Tools for the AI Era

Every observability vendor has an AI story right now. Most have an MCP. Many have a chatbot. All have a demo where the AI finds the root cause of an incident in thirty seconds and everyone in the room nods. In the context of a public demo, these tools look almost identical. Ask the AI a question, the tool returns an answer, and the engineer fixes the bug. Impressive. But if you buy based on the demo, you may end up with an AI layer that looks great on a call and disappoints in production.

The Hidden Cost of AI Productivity: When Efficiency Turns Into "Brain Fry"

A new HBR study reveals that the race to build and manage AI agents may be pushing knowledge workers toward a new form of cognitive overload. If you spend any time on LinkedIn these days, you’ve probably seen the same type of post over and over. Someone proudly announces they built an AI agent that now writes their emails, analyzes data, drafts presentations, and maybe even ships code.

How Developers Build a Meaningful Career in the Age of AI

What does a meaningful developer career look like in the age of AI? We brought together four experts to answer exactly that. In this GitKon panel, GitKraken CMO Kate Adams moderates a conversation with Leon Noel (Managing Director of Engineering, Resilient Coders), Danny Thompson (Director of Technology and host of The Programming Podcast), Maggie Hunter (Recruitment Lead, GitKraken), and Dimitry Fonarev (CEO, Testkube) to explore how software engineers can future-proof their careers, grow their skills, and navigate an industry that is changing fast.

Why Generic AI Fails in Ops: What Trustworthy Actually Requires

Enterprise operations reached a point where complexity outpaced human interpretation and outgrew the capabilities of generic AI. As environments became more distributed and interdependent, every incident, anomaly, and degradation produced ripple effects across systems that require context, lineage, and reasoning. Yet most AI models were not built for this reality. They were trained for general knowledge tasks, not the deeply connected operational truths that define enterprise performance.
Sponsored Post

Runtime Validation vs Static Analysis: Why You Need Both

Runtime validation does not replace static analysis. They solve different problems. Static analysis catches structural defects in code before it runs. Runtime validation catches behavioral failures by testing code against real production traffic. Enterprise teams adopting AI coding tools need both layers because AI-generated code introduces a new class of defects that neither layer catches alone. According to CodeRabbit's State of AI vs Human Code Generation report, AI-generated pull requests contain roughly 1.7x more issues than human-written ones. Many of those issues pass static checks cleanly.

AI Coding Agents Have a UX Problem Nobody Wants to Talk About

The pitch was simple: let AI write your code so you can focus on the hard problems. Three years into the AI coding revolution, and developers are focused on hard problems alright, just not the ones anyone expected. Instead of designing systems and solving business logic, engineers in 2026 spend a startling amount of their day managing the AI itself. Should you use Fast Mode or Deep Thinking? Haiku or Opus? Cursor or Claude Code or Windsurf? Should you write a SKILL.md file or a custom system prompt?

Claude outage analysis: What happened on March 11

On March 11, 2026, users around the world began reporting problems with Claude, including login failures, API errors, and stalled responses. While the disruption did not affect every user, reports quickly showed that the issue was widespread. StatusGator began receiving outage reports at 13:56 UTC. Using its Early Warning Signals system, StatusGator detected the growing incident at 14:22 UTC. The provider officially acknowledged the outage later at 14:44 UTC.
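The gaps between these timestamps are easier to compare once they are worked out explicitly. The short Python sketch below does only that arithmetic on the three UTC times quoted above; the helper and variable names are illustrative and not part of any StatusGator API.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM time on March 11, 2026 as a timezone-aware UTC datetime."""
    parsed = datetime.strptime(f"2026-03-11 {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


# Timestamps taken from the timeline in the text.
first_report = utc("13:56")   # first user outage reports received
detected = utc("14:22")       # incident detected by Early Warning Signals
acknowledged = utc("14:44")   # provider's official acknowledgment

report_to_detection = minutes_between(first_report, detected)
detection_lead = minutes_between(detected, acknowledged)
report_to_ack = minutes_between(first_report, acknowledged)

print(f"first report -> detection:      {report_to_detection} min")  # 26 min
print(f"detection -> acknowledgment:    {detection_lead} min")       # 22 min
print(f"first report -> acknowledgment: {report_to_ack} min")        # 48 min
```

On these figures, detection came 22 minutes before the official acknowledgment, and 48 minutes passed between the first user reports and that acknowledgment.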

Why Your NOC Will Ignore AI

Imagine you are driving to work and a yellow check engine light flickers on your dashboard. The car feels fine. It accelerates normally, there is no strange noise, and the temperature gauge is steady. What do you do? If you are like most people, you keep driving. You might make a mental note to look at it later, but you don't pull over on the highway and call a tow truck.

The bare metal problem in AI Factories

As AI platforms grow in scale, many of the limiting factors are no longer related to model design or algorithmic performance, but to the operation of the underlying infrastructure. GPU accelerators are key components and are responsible for a large part of the total system cost, which makes their continuous availability and stable operation critical to the output and efficiency of the entire AI platform.

What is Ambient AI in Healthcare? Revolutionizing Clinical Care, Efficiency, and Outcomes

You probably use ambient AI every day without even knowing it. When your Apple Watch is telling you to stand up after sitting too long, your CGM recommends you eat a snack, or even when your smart home lights dim around the time you go to bed, every night…that’s ambient AI. Among other things, ambient AI is there to help you stay healthy, tracking what you do in the background and making decisions based on your previous actions and preferences.

MCP vs. CLI for AI-native development

Summary: The CLI vs. MCP question is really a question about where you are in the development loop. CLIs fit the inner loop: fast, local, zero overhead. MCP servers fit the outer loop: external systems, shared infrastructure, structured access. Most teams need both. AI has put a new kind of scrutiny on developer tooling. When a developer works alongside an AI coding assistant, the tools that assistant can reach, and how it reaches them, directly affect the quality and speed of the work.

Buy vs Build in the Age of AI (Part 2)

In Part 1, we explored how AI has dramatically reduced the cost of building monitoring tooling. That much is clear. You can scaffold uptime checks quickly, generate alert logic in minutes, and set up dashboards faster than most teams used to schedule the kickoff meeting. So the barriers to entry have fallen. But there’s a quieter question that rarely gets asked in the excitement of building. Have you ever calculated what it would actually cost to replace your monitoring provider?

Unleashing Resilience: Why the Agentic Era Demands a Unified Data Fabric

Imagine starting your day with a dozen disconnected apps where your calendar does not sync with your reminders, your maps do not know your appointments, and your contacts are not linked to your messages. You would constantly be scrambling, missing key details, and reacting late to what matters most. In our personal lives, we depend on tight integration to keep pace with the world. In business, the stakes are even higher.

The future of Search is here: Faster, simpler, AI-driven

Do more with less. That’s the mandate we’re all hearing. AI has fundamentally changed how we work. Modern AI workloads generate 10-100x more queries than humans ever could, pushing legacy architectures past performance limits. And the audacity of it all? Legacy logging vendors continue to raise costs without delivering meaningful innovation. IT and security teams are still forced to choose between speed and retention. Investigations are still slow. Data onboarding is still painful.

The Rise of AI App Builders in Agile Development Environments

Modern software development moves quickly. Businesses need to test ideas, release updates, and respond to customer feedback faster than ever before. Agile development methods were created to support this need for speed and flexibility. In recent years, a new type of tool has begun to support these processes even more. An AI app builder helps teams create applications with less manual coding by using artificial intelligence to assist with design, development, and testing tasks.

The Evolution of Vocal Removal Technology in Music Production

Music production has always been shaped by technological innovation. From the early days of analog recording to the modern era of digital audio workstations, every advancement has changed the way artists create, edit, and experience music. One particularly fascinating development in this journey is the evolution of AI Music Generator vocal removal technology. Once a complicated and imperfect process, removing vocals from a track has gradually transformed into a highly accurate and accessible capability used by producers, DJs, musicians, and even casual music enthusiasts.

How Techdome accelerates AI-led product delivery with Civo Kubernetes

Accessing cloud infrastructure shouldn’t slow down product innovation. Yet for many engineering teams building AI-driven platforms, traditional hyperscalers often introduce unnecessary complexity, high costs, and slow provisioning cycles. At Civo, we’ve seen a different approach emerge. Our cloud platform enables teams to move faster with Kubernetes, compute, and networking designed for simplicity and speed.

The data context gap: an evaluation guide for agent-ready infrastructure

Why do AI agents that look brilliant in a sandbox fail the moment they hit production? For platform leaders, the answer is a lack of environmental parity: the ability to interact with the exact data state and service topology where the actual bugs live. When an agent attempts to modify a schema, optimize a query, or reproduce a bug without access to the real-world data state, it hits the Data Context Gap.

When Your Plant Talks Back: Conversational AI with InfluxDB 3

No one wants to stare at a plant and guess if it needs water. It’s much easier if the plant can say, “I’m thirsty.” A few years ago, we built Plant Buddy using InfluxDB Cloud 2.0. The linked article is still a great guide for cloud-first IoT prototyping as it shows how quickly you can connect devices, store time series data, and build dashboards in the cloud with the previous version of InfluxDB. But this time, the goal was different.

Context is the New Currency: Building a Context-aware Enterprise with Agentforce

Corporate investment in Generative AI is outpacing value realization. While Large Language Models (LLMs) possess vast general reasoning capabilities, they suffer from a critical blind spot: they are pre-trained on the public internet, yet completely blind to your enterprise reality. This context gap renders even the most advanced models ineffective, forcing them to guess (hallucinate) rather than reason based on your specific business rules.

How AI Agents Communicate: Understanding the A2A Protocol for Kubernetes

Since the rise of Large Language Models (LLMs) like GPT-3 and GPT-4, organizations have been rapidly adopting Agentic AI to automate and enhance their workflows. Agentic AI refers to AI systems that act autonomously, perceiving their environment, making decisions, and taking actions based on that information rather than just reacting to direct human input.

The architecture advantage: Why the data layer decides the AI race

Dozens of startups are sprinting to build the next “agentic SIEM” that can autonomously detect, investigate, and respond to threats. They’re well-funded, well-marketed, but structurally hollow. Here’s what it usually looks like: an LLM layer on top of a thin orchestration engine on top of fragmented or customer-hosted data lakes. While it looks impressive in a demo, it quickly falls apart in production. Why? It’s not built on a strong foundation.

GitKraken Explains: How AI is Changing Your Commit History

AI commit message generation is fast, accurate, and consistent. It's also missing the most important thing: the why. AI-assisted Git workflows can summarize a diff in seconds, but they optimize for description, not decision-making. In this video, we break down what AI commit messages do well, where they fall short, and how to use them without quietly erasing the context future teammates (and future you) actually need.

Root Cause Analysis in Software Testing: Methods, Techniques, and How AI Is Changing the Game

If you've ever fixed a bug only to watch it come back two weeks later, you already understand why root cause analysis matters. Patching symptoms feels productive - it's not. Getting to the actual cause is what prevents the same issue from eating your team's time over and over again. This guide covers everything you need to know about root cause analysis (RCA) in software testing: what it is, how to do it, which tools help, and where AI is taking it next.

Meet the new Cribl Search: Faster investigations with AI

Get a quick look at the new Cribl Search experience, built to help teams investigate faster, onboard data easily, and get answers from their logs without complex query languages. In this quick overview, we show how Cribl Search helps you move from raw data to insights in minutes. The result? Faster investigations, simpler workflows, and powerful AI-assisted analysis across your telemetry. Learn how the new Cribl Search makes exploring and analyzing data easier for everyone, from experienced analysts to teams just getting started.

What is AI really going to bring to the table when it comes to migration?

Explore the real capabilities and limitations of AI in system and SIEM migrations. Learn where AI accelerates processes and where human review remains essential. Additional Resources: About Elastic Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale. Elastic’s solutions for search, observability, and security are built on the Elastic Search AI Platform — the development platform used by thousands of companies, including more than 50% of the Fortune 500.

How AI-Powered Wellness Platforms Are Reshaping HR and Employee Well-Being

As hybrid work continues to redefine how organizations operate, companies are increasingly turning to artificial intelligence to support not only productivity but also employee well-being. Businesses are realizing that technology can play a major role in protecting the mental and physical health of their teams while also strengthening overall organizational performance.

Four ways engineering teams use the Datadog MCP Server to power AI agents

Since the Datadog Model Context Protocol (MCP) Server first launched in Preview, Datadog has experienced an overwhelming amount of interest and feedback from customers. We appreciate those who requested access to test our product, provided feedback, and shared their stories of how the MCP Server helped them overcome engineering challenges.

You Bought the AI Licenses. Why Is Only One Developer Getting 10x Results?

Here's something nobody talks about at the AI strategy meetings. Your organization just spent six figures on Cursor licenses, Claude seats, and Copilot subscriptions. Ninety percent of your engineers have access. By most internal measures, the rollout was a success. But somewhere on your team, one developer is running circles around everyone else.

Create a Custom Service Health Board With the Honeycomb MCP

Your software is sending data to Honeycomb. Now where is the dashboard you want? The best dashboard is one created just for your application, or your service, or your team. You can get that in minutes with the Honeycomb MCP. Open your coding agent in your IDE, or on the command line in your code repository. Configure the Honeycomb MCP and authenticate with Read and Write permissions. Now tell it what you want. You can be high-level: Make me a service health board for the frontend service.

Seedance 2.0 vs Traditional Production: Is AI Finally Production-Ready?

Every few years, a new tool appears that forces the creative industry to pause and reassess its assumptions. In 2026, that conversation is happening again, this time around AI video. The question is no longer whether AI can generate impressive demo clips. That phase is over. The real question is far more consequential.

AI for Operations Teams: Using Legal Awareness to Reduce Risk and Improve Decision-Making

Operations teams sit at the center of most organizations. They coordinate processes, manage vendors, support compliance requirements, and ensure that day-to-day activities run smoothly. While their role is often associated with efficiency and logistics, operations professionals increasingly find themselves interacting with another critical area: legal documentation.

AI Systems Status Report - February 2026

This report covers the operational status of major AI systems during February 2026, including Anthropic, Cohere, DeepSeek, Google Gemini, Groq Cloud, OpenAI, Perplexity, Replicate, and xAI. The data includes official incidents reported on vendor status pages and unconfirmed incidents detected through IsDown's monitoring systems.

Avoiding Common Mistakes When Using AI Content Tools

AI writing tools are everywhere. They're fast, affordable, and impressively capable. But somewhere between "generate" and "publish," things go sideways for a lot of people. The problem isn't the technology itself. It's how people use it. Hand someone a power drill, and they can build a deck - or put a hole through a water pipe. Same tool, wildly different outcomes. Most mistakes with AI writing tools are preventable. This article breaks down the biggest ones and shows you how to sidestep them before they cost you traffic, credibility, or both.

Webinar recap: FinOps In The AI Era - A Critical Recalibration

In March 2026, CloudZero’s Ben Austin, Director of Product Marketing, sat down with Ray Rike, Founder and CEO of Benchmarkit, to walk through findings from FinOps in the AI Era: A Critical Recalibration, a joint survey of nearly 500 organizational leaders on how they’re managing or, rather, struggling to manage AI costs.

AI at Superhuman (before it was cool) feat. Loïc Houssier

What does it actually look like to build an AI-native product and lead an engineering team through the AI era when you've been doing it longer than most? Rob Zuber sits down with Loïc Houssier, CTO at Superhuman, to talk about what it meant to be an AI company before AI was everywhere, and how that early foundation shapes the way they build, ship, and think today.

Why the AI market is shifting

The AI revolution is getting expensive. Ben Norris (AI Engineer at Civo) breaks down a staggering statistic: AI token usage has jumped from 9.8 trillion to 1.3 quadrillion in just under two years—a 130x increase. As businesses scale, the "closed source" premium is becoming a bottleneck. Watch as Ben explains why enterprises are turning toward democratized, open-source AI and smaller vendors like relaxAI to maintain power at a fraction of the cost.
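The growth figure quoted above can be checked with a few lines of arithmetic. This is only a sanity-check sketch: the two token counts come from the excerpt, while the function and variable names are illustrative assumptions, not anything from the video.

```python
# Token counts as quoted in the excerpt (short-scale units).
TRILLION = 10**12
QUADRILLION = 10**15

start_tokens = 9.8 * TRILLION      # earlier usage figure
end_tokens = 1.3 * QUADRILLION     # later usage figure


def growth_multiple(start: float, end: float) -> float:
    """Return how many times larger `end` is than `start`."""
    return end / start


multiple = growth_multiple(start_tokens, end_tokens)
print(f"{multiple:.1f}x")  # prints "132.7x"
```

The result is consistent with the rounded "130x" in the excerpt.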

Harness AI + MCP server: A Single Prompt to Accelerate the Software Development Lifecycle

Pipeline Creation: Using a single prompt in the IDE, a CI/CD pipeline is created and triggered via the agent connected to the Harness MCP server. Failure Diagnosis and Fix: When the pipeline fails, the agent is used to diagnose the issue (a failed dependency) and propose a fix, which is then committed, pushed, and the pipeline re-triggered to succeed. Deployment: After a successful build, the artifact is deployed into a Kubernetes cluster. Incident Response.

How Autonomous Are Your IT Operations, Really?

This post introduces a six-level maturity model that defines what true autonomy looks like in IT operations, from basic AI chat interfaces to fully coordinated agent ecosystems. ITOps teams have more automation tooling than ever, and yet incident response still depends heavily on human judgment to hold it together. Alerts fire, engineers dig through dashboards, context gets assembled by hand, and someone at the end of the workflow makes the final call.

What is Agentic Observability?

Agentic observability is the instrumentation and correlation needed to explain and control agent behavior across multi-step workflows. Legacy observability focuses on runtime health and service behavior. You monitor metrics like CPU usage, memory, latency, and error rates to confirm that applications and infrastructure are functioning as expected. When a workflow degrades, the proximate cause is often a crash, timeout, permission error, or resource constraint.

GPU Fragmentation Is Killing AI Economics

By 2026, the GPU shortage isn’t a supply-chain hiccup anymore. It’s baked into the system. Even after pouring billions into CapEx, most enterprises still want 40% more GPU capacity than they actually have. And it’s not because they’re chasing moonshots. Technology companies are training foundation models while serving inference for millions of users on the same clusters. AI labs are juggling fine-tuning, evaluation, and real-time experimentation side by side.

Top 12 AI and LLM Observability Tools in 2026 Compared: Open-Source and Paid

Artificial intelligence has moved far beyond experimentation. In 2026, AI systems are embedded into customer support workflows, clinical decision support tools, fraud detection engines, and internal copilots across nearly every industry. Adoption is accelerating quickly. According to McKinsey, 23% of organizations are already scaling agentic AI systems, while another 39% are actively experimenting with them. Yet the path to reliable production AI remains uncertain.

How AI-Powered ATS Systems Are Transforming Modern Recruitment

Recruitment has changed dramatically over the past decade. Companies are no longer relying on manual CV screening and gut-feel interviews. Instead, AI-powered Applicant Tracking Systems (ATS) are reshaping how organizations hire - faster, smarter, and with less bias.

Your Questions About AI-Assisted Development Answered

We recently hosted a webinar on AI-assisted development with DORA, and the audience had a lot of questions—far more than we could get to in an hour. I picked out six that get at the stuff people are wrestling with day to day. These aren't the easy questions, and I don't think there are necessarily easy answers, but I've spent the past year building and shipping with AI coding tools and observing (literally) what happens when that code hits production. Here's what I have.

AI-ready sovereignty playbook 2026: how to run gen-AI workloads (ethically) in the EU

Sovereignty is a concept that has shown nuances in the way it is currently used by states and industry to describe some services. The term “strategic autonomy” has also been used to describe the need for governments to ensure that they have a hand on the full value chain (or at least know the gaps and accept the risks) and can apply their rules while it sits within their jurisdiction (autonomy derives from the Greek autos (self) and nomos (rule)).

What Is LLMjacking? The New AI Cybercrime Stealing Cloud AI Compute

LLMjacking is a new cybercrime where attackers steal access to cloud-hosted AI models and use them for free — while the victim pays the bill. In this video, we break down what LLMjacking is, how attackers exploit compromised credentials and exposed APIs, and why security teams should treat AI infrastructure as a high-value attack target. Discovered by the Sysdig Threat Research Team, LLMjacking is quickly becoming the AI-era equivalent of cryptojacking — except instead of mining cryptocurrency, attackers run expensive large language models (LLMs) at scale.

Meet the new Bits AI SRE: Deeper reasoning, twice as fast

When we announced Bits AI SRE at DASH 2025, we introduced an autonomous SRE agent that investigates alerts the moment they trigger. Bits AI SRE reads the same telemetry data as your team, understands your architecture, and follows your runbooks to identify likely root causes before you even open your laptop. It’s your AI teammate that’s always on call.

How AI lets you talk to your company's data and get answers instantly

In this conversation recorded at Elastic’s New York office, three product leaders discuss how AI agents are transforming enterprise software. The discussion features Steve Kearns (general manager, Search solutions at Elastic), Mike Nichols (general manager, Security solutions at Elastic), and Baha Azarmi (general manager, Observability at Elastic). They explain how Elastic Agent Builder allows teams to interact with their data using natural language instead of complex queries.

How LLMs can help boost productivity

Learn how large language models (LLMs) are transforming productivity in business, coding, research, and daily workflows. Discover practical ways to use AI tools to automate tasks and improve efficiency.

Scaling AI Workflows With Proxy Infrastructure

AI workflows require consistent access to diverse data sources to maintain accuracy. How do teams guarantee that their systems do not go dead when rate limits are reached? The scaling of these processes is based on a stable connection layer that eliminates interruptions during retrieval. Writers are likely to have difficulties with their automated scripts triggering blocks on social sites. This article discusses the process of establishing a trustworthy machine learning and automation environment.

The Future is Faceless: Why Stock Footage is Dying in 2026

Remember the last time you searched for "diverse business team laughing at laptop" on a stock footage site? You scrolled past the same forced smiles, the same generic office backgrounds, and the same overacted "eureka" moments that have been circulating for a decade. Then you paid a subscription fee for the privilege of looking like every other brand on the planet. That era is ending. In 2026, stock footage is dying, not because we need fewer visuals, but because creators have finally found something better: total creative freedom without the cheesy middleman.

We Turned Our WireShark Wizard Into a Markdown File

Rocky AI — Checkly’s AI agent — is now Generally Available. We developed Rocky AI over the last ~6 to 8 months. This is an aeon in AI-years. During this period, we learned a ton. About AI, but mostly about how to fit them into an existing SaaS product, not just another chat widget. This is my ramble…

How to Build AI-Native Security Resilience (And Finally Get Developers And Security On The Same Team)

Developers and security professionals have struggled to get on the same page for what seems like forever and AI is only making that divide larger, according to results from our State of AI-Native Application Security 2025 research report.

Hot Takes: What the AI Hype Gets Wrong About Software Engineering Excellence

Ahead of the DevOps Modernization Summit, Matthew Skelton, CEO & CTO of Conflux shares his takes on output-driven AI, how DORA metrics aren’t enough, and why governance and compliance must be built into the platform. Matthew Skelton is the CEO & CTO of Conflux and a featured speaker at this year’s DevOps Modernization Summit. Ahead of our annual summit, Matthew has shared his hot takes on AI, DORA, and the key to successful automation.

7 Real Ways to Modernize NetOps with Kentik AI Advisor

Kentik’s AI Advisor acts as a virtual network engineer, helping teams of all skill levels troubleshoot, manage, and optimize their infrastructure with unprecedented speed and context. We explore seven practical NetOps use cases, from rapid incident triage and capacity planning to upcoming live-device command support, that demonstrate how using AI as a collaborative teammate dramatically reduces manual investigative work.

Skills vs. MCP: You're probably reaching for the wrong one

Everyone is adding Model Context Protocol (MCP) servers to everything right now. And I get it. MCP is clean. It’s standardized. You write a server, expose some tools, and suddenly your LLM can query your log platform, pull a dashboard, and fire an alert. It feels like the right abstraction. But I’ve watched teams at serious companies burn weeks building MCP integrations for workflows that should have been skills, and build skills for things that genuinely needed MCP.

Inside Pandora's Box: How CloudZero AI Hub Cracks Cloud Cost Intelligence

Years in the FinOps trenches taught me one thing: The data has never been the problem. The data exists. It’s out there, scattered across provider invoices, buried in tagging gaps, locked behind dashboards that maybe three people in your org actually know how to navigate. The real problem? Nobody can get to it when they need it. Engineers ship features without understanding what they cost the business, let alone whether they improved margin.

How does AI enhance search?

Explore how artificial intelligence enhances search engines through semantic understanding, vector embeddings, and contextual retrieval. Learn how AI-powered search delivers faster and more accurate results.

AI SRE in Practice: Enabling Non-Experts to Troubleshoot Kubernetes

Kubernetes troubleshooting traditionally requires deep platform expertise. Understanding pod lifecycle, decoding error messages, correlating events across resources, and identifying root cause all demand experience that takes years to build. This expertise gap creates a bottleneck where only senior engineers can handle production issues, limiting how quickly teams can resolve incidents.

Buy vs Build in the Age of AI (Part 1)

A few months ago, I spoke to an engineering manager who proudly told me they had rebuilt their monitoring stack over a long weekend. They’d used AI to scaffold synthetic checks. They’d generated alert logic with dynamic thresholds. They’d then wired everything into Slack and PagerDuty, and built a clean internal dashboard. “It used to take us weeks to prototype something like this,” they said. “Now it’s basically instant.” They weren’t wrong.

Introducing Rocky AI to General Availability

After months of being available in Beta for our app users, Rocky AI is now generally available to all users and plans. Rocky AI is Checkly’s AI agent that works around the clock, 24/7, to make sure your application’s reliability is optimal. In this first release, Rocky AI ships with the ability to run continual Analysis on test and check failures, giving your teams AI-powered root cause analysis, impact analysis, and more.

Claude Code Security Launch Triggers Cybersecurity Industry Reassessment

On February 20, 2026, Anthropic launched Claude Code Security, an AI-based tool to scan codebases, identify security weaknesses, and provide patching solutions. The Claude Code preview caused a panic that resulted in billions in lost market capitalization among cybersecurity stocks. CrowdStrike shares decreased by 8%, reaching approximately $388.87, while Okta experienced a 9.2% decline and Zscaler saw a 5.5% drop in its stock price. That demonstrates the increasing investor anxiety about AI technology developments that threaten to disrupt established cybersecurity frameworks.

Did ChatGPT take down Claude?

On March 2, 2026, Claude experienced a widespread service disruption that affected users across North America, Europe, Asia, and Australia. The outage quickly drew significant media attention, with numerous technology news outlets reporting on user frustration and downtime. In the early hours of the incident, some commentators speculated that the disruption may have been caused by a sudden influx of new users migrating from OpenAI. However, there is no public evidence confirming that theory.

Responsible transformation: Agentic AI for the public sector

The world is transforming, and artificial intelligence, especially agentic AI, is quickly becoming embedded across private and public sectors. For government agencies, law enforcement, and mission-critical organizations, embracing this new reality is uniquely challenging. On the one hand, agentic AI promises measurable improvements: modernized IT workflows, faster analysis, improved citizen services, and operational efficiency.

CloudZero Launches Claude Code Plugin To Bring Cost Intelligence Into Engineering Workflows

Today we’re announcing the CloudZero Claude Code Plugin, a new capability that puts CloudZero’s full cost intelligence model directly inside Claude Code, where engineers and technical FinOps practitioners already work. The plugin connects a Model Context Protocol (MCP) server and nine pre-packaged investigation skills to CloudZero’s cost data, covering cloud and AI spend across AWS, GCP, Azure, Snowflake, MongoDB, OpenAI, Anthropic, and more.

Enabling Proactive ITOps with Skylar Advisor

By continuously connecting signals across your IT environment, Skylar Advisor turns operational complexity into clear, prioritized guidance. It highlights potential impact, explains why it matters, and delivers clear next steps so IT teams can act early and stay ahead of alerts before they turn into issues.

When was the term artificial intelligence coined?

Discover when the term artificial intelligence was first introduced and how it shaped the future of AI research and machine learning. This video breaks down the origin of AI and its historical significance in modern technology.

How AI Is Quietly Revolutionizing the Way the Legal World Writes

There's a persistent image of the lawyer: brilliant, overworked, surrounded by mountains of paper, billing $800 an hour to draft language that hasn't meaningfully evolved since the 19th century. It's not entirely wrong. Legal writing is one of the most document-heavy, precision-demanding disciplines on earth. A misplaced comma in a contract has cost companies millions. A vague clause in a will has torn families apart.

Why Your AI CX Investment isn't Moving the Needle - An Honest Assessment

Your team deployed the conversational AI. Implemented sentiment analysis. Built real-time dashboards that show exactly when customers get frustrated. You can see that a customer is about to churn over a billing error. You know their satisfaction score dropped from 8 to 3. And yet nothing happens. The billing error persists. The customer leaves anyway. Your NPS hasn't moved in 18 months. This isn't a technology problem. It's an execution problem. And it's costing you customers.

The Tide of AI - Surfing the Tsunami of Binaries

AI is creating an overwhelming surge of digital artifacts and software components. The key to success is learning how to ride, secure, govern, and manage that wave – rather than being overwhelmed by it. This weekend, I asked my team to watch Chasing Mavericks. Jay Moriarity (not J-Frog, but stay with me) was one of the most driven and determined surfers imaginable. His courage and spirit were extraordinary. But those virtues were shaped and refined by his mentor, Frosty Hesson.

Why we open-sourced AURA: Infrastructure for production AI

Over the last year, I’ve talked to dozens of SRE teams about AI. The excitement is real, but conversations hit a wall when we get to production reality. How does an agent manage complex context without losing the plot? How does it avoid hallucinating relationships between signals? Who owns the orchestration logic that ties it all together? We realized the bottleneck wasn’t model intelligence. It was the lack of a reliable logic layer between the data and the model.

When AI Writes the Code, Who Pays the Cloud Bill?

This is part two of a series on the implications of AI-generated code becoming mainstream. We recently wrote about how AI-generated code is overwhelming SRE teams with production complexity they can’t manage. Turns out that’s only half the problem. The other half shows up on the cloud bill. A prospect reached out to us last month. They’d been using Cursor and Claude Code for six months, shipping features at unprecedented velocity. Product was thrilled.