Top tips: When "sounds right" isn't right

Top Tips is a weekly column where we highlight what’s trending in the tech world today and list ways to explore these trends. This week, we’re looking at why convincing AI answers can still be wrong and how to catch them before they slip through. AI doesn’t fail the way it used to. It doesn’t give obviously wrong answers. It gives answers that are just right enough to trust. And that’s exactly why we stop questioning it. It fits into our workflow so easily.

Faster fixes, less context sharing: how Grafana Assistant learns your infrastructure before you even ask

When an unexpected alert fires these days, most engineers' first move is to ask their AI assistant for help. You ask why your checkout service is slow and the assistant gets to work, but it can't get any meaningful insights, at least not quickly, without the proper guidance. So, the next thing you know, you're sharing details about your existing data sources, the services you have running, how they connect, which labels and metrics matter, and on and on.

Why dashboards still matter in the age of AI

I recently gave a talk at Experts Live India 2026 about SquaredUp, and even before getting into the demo, there was one question I knew I had to address: Is the dashboard era over? It's something we're all hearing more. "Just ask AI." "Agentic AI will build your dashboards automatically." "Why bother with static views when a chatbot can answer anything?" It's a fair question. Answering it requires a clear understanding of what a dashboard represents.

Context Engineering: How to Manage AI Context at Scale

Context engineering is the practice of managing the information an AI model sees (documents, tool outputs, memory, and structured metadata about the systems it reasons over) so it can make accurate decisions inside a real engineering organization. Most engineering teams have access to the same AI coding agents: Claude, GPT, Gemini, the major variants everyone is shipping. The model is no longer the differentiator.

Ticket Taker to Team Leader: Managing an Agentic IT Workforce

The promise of AI in IT service management has been circulating for years. Chatbots that deflect tickets. Virtual agents that answer FAQs. Automation that routes requests. These are useful, but probably not the dream-state you were originally sold. What's different today is the arrival of agentic AI: systems that don't just respond to instructions but reason, act, and adapt across multi-step workflows with real consequences. The question for IT leaders is no longer whether to adopt agentic ITSM.

DORA Metrics in the AI Era: Why Deployment Isn't Faster

DORA metrics in the AI era reveal a paradox: PR volume is climbing, but deployment frequency is staying flat. In this talk, GitKraken's Director of Product Jeff Schinella breaks down why AI-accelerated code generation is creating a review bottleneck that your DORA metrics can't fully explain on their own. Jeff walks through how PR metrics (cycle time, first response time, code churn, and PR size) serve as the leading indicators behind your DORA data. If your deployment frequency is flat while PR counts go up, the bottleneck isn't your devs. It's your review capacity.
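As a rough illustration of how the PR metrics named above could be computed, the sketch below derives median cycle time, first response time, and PR size from a few pull request records. The record fields (`opened_at`, `first_review_at`, `merged_at`, `lines_changed`) and the sample values are hypothetical; they are not taken from the talk or from any GitKraken API.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class PullRequest:
    # Hypothetical record shape; field names are assumptions for illustration.
    opened_at: datetime
    first_review_at: datetime
    merged_at: datetime
    lines_changed: int


def hours_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600.0


def pr_metrics(prs: list[PullRequest]) -> dict[str, float]:
    """Median cycle time and first response time (hours), plus median PR size."""
    return {
        "median_cycle_time_h": median(
            hours_between(p.opened_at, p.merged_at) for p in prs
        ),
        "median_first_response_h": median(
            hours_between(p.opened_at, p.first_review_at) for p in prs
        ),
        "median_lines_changed": median(p.lines_changed for p in prs),
    }


# Made-up sample data, only to exercise the functions above.
SAMPLE_PRS = [
    PullRequest(datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 13), datetime(2026, 1, 6, 9), 120),
    PullRequest(datetime(2026, 1, 6, 9), datetime(2026, 1, 7, 9), datetime(2026, 1, 9, 9), 800),
    PullRequest(datetime(2026, 1, 7, 9), datetime(2026, 1, 7, 11), datetime(2026, 1, 7, 21), 40),
]

if __name__ == "__main__":
    print(pr_metrics(SAMPLE_PRS))
```

Medians are used here rather than means so that one very large or very slow PR does not dominate the picture; that is a design choice of this sketch, not something the talk prescribes.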

Bindplane Now Ships With a Native AI Skill - Bring Your Own Agent

Today we're rolling out the Bindplane AI Skill, a built-in capability of the Bindplane CLI (v1.98+) that teaches your favorite AI coding tool how to work with Bindplane — natively, accurately, and without the setup headaches of traditional integrations. Read Part 2 of the Bindplane AI Skill series to learn more about how we built it and how it works with real-life examples.

Your Team is Using Claude Code. Do You Know What It's Costing You?

The first two weeks of Claude Code are exciting. The third week is when you realize you don’t have visibility into what it’s doing or what it’s costing you. You would not run a production service without metrics, logs, and dashboards or deploy an API without knowing its latency, error rate, or cost per request.

Moving On From MCP: How We Built the Bindplane AI Skill

If you've spent any time wiring AI coding agents into developer platforms over the last year, you've probably reached for MCP. We did too. And after enough sessions watching context windows balloon and tool calls misfire, we started looking for something different. This is the story of what we built instead — a native AI skill for the Bindplane CLI — and the engineering decisions behind it.

AI writes the code. Who delivers it safely? | Harness Blog

The question for enterprise AI in 2026 is no longer just which model. It’s which harness. An agent harness is the system around the model. It decides what the agent remembers, what context it sees, what tools it can call, what it is allowed to do, and what happens when it is wrong. The model provides intelligence. The harness provides control. This is where the real engineering is happening.

From PR to Production Without Leaving Your Cursor IDE | Harness Blog

TLDR: Today, Harness is introducing the Harness Cursor Plugin, bringing the power of the Harness AI-native software delivery platform directly into Cursor. This integration, along with the Harness Secure AI Coding hook for Cursor, allows developers and AI agents to move from code changes to vulnerability detection, CI/CD execution, security validation, approvals, deployments, and operational insight without leaving the editor. AI has completely changed how we write code.

7 best AI deployment platforms for production Kubernetes workloads in 2026

Training a model in a notebook is easy. What breaks teams is the step after: serving it reliably without haemorrhaging cloud budget or burying your SREs in YAML. The common trap is picking a platform that handles the model but not the surrounding stack. An AI deployment platform should orchestrate the full application graph (inference endpoints, vector databases, caching layers, and frontends) inside a single VPC, with GPU autoscaling that doesn't require a dedicated platform engineer to babysit.

How to use an SRE agent to reduce downtime

An alert in the middle of the night warns of a potential business failure. Manual incident response becomes more complex due to the overwhelming data from distributed and dynamic digital services. With an SRE agent, your engineering team can cut through alert clutter. They can sort through various signals quicker, decreasing burnout and achieving faster, more affordable resolutions. Operational resilience will see its next evolution with Agentic AI.

Detect, Communicate, Resolve: Checkly's Agentic Workflow End-to-End

Coding agents are the fastest-growing audience for the Checkly CLI, and we're doubling down on them. In this session, Stefan hands Claude a real e-commerce app, lets it set up monitoring with `npx checkly init`, generate Playwright tests through MCP, and walk an actual alert end-to-end with Rocky AI in the loop.

Meet AURA: The Open-Source Agent Harness for Production AI: Autonomous Incident Response Demo

Watch AURA autonomously respond to a production incident in real time—from building its reasoning context and querying PagerDuty and ClickHouse, to triggering a human-in-the-loop approval with the on-call SRE, to removing the stuck pod and validating remediation. Every behavior is defined in a simple config. AURA is Mezmo's AI-powered incident response agent built for platform engineers and SREs managing high-volume telemetry pipelines.

Harness Cursor Plugin Demo: AI for Software Delivery from Your IDE

Stop context-switching between your IDE and your CI/CD dashboards. In this video, we demonstrate the new Harness Cursor Plugin, a native integration that brings the full power of the Harness AI Software Delivery Platform directly into Cursor. Using the Cursor Agent window and the new Harness Model Context Protocol (MCP) server, you can now manage your entire software delivery lifecycle through natural language. From triggering pipelines to governing deployments, this plugin ensures you stay in your flow while maintaining enterprise-grade security and control.

Meet Auvik AI: Bringing Practical Intelligence to IT Operations

Across the IT industry, AI is being positioned as the next evolution of operations. But for many IT teams, AI still feels disconnected from the tools they rely on every day. Dashboards get smarter. Reports get faster. But workflows stay the same. Stuck in vendor silos or a CLI, IT teams have been looking for ways to bolt AI into workflows, but what often comes out is a Frankenstein-like web of APIs and MCP hosts. AI is meant to make life easier for IT teams – not make it more difficult.

How Auvik AI Solves the Biggest Challenges in IT Operations

Modern IT operations aren’t short on tools. Monitoring tools. Ticketing systems. Alerting platforms. Documentation repositories. Dashboards. Scripts. Runbooks. And yet, when something breaks, the workflow still looks strangely familiar: Somewhere along the way you’re asking yourself: Is the problem even here? This is the everyday friction of IT operations. Not the big outages. It’s the constant small mysteries that take far longer to solve than they should.

Securing the World's Biggest Machine: Critical Infrastructure, AI, and the Ethics of Innovation

What happens when decades of critical infrastructure experience meet today’s rapidly evolving AI landscape? In this episode, host Bob Slevin sits down with Ernie Hayden, award-winning author, former Navy nuclear officer, ethical hacker, and founder of 443 Consulting, for a deep dive into what it truly takes to secure modern, interconnected systems.

Two AI agents, one incident: Rocky AI comes to the terminal

A Playwright Check fails at 2 am. The login flow is broken. Until today, that alert triggered a human to get up, open the Checkly dashboard, copy Rocky AI's root cause analysis (RCA), and then tell an agent to get to work. There were two AI agents, one incident, and no way for them to talk to each other. The extended `checkly checks` and new `checkly rca` CLI commands close that gap. Your coding agent can now pull Rocky AI's analysis into its ongoing work, read the diagnosis, and go fix the code.

New in the Honeycomb Academy: Learn to Use the Honeycomb MCP

Two things happen when engineers first connect the Honeycomb MCP to their AI assistant. The first is the blank page problem. The Honeycomb UI gives you something to react to: a heatmap, a query builder, a trace to click into. An AI assistant gives you a cursor and nothing else. When you don't know where to start, that's a hard place to be. The second shows up right after you get past the first one. You ask a question, you get a confident-sounding answer, and you're not sure whether to trust it.

Building for Resilience: An Engineering Guide to the Mythos Era | Harness Blog

The release of Anthropic Mythos and Project Glasswing marks an exciting and pivotal new chapter in software development. As the industry advances, the speed and economics of vulnerability exploitation have fundamentally shifted. What once took weeks of manual reconnaissance can now be scaled rapidly through automated models. However, this is not just a security problem to solve. It is a massive engineering opportunity to build cleaner, more robust systems.

From Vibes to Signals: Observing Your AI Coding Workflow

Agentic coding tools like Claude Code and Codex have taken centre stage and inserted themselves into the critical path of software development. This shift has happened fast, and for most teams, the visibility hasn’t caught up. Until now we’ve been evaluating our vibe coding the same way – on vibes. You might say “this feels faster” or “that seems like a better approach”. That’s not going to scale.

Connecting Agents for Real-Time Root Cause Analysis with Checkly's Rocky AI

Rocky, Checkly's AI agent, monitors production sites and provides an analysis for every failing check. Previously, a coding agent couldn't access this analysis, leaving incidents and agents disconnected. Now, you can access all the analyses via the Checkly CLI (or API) and tell your coding agent, "Hey, I got a Checkly alert. Please investigate!" With Rocky's structured analysis delivered inline, the coding agent can start with a strong hypothesis, fix issues, and propose a PR in one session.

Inclusive AI vs. centralized AI: Can India avoid big tech concentration?

At the India AI Impact Summit in February 2026, 92 countries and international organizations (including the US, China, and the UK) signed a preliminary agreement that positions AI as both a development tool and a shared global responsibility. “India will not be a mere consumer in the AI age. We will be the creators, the builders, and the exporters of intelligence and we are proud to be able to participate in that future,” said Gautam Adani, chairman of the Adani Group.

Future-Proof your services with agentic AI Operations Cloud

Digital services are the engine of your modern business, but keeping them running feels like a constant battle. The rapid increase in the volume and speed of operational data is a direct result of growing architectures and more intricate workloads. Alert fatigue is causing your teams to be slow and reactive in addressing incidents, and this is a surefire path to burnout. The pace of this new reality is beyond what traditional, human-led processes can match.

How Mezmo Uses Active Telemetry for Faster AI Root Cause Analysis

AI-powered root cause analysis only works when the data going into the model is clean, relevant, and structured. In this demo, we show how Mezmo's Active Telemetry approach helps engineers and SREs move from noisy application errors to immediate clarity. Using a restaurant ordering application running in Kubernetes, we trigger a database connection pool exhaustion issue and walk through two ways to investigate it with Mezmo.

See how Mezmo's AI Assistant instantly pinpoints root causes

This video shows how Mezmo's AI Assistant turns noisy telemetry into clear answers when errors spike. By preprocessing data and surfacing only the most relevant patterns, Mezmo quickly identifies issues like database connection failures or resource shortages and delivers actionable recommendations. Watch how AI-powered root cause analysis helps teams troubleshoot faster and with confidence. Mezmo's AI Assistant is built for platform engineers and SREs who need fast, reliable root cause analysis across high-volume telemetry pipelines — without manually sifting through noise.

How to Improve Your IT Reliability as a Business Owner

Running a small company often feels like spinning plates. You handle sales, hiring, and finance, all while hoping the computers just work. When the Wi-Fi drops or a server crashes, everything stops. Improving your tech reliability is not about fancy gear. It is about creating a stable foundation for your daily operations.

Testing AI Image Platforms From The Prompt Up

Many AI image reviews begin at the end: they compare finished images and decide which one looks most impressive. That can be useful, but it misses something important. A finished image is only one part of the experience. The path from prompt to result matters just as much. When I tested AI Image Maker against other major platforms, I focused on how each product handled the full prompt journey, from the first instruction to the final usable image.

The Role Played by Artificial Intelligence in Product Design Nowadays

Ever since artificial intelligence became the new normal, building products has taken a completely different form. Previously, designers depended on guesses and long testing periods. That isn't the case anymore. AI can study data, spot the patterns in it, and suggest better options. It isn't surprising that it has now become a necessity for many companies.

Why Copilot alone won't fix your business workflows

Microsoft has been pushing Copilot hard over the past year. Between the rebrand of Office to Microsoft 365 Copilot, the launch of Copilot Tasks, and the more recent arrival of Copilot Cowork, there is a clear message: AI is supposed to handle the heavy lifting. For many businesses, though, the reality is more complicated than the marketing suggests. Copilot is a strong productivity tool within its own ecosystem, but expecting it to fix workflows that span multiple disconnected systems is where things start to fall apart.

Who's on call? How Claude helped us calculate this 2,500x faster

Schedules are a core part of any on-call system. In ours, they define who to page and when. But people use them in lots of other ways too: checking their next shift, asking for cover while at the gym, keeping a Slack user group up to date, or updating a Linear triage responsibility. For many of our customers, they’re one of the main ways they interact with our product, and as they’re such a foundational part of On-call, it’s very important they work well.

Introducing Seer Agent: The answer is already in Sentry. Now you can ask for it.

This is a story about an engineer’s night that could have been bad, but ended up… not so bad. A few weeks ago, on a Saturday, our AI debugger, Seer, started failing. The errors were generic failures from the LLM calls, nothing that pointed at a root cause. Most of the team wasn’t scheduled to be on this weekend, and it just so happened Indragie, our Head of AI, was online. He started paging engineers.

Context-Driven AI You Can Trust: How Edwin AI Earns Confidence in Production

Most legacy AIOps investments underdeliver because the AI lacks context, not capability. LogicMonitor’s latest innovations expand Edwin AI’s contextual intelligence across every dimension, so recommendations are accurate, explainable, and trusted by the teams that need to act on them. Reduce incident resolution time with AI that understands your environment—not just your alerts.

LogicMonitor Advances Autonomous IT with No Blind Spots, Trusted AI, and Closed-Loop Action

LogicMonitor is advancing Autonomous IT with one platform that brings together complete visibility, AI with context, and governed action across the digital environment. In this announcement video, Andrew Keating shares how LogicMonitor is helping enterprises reduce blind spots, trust AI more, and move from detection to action. Modern IT teams are managing more complexity, more tools, and more noise than ever. That’s why LogicMonitor is bringing infrastructure observability, Internet performance, digital experience, and AI-driven operations together in one platform.

LogicMonitor Advances Autonomous IT with No Blind Spots, Trusted AI, and Closed-Loop Action

LogicMonitor’s latest innovations span the entire platform to deliver the operational foundation enterprises need for Autonomous IT—complete visibility from infrastructure to end user, AI that reasons in full context, and closed-loop automation that moves from detection to resolution. Over 90% of organizations rely on at least two to three monitoring solutions—and many enterprises operate five or more.

Stop watching the looms: why the AI era belongs to infrastructure

I live in Manchester, England now. I moved here from Texas last summer (which is its own story), but the thing I wasn't prepared for is how the Industrial Revolution isn't history here. It's the city itself. And if you're American like me, you might need to hear this: the Industrial Revolution didn't start in the US. It started here. Manchester is where the modern world was born. You see it everywhere. The old cotton mills converted into apartments.

Your AWS Kiro Agent Can Now Query CloudZero. Here's What To Ask It

CloudZero's new AWS Kiro integration puts cost intelligence directly in your agentic IDE. Ask plain-language questions about spend, attribution, and cost-per-serve without leaving your development workflow. We see a similar pattern playing out across engineering teams running agentic development tools: code gets shipped fast, something moves in the cost data, and understanding why still requires leaving your environment entirely.

Your CEO Wants You To Ramp AI Usage Without Breaking Budgets. Here's How You Can Do It

Notes from a finance leader whose job this is. A few weeks ago, I traveled to Philadelphia for a conversation with a prospective CloudZero customer. We’d been working with the prospect’s engineering team for some weeks, demoing our platform in view of the RFP they’d drawn up. This stage had gone well, and so the next step was talking it over with the prospect’s CFO. We expected a conversation centered around the key criteria in the RFP.

Automate your critical workflows with AI agents in 5 steps

Many teams remain bogged down by operational chaos and manual drudgery, even with access to a variety of automation solutions. These tools often operate in silos, creating disconnected islands of automation that require significant human effort to bridge. Agentic AI offers a path forward, creating a cohesive system that can intelligently and autonomously handle complex operational workflows.

last9-genai: Closing the Conversation Gap in LLM Observability

OpenTelemetry's GenAI instrumentation gives you spans and token counts. It does not give you conversations, workflow cost rollups, or prompts visible in your dashboard. last9-genai is an OTel extension that fills those three gaps — without replacing your existing observability stack. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.
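To make the idea of a per-conversation cost rollup concrete, here is a minimal sketch that groups span-like records by conversation and sums their token counts. It is not last9-genai's API: the `conversation_id` attribute, the dictionary shape of the spans, and the per-million-token prices are all assumptions made for illustration. Only the `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens` attribute names follow the OpenTelemetry GenAI semantic conventions.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; real pricing varies by model.
INPUT_PRICE_PER_MTOK = 3.0
OUTPUT_PRICE_PER_MTOK = 15.0


def rollup_by_conversation(spans):
    """Sum token usage and estimated cost for each conversation.

    `spans` is a list of dicts with an "attributes" mapping. The
    "conversation_id" key is an assumed attribute, not a standard one.
    """
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for span in spans:
        attrs = span["attributes"]
        bucket = totals[attrs["conversation_id"]]
        bucket["input_tokens"] += attrs.get("gen_ai.usage.input_tokens", 0)
        bucket["output_tokens"] += attrs.get("gen_ai.usage.output_tokens", 0)

    result = {}
    for conversation, tokens in totals.items():
        cost = (
            tokens["input_tokens"] * INPUT_PRICE_PER_MTOK
            + tokens["output_tokens"] * OUTPUT_PRICE_PER_MTOK
        ) / 1_000_000
        result[conversation] = {**tokens, "cost_usd": cost}
    return result


# Made-up spans, only to exercise the function above.
SAMPLE_SPANS = [
    {"attributes": {"conversation_id": "a", "gen_ai.usage.input_tokens": 1000, "gen_ai.usage.output_tokens": 500}},
    {"attributes": {"conversation_id": "a", "gen_ai.usage.input_tokens": 2000, "gen_ai.usage.output_tokens": 1000}},
    {"attributes": {"conversation_id": "b", "gen_ai.usage.input_tokens": 500, "gen_ai.usage.output_tokens": 100}},
]

if __name__ == "__main__":
    for conversation, summary in rollup_by_conversation(SAMPLE_SPANS).items():
        print(conversation, summary)
```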

The Best AI Chatbots of 2026

AI has become an integral part of our lives; whether it’s for work or personal use, we all use AI in some form or another. However, deciding which AI is best depends on how you want to use it. Whether it's for general questions, coding, deep research, or image creation, we’re lucky enough that there is an AI model available to help you out.

15: Optimizing AI Workloads: Balancing Cost, Performance, and Scalability with Bijit Ghosh

In this episode, Andrew Hillier and Bijit Ghosh discuss the evolving landscape of AI, covering the growing prominence of inference over training, hybrid cloud strategies, balancing cost with performance, and the orchestration of complex hardware environments. The conversation also touches on emerging concepts like AI factories, the challenges of sovereign cloud, and how enterprises are navigating data gravity and regulatory constraints. It's a deep dive into optimizing AI infrastructure, managing costs, and the disruptive changes that are transforming both technology and business outcomes.

Demo - Selector Platform CoPilot Diagnosis

See how Selector’s AI Copilot accelerates issue diagnosis in real time. In this demo, watch how natural language queries and AI-driven insights help teams quickly analyze incidents, surface root cause, and understand impact - without digging through multiple tools. Instead of manual investigation, Selector guides operators to answers faster, reducing noise and speeding up resolution. Built for network and operations teams who need clarity, speed, and smarter troubleshooting.

Introducing the Cortex AI Assistant (now in Slack)!

Mention @Cortex in any Slack channel the Assistant has been invited to, public or private, and get grounded answers pulled from your Cortex data. Questions can be as simple as "who owns payments-api?" or as analytical as "what's driving our incident trends this quarter?" The Assistant pulls context from all across Cortex, including ownership, Scorecards, Initiatives, on-call, dependencies, and Eng Intelligence metrics, and holds context across a threaded conversation.

Accelerating AI Agent Development on Google Cloud with JFrog MCP Registry

Developers building agentic AI on Google Cloud have powerful infrastructure at their fingertips: Gemini 3 for reasoning, Google’s Agent Development Kit (ADK) for orchestration, and a rapidly expanding ecosystem of Model Context Protocol (MCP) servers that connect agents to data and tools. So why are so many teams still waiting weeks to ship their first agent to production?

What "AI-Ready Data" actually means for observability teams

Many organizations deploying AI are learning similar lessons right now: the challenge isn’t this or that AI model, it’s the data. According to Gartner, 60% of AI projects will be abandoned by organizations because of failures to support these projects with AI-ready data. Also, 63% of organizations either lack or aren’t sure they have the right data management practices to get there.

Why Your Agentic AI Aspirations Need to Evolve from Models to a Workflow Data Fabric

Enterprise conversations today are dominated by one phrase: Agentic AI. Across boardrooms and innovation labs, organizations are experimenting with copilots, autonomous agents, and AI bots capable of resolving tickets, recommending actions, and orchestrating complex processes. The promise is real — AI that doesn't just generate insights, but takes meaningful action. Here's the uncomfortable truth: most enterprises are architecturally unprepared for the agentic future they're trying to build.

Understanding disaggregated GenAI model serving with llm-d

llm-d is an open source solution for managing high-scale, high-performance Large Language Model (LLM) deployments. LLMs are at the heart of generative AI – so when you chat with ChatGPT or Gemini, you’re talking to an LLM. Simple LLM deployments – where an LLM is deployed to a single server – can suffer from latency issues, even with just one user. This can be because of a lack of memory bandwidth on the server, or because of KV cache pressure on system memory.
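To give a feel for why KV cache pressure adds up, the sketch below applies the standard back-of-the-envelope estimate for a transformer's KV cache: two tensors (keys and values) per layer, each sized by the number of KV heads, the head dimension, the sequence length, and the batch size. The model dimensions in the example are illustrative assumptions, not figures from llm-d or from any particular model.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes per value assumes fp16/bf16 storage
) -> int:
    """Estimate KV cache size in bytes for a decoder-only transformer."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return (
        2
        * num_layers
        * num_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_value
    )


if __name__ == "__main__":
    # Illustrative dimensions: 32 layers, 32 KV heads of size 128, 4096 tokens.
    size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
    print(f"{size / 1024**3:.1f} GiB for a single 4096-token sequence")
```

Under these assumed dimensions, one 4096-token sequence needs about 2 GiB of cache on top of the model weights, and the figure grows linearly with both sequence length and the number of concurrent requests.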

SRE agent vs. traditional engineer: 7 key differences

The role of a Site Reliability Engineer (SRE) is evolving. The focus has shifted away from simply working harder during an outage. A new kind of teammate is here to help: the SRE agent. But what are the key differences when you compare an SRE agent with a traditional site reliability engineer? This isn’t just a superficial change. It signifies a fundamental alteration in how teams construct and sustain dependable services.

Live Runtime Investigation in Claude Code with Lightrun MCP

In this video, Lightrun’s Dan Putman demonstrates what happens when Lightrun MCP is integrated within Claude Code. See how, once activated, Claude can ask specific questions about what services it can see and instrument in order to perform a deep investigation in production to get to a validated root cause analysis without the friction of redeploying or switching contexts.

Debug Live Production Apps in Codex with Lightrun MCP

Lightrun’s Dan Putman demonstrates the power of the latest Lightrun MCP skill. Watch how your AI code agent can now debug live applications directly in production. By connecting OpenAI's Codex to real-time runtime data via the Lightrun MCP, engineers can now generate and validate hypotheses using live telemetry and snapshots, without breaking flow. Ready to bring runtime context to your AI agents?

90% AI Adoption. Still Failing. DORA Explains Why.

AI adoption is nearly universal. So why are most teams still struggling? In this session from GitKon, Nathen Harvey, head of DORA at Google Cloud, shares findings from the 2025 DORA State of AI-Assisted Software Development report, drawing on data from nearly 5,000 developers worldwide. The answer isn't more AI. It's what surrounds it.

That's Not a Job for an LLM: The Right Way to Apply AI to Network Operations

LLMs have sucked all the oxygen out of the AI conversation — but AI is much more than just LLMs, and network engineers have been using AI techniques (machine learning, statistics, fuzzy logic, expert systems, neural networks) for decades. So what should LLMs be doing in network operations, what shouldn't they be doing, and how do agentic AI architectures fit in?

What is AI SRE? The Complete Guide to AI-Assisted Site Reliability Engineering

It's 2:47 AM. PagerDuty fires. You open a Slack alert and see: p99 latency spike on checkout-service. You SSH into the host, check dashboards in four tabs, grep logs for the last 20 minutes, and eventually find a slow query introduced in a deploy six hours ago. It took 34 minutes. You resolved it, w…

Code Agents Need Observability

For those of us using tools like Claude Code, Codex, or Gemini, we already know they’re powerful. They can write code, refactor functions, open PRs, even run commands. For a lot of developers, they’re already part of the daily workflow. But once you zoom out beyond the individual developer, the biggest problem isn’t productivity. It’s control. AI coding tools are powerful, but they introduce a new, unpredictable cost layer that most teams don’t fully understand.

How AI Is Reshaping Bill of Materials Management

Most of what gets written about AI in manufacturing is hype. I've sat through enough vendor demos to recognize the pattern: a slick interface, cherry-picked examples, and a vague promise that machine learning will "transform" something. Half the time the underlying problem could have been solved with a structured database and a junior analyst.

Sentry + Claude Agents: Automatic Bug Fixes from Root Cause to PR

Seer, Sentry's AI debugger, automatically analyzes your issues and finds the root cause. Now you can pass that analysis directly to a Claude agent - a managed agent session in the Claude Console at platform.claude.com. Once it's done, a link to the branch appears in Sentry so you can review and merge the PR. This video walks through how the integration works and how to set it up in under two minutes.

The Claude Bill is Too Damn High

Stop overpaying for AI reasoning by trading expensive GPU cycles for efficient, deterministic testing. This video explores how tools like linters and traffic replay can complement Claude, helping you fix bugs more accurately while cutting token usage by up to 50%. Visit: speedscale.com to learn more.

How is Agentic AI fundamentally different from earlier automation?

Autonomous operations has been the goal for years. But most “automation” never got us there; it just helped teams keep up. Now that’s changing. Agentic AI introduces a fundamentally different model: purpose-built agents, not static workflows; real-time decisioning, not predefined rules; and collaboration across agents, not isolated tasks. Instead of automating steps, agentic AI enables systems to reason, adapt, and act at a speed and scale humans simply can’t match. That’s what turns autonomous operations from a long-standing ambition into something actually achievable.

From Keyword Search to Ask AI: How We Upgraded AppSignal's Docs Experience

Documentation search is often the last thing devs think about, until someone posts publicly that they couldn't find a basic answer, or your support queue fills up with things that are genuinely in the docs. We decided to get ahead of that. This is the story of how we went from a minimal keyword-only search on our docs to a conversational Ask AI experience.

Shipping trustworthy code with Chunk CLI

AI coding agents are fast. They generate functions, refactor modules, and wire up boilerplate faster than any human. What they don’t do by default is enforce the conventions a specific team has agreed on: the lint rules, the review patterns that senior engineers flag on every PR. A generated diff looks clean until someone runs CI or reads it carefully.

How Diffusion Transformer Models Power Hyper-Realistic AI Avatar Videos

The AI avatar videos from a year ago still had a tell. The mouth movement was a little off, and the facial expressions were a bit stiff. That quality made it obvious that you were looking at a digital human and not a real one. The uncanny valley issue was not a small aesthetic problem; it was the only thing that stopped the practical adoption of anything other than novelty use cases.

Run Local LLMs on Mac to Cut Claude Costs

Part of the motivation for this post is how cloud API economics are shifting: Anthropic is moving large enterprise customers toward per-token, usage-based billing (unbundled from flat seat fees), which makes “always call the API” a moving cost line for teams at scale. A hybrid or local layer is one way to keep spend bounded while you still use premium models where they matter.

How it feels to run an incident with AI SRE

We've been building the broader incident.io platform for several years now, and one thing we've learned is that UX matters more here than almost anywhere else. When an incident fires, there's no room for poorly designed interfaces or fumbling through features you haven't touched in a while. The product has to be ergonomic: easy to pick up, easy to navigate, with the right things at your fingertips at exactly the right moment. We've put a lot of effort into this over the last 5 years.

AI for Incident Response: Should You Build or Buy?

SREs and platform teams are overwhelmed by the effort of manually troubleshooting ever-more complex cloud-native environments. This pain is driving a breakneck adoption of AI SRE solutions that promise to automate core reliability practices, from root cause analysis to capacity planning. For teams with strong engineering talent, creating a DIY AI SRE seems like a straightforward challenge.

Introducing Ubuntu 26.04 LTS | Resolute Raccoon

Ubuntu 26.04 LTS, codenamed Resolute Raccoon, is now available to download. Resolute Raccoon builds on the resilience-focused improvements introduced in interim releases, with TPM-backed full-disk encryption, improved support for application permission prompting, Livepatch updates for Arm-based servers, and Rust-based utilities for enhanced memory safety. This release also brings native support for industry-leading AI/ML toolkits like NVIDIA CUDA and AMD ROCm, making Ubuntu 26.04 LTS the ideal platform for AI development and production workloads.

AI agents are only as smart as the data you feed it

AI is only as useful as the context you give it. An autonomous observability agent can unlock serious value from your telemetry, but only when the foundation is right: good telemetry, a strong data layer, and efficient access to the data. Annie Freeman and Lewis Isaac had a lot to say about this at AWS Summit London this week!

Why Mandating AI Tools Backfires on Engineering Teams

Responsible AI adoption for engineering teams starts with culture, not compliance. In this GitKon talk, Rizel Scarlett (Tech Lead of Open Source DevRel at Block) shares how Block helped thousands of engineers actually want to use AI tools, including Goose, Cursor, Claude Code, and more, without mandates, vibe coding disasters, or security gaps.

Rootly's Dan Sadler: why AI coding tools are driving more incidents + why reliability is the product

Cortex co-founder and CTO Ganesh Datta sits down with Dan Sadler, VP of Engineering at Rootly. Dan explains how Rootly treats reliability as a product feature rather than just a technical metric, and why culture might be the most impactful element of building reliable systems.

Voices You Can't Trust: Securing K-12 Communications Against AI Deepfake Threats

It starts with a voice you recognize. A call from the superintendent asking for an urgent update. A voicemail from a principal requesting sensitive student information. A message that sounds authentic, because it is, at least on the surface. The tone, cadence, and even the subtle inflections are exactly right. But the request isn’t. AI-powered deepfakes are rapidly reshaping the threat landscape for K–12 schools, turning trusted communication channels into potential points of vulnerability.

Human First, AI Second: Cycle's Approach to AI Coding in 2026

It is easier than ever to launch a product from scratch. Today, AI can make your team of two feel like a team of ten almost overnight. Enterprises across the tech industry are completely restructuring engineering teams to double down on AI coding, often incentivizing engineers for the sheer amount of code they push. The AI revolution is incredible. So, you would be crazy not to hop on the vibe coding train, right? Well, it depends on what exactly you are building.

When agents orchestrate agents, who's watching?

You used to monitor services. Then you started monitoring AI calls inside services. Now your AI agent is spinning up other AI agents to complete tasks. Your old monitoring instincts need to evolve. This isn't hypothetical. Agentic architectures are already in production. Coding agents are calling search agents; orchestrators are spawning specialized sub-agents for retrieval, planning, and execution. Teams are shipping these systems faster than they're figuring out how to watch them.

What does using AI for post-mortems actually mean?

Everyone is using AI to help with post-mortems now. The pitch is obvious: post-mortems are time-consuming, the blank page is brutal, and AI is very good at producing structured, confident-sounding documents quickly. We're not here to push back on that. We've built AI into our own post-mortem experience, pulling your Slack thread, timeline, PRs, and custom fields together and giving your team a meaningful starting point in seconds. We think that's genuinely valuable, and the teams using it agree.

GPT Image 2 Brings Visual Work Closer

Most AI image tools are easy to praise in a vague way. They can generate striking pictures, imitate styles, and turn a short prompt into something that looks impressive enough to share. But that kind of praise has started to feel cheap. The image model market is crowded now, and "it makes beautiful images" is no longer a meaningful claim by itself.

What Is LLM Observability? For CFOs And Engineers, The Missing Layer Is Cost

You probably have Datadog. Maybe New Relic, maybe Dynatrace. Your observability stack has been solid for years — and you're still flying blind on AI cost. Here's why LLM observability needs a fourth pillar most tools skip, and how to build one that actually tells you what your models are costing you per request, per feature, per customer.

Blind Tokenmaxxing Is The New Cloud Waste. Focus on Outcome-Maxxing Instead

Meta's internal token leaderboard sparked a frenzy — and a reckoning. Tokenmaxxing without attribution is just cloud waste 2.0. Companies like Hudl and Duolingo use cost intelligence to connect every AI dollar to a business outcome.

Why Enterprise AI Demands More Than Just Automation

Based on insights from The Intelligent Enterprise podcast, “The Evolution from Automation to Autonomy.” Every couple of weeks, The Intelligent Enterprise podcast steps away from the day-to-day noise of enterprise life to explore big ideas from a fresh perspective. In one recent episode, the focus turned to a question many organizations are still grappling with: What does it really take to build an AI-powered enterprise that works with people, not against them?

Episode 10 - How I Learned to Stop Worrying and Love AI

Are we still in the first chapter of AI, and mistaking it for the whole story? In this episode of The Intelligent Enterprise, host Tom Stoneman zooms out from the headlines to explore where we really are in the AI journey. He’s joined by journalist and independent analyst Joe McKendrick, who has spent decades documenting how emerging technologies reshape business and society. As co-chair of the AI Summit in New York and a senior contributor to Forbes and ZDNet, Joe brings the perspective of someone who understands how these stories unfold over time.

The New Economics of Enterprise AI: Why Small Models Win Where It Matters

For years, progress in AI was equated with scale. Larger models, broader parameter counts, and increasingly complex cloud architectures were treated as signals of advancement. In enterprise operations, however, scale alone does not determine success. Economics does. As AI becomes embedded in operational workflows, organizations are discovering that model size is less important than cost stability under continuous load. AI-driven operations do not run in bursts. They run constantly.

The Regional Data Centre Revolution Powered by AI Demand

London still hosts the biggest concentration of UK data centre capacity, but the centre of gravity is starting to move. AI workloads are changing the infrastructure maths, pushing power, space and planning considerations up the decision list. That is exactly where regional locations start to look like the sensible option. Government data shows how concentrated the market remains: as of autumn 2024, London is estimated at 1,048MW of colocation IT load. Compare that with 44MW in the East of England, 17MW in the North East and 30MW in Scotland. The gap is huge, yet it is not a permanent advantage.

Before You Deploy Another Agent, Read This

Enterprise boardrooms are not debating whether to adopt agentic AI anymore. The debate has moved to a harder question: why do so many agentic deployments stall between pilot and production? ServiceNow's Enterprise AI Maturity Index 2026 puts a number to it. Most enterprises that have invested in AI tooling report that their biggest obstacle is not model quality or compute cost. It is the infrastructure that those agents are expected to operate within. The models are capable.

What's New in VictoriaMetrics Cloud Q1 2026? Logs, MCP Server, Better Alerting, and... a Secret Project

Q1 2026 has been one of our most eventful quarters yet for VictoriaMetrics Cloud. We shipped something we have been building towards for a long time, crossed a few infrastructure milestones, and started clearing the path for what is coming next to the most performant observability stack.

Grafana Assistant everywhere: Customize and connect to the AI agent to fit your specific needs

The ways you and your teams build and observe your systems are changing. It’s no longer just engineers looking at dashboards, or writing queries or config files. More often, it’s an agent interacting with the data, too, helping write code, run applications, investigate incidents, rightsize deployments, and more.

AI Observability in Grafana Cloud: A complete solution for monitoring your agentic workloads

The observability industry has developed great tools for using metrics, logs, traces, and profiles to monitor the cloud native applications that have dominated the last decade of software development. But when it comes to understanding what an AI system is actually doing, we’re often left reading raw conversations, guessing at quality, and reacting too late. And that’s a problem.

Introducing o11y-bench: an open benchmark for AI agents running observability workflows

Evaluating agents is hard. Verifying observability tasks is harder. Yes, AI agents have gotten dramatically and quantifiably better at coding and tool use, but observability presents a different kind of challenge. In a real incident, the hard part is rarely just writing a query. It's deciding which signal matters, figuring out whether a spike is noise or symptom, correlating metrics with logs and traces, and sometimes making a change in Grafana without breaking the dashboard another engineer depends on.

Claude Opus 4.7 Pricing In 2026: What It Actually Costs (And Whether It's Worth It)

Claude Opus 4.7 holds at $5/$25 per million tokens — but a new tokenizer inflates costs up to 35% on identical text. Here's what Opus 4.7 actually costs at production scale, how it compares to Sonnet 4.6, and the six levers that determine where your bill lands.
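To make the arithmetic behind that claim concrete, here is a minimal sketch. The $5/$25 prices and the 35% ceiling come from the blurb above. The request size is hypothetical, and the sketch assumes the inflation applies equally to input and output tokens, which the blurb does not state.

```python
# Prices per million tokens, as quoted in the blurb above.
INPUT_PRICE_PER_MTOK = 5.00
OUTPUT_PRICE_PER_MTOK = 25.00
# Upper bound on tokenizer inflation quoted above ("up to 35%").
MAX_TOKENIZER_INFLATION = 0.35


def request_cost(input_tokens, output_tokens, inflation=0.0):
    """Dollar cost of one request.

    `inflation` scales both token counts by (1 + inflation). Applying it
    to input and output alike is an assumption made for illustration.
    """
    scale = 1.0 + inflation
    input_cost = input_tokens * scale * INPUT_PRICE_PER_MTOK / 1_000_000
    output_cost = output_tokens * scale * OUTPUT_PRICE_PER_MTOK / 1_000_000
    return input_cost + output_cost


# Hypothetical workload: 2,000 input tokens and 500 output tokens per request.
baseline = request_cost(2_000, 500)
worst_case = request_cost(2_000, 500, MAX_TOKENIZER_INFLATION)

print(f"baseline:   ${baseline:.6f} per request")
print(f"worst case: ${worst_case:.6f} per request")
```

Because the list price is unchanged, the whole increase comes from the token count, so per-request cost scales linearly with whatever inflation a given text actually sees.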

Building for the Agentic Era: Engineering Excellence at Harness | Harness Blog

As AI agents become ubiquitous across the software development lifecycle, engineering teams must do more than adopt new tools; they must redesign how they build, verify, and operate software. This post distills the vision, priorities, and best practices that guide engineering excellence at Harness. Different products sit at the heart of the Harness platform.

Why Mid-Market IT Teams Are Drowning in Tickets - And How AI Concierges Are Finally Fixing It

Every IT leader I've spoken to at a mid-market company (50-500 employees) tells me some version of the same story. Their team is good. Their tools - usually ServiceNow, Jira Service Management, or Freshservice - are solid. But the volume of inbound requests is relentless. Password resets at 9am. VPN issues at 2pm. "My Zoom isn't working" at the worst possible moment before a client call. The tickets never stop, and the IT team never has enough bandwidth to focus on the work that actually moves the business forward.

Stop Fighting Your Mouse: How I Traded "Drafting Slavery" for Real Restaurant Design

I've spent the better part of twelve years in the hospitality design trenches. If there's one truth I've brought back from the front lines, it's this: the soul of a restaurant is decided long before the chef ever steps into the kitchen. It's won or lost in your Restaurant Floor Plan.

Uptrace MCP Server: Auto-Generate Dashboards with AI in Minutes

Tired of clicking through menus to build observability dashboards? In this video I walk through how to configure the Uptrace MCP (Model Context Protocol) server and connect it to an AI assistant so your dashboards get created automatically from natural-language prompts. By the end you'll have a working setup where describing what you want to monitor is enough to get a real, shareable dashboard in Uptrace.

Ivanti Launches Agentic AI on the System of Record You Trust

Investors and enterprises are finally asking the question they'd been avoiding: which software companies will survive the AI revolution, and which will be made obsolete by it? The answer is becoming clear. Companies that serve as the system of record, the authoritative source of truth that AI itself depends on, are essential.

Diff-erent Perspectives: How Specialized LLM Personas Catch More Bugs

We’ve built a multi-LLM PR reviewer that runs on every pull request in a couple of our own repos. Two independent models look at each change in parallel, each wearing a set of “persona hats” tuned to a specific area of the codebase. They compare notes, duplicates get stripped out, and the PR author ends up with a single review comment rather than a wall of noise.
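The merge step described above (two reviewers, duplicates stripped, one comment) could be sketched roughly as follows. Nothing here is the authors' implementation: the `Finding` type, the persona names, and the rule that two findings are duplicates when file, line, and normalized message match are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One review finding; every field name here is hypothetical."""
    file: str
    line: int
    message: str
    persona: str


def dedupe(findings):
    """Keep the first finding per (file, line, normalized message)."""
    seen = set()
    unique = []
    for f in findings:
        # Normalize case and whitespace so near-identical wording collapses.
        key = (f.file, f.line, " ".join(f.message.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique


def render_comment(findings):
    """Render the deduplicated findings as a single review comment."""
    return "\n".join(
        f"- {f.file}:{f.line} [{f.persona}] {f.message}"
        for f in dedupe(findings)
    )


# Hypothetical output from two independent models reviewing the same diff.
model_a = [Finding("api.py", 10, "Missing null check", "security")]
model_b = [
    Finding("api.py", 10, "missing  null check", "backend"),
    Finding("db.py", 4, "Unindexed query", "performance"),
]

comment = render_comment(model_a + model_b)
print(comment)
```

Exact-match keys only catch findings worded almost identically. A real reviewer would likely need fuzzier matching, since two models rarely phrase the same issue the same way.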

Qovery Q1 2026 Demo Day

See our latest retrospective and live updates. We're showcasing Event-Based Autoscaling via KEDA, allowing you to scale on business metrics that actually matter. We’ll also debut Copilot Troubleshoot to solve complex deployment failures instantly, demonstrate how MCP Agents are setting a new standard for your workflow, and share more about NGINX migration. Qovery is the Kubernetes management platform built for the AI era.

Building the AI Stack for Modern Network Operations - Surya Nimmagadda

AI is rapidly transforming network operations, but what does it actually take to build an AI stack that works in production? In this session from AI for Network Leaders – Powered by Selector, Surya Nimmagadda breaks down how modern AI systems for network operations are designed, deployed, and used today. This session is designed for network engineers, architects, and operators looking to move beyond theory and understand how AI is being applied in real production environments.

Inside the AI Agents Transforming Network Operations - Joby Rudolph & James Schnebly | Selector

AI agents are becoming a core part of modern network operations, but what does it actually take to build and deploy them effectively? In this session from AI for Network Leaders – Powered by Selector, Joby Rudolph and James Schnebly break down how AI agents are designed, implemented, and applied in real-world network environments. This session provides a practical look at how AI agents are moving from concept to production, and what it takes to make them work at scale.

AI Meeting Bots Were Just the Beginning. Meet the AI Collaborator

Why the next era of enterprise AI isn’t about note-taking — it’s about digital workers who actually show up and do the work. There’s a moment every IT operations leader knows well. A critical incident hits at 2 PM on a Tuesday. Within minutes, a war room meeting spins up — a Google Meet or Teams call crowded with network engineers, SRE leads, cloud architects, and storage admins, all staring at dashboards and talking over each other. Someone is manually pulling syslog data.

Debug frontend issues with AI: Real user monitoring meets the Coralogix MCP server

It is 2 AM. Someone on-call gets paged. Conversion rates on the checkout page dropped 30 percent in the last hour. The immediate questions are familiar. Is this a JavaScript error? A slow API call? A broken third-party script? A performance regression that never throws an exception but quietly drives users away? In most teams, answering those questions is not hard because the data is missing. It is hard because the investigation is split across too many places.

Dark Code: The AI-Generated Software Nobody Understands

The biggest risk to your product isn’t AI-generated code that doesn’t work. It’s generated code that seems fine. AI doesn’t optimize for correctness. It creates something passable. Something that passes the smell test. And when everybody in the industry is pushed to move faster and do more with less, you end up shipping software that looks correct. It passed your quick visual check. It passed all the tests. But no one ever fully understood it.

Beyond AI Vibes: Deterministic Foundations for Agentic Coding

Every week there is another model drop, another agent framework, and another workflow tweak you are supposed to evaluate. Meanwhile, the largest companies, the ones operating at the highest scale and leaning hardest on AI, are also the ones making headlines for reliability strain: capacity limits, outages, and services that buckle under load.

Agent Skills move too fast for git

Last month I was making a change to sx, our CLI. I updated a core flow, adding external catalogs as a source for sx add. Small change. Then came the testing. I knew I was messing with a core flow and wanted to be sure I hadn't broken anything. I spent about forty-five minutes setting up an isolated environment. Spinning up Docker. Fighting with tmux. Getting a clean install state I could run through the TUI a few times. Forty-five minutes of my afternoon that produced zero code. I complained in Slack.

Ivanti AEM 2026.2: New AI Intelligent Assist & Device View Updates

What's New with Autonomous Endpoint Management (UEM, DEX, Platform, UWM, EPM) 2026.2: Discover the future of IT operations with the latest Autonomous Endpoint Management (AEM) 2026.2 updates. This "Innovator Preview" explores how Ivanti is integrating Ivanti Neurons AI and Intelligent Assist to streamline device management, improve DEX, and enhance platform performance across UWM and EPM.

AI Everywhere & Zero Trust: Ivanti's 2026 Endpoint Management Strategy

What's New with Mobility MDM & EPMM 2026.2: Join Aruna Kuriti, Ivanti's Director of Product Management, as she unveils the Unified Endpoint Management (UEM) Strategy 2026. This "What's new" session dives deep into the Ivanti MDM and EPMM 2026.2 updates, focusing on key themes like AI Everywhere, Zero Trust, and a Unified Admin Experience.

Autonomous AI for Cloud-Native Cost Optimization: Balancing FinOps and Performance SLAs

Platform Engineering leaders are caught between two competing imperatives. You’re under pressure to flatten cloud spend but your team is still provisioning defensively because nobody wants to be the person who causes a production incident. You try to optimize, but six months later, when someone pulls a report, nothing has changed.

Your AI Agents Are Autonomous. But Are They Accountable?

Why accountability, not capability, is the real bottleneck for enterprise agentic AI, and what security leaders need to do about it before regulators force the issue. Every enterprise is building AI agents. Marketing has one summarizing campaign performance. Engineering has one triaging incidents. Customer support has one resolving tickets. Finance has one processing invoices.

MCP Apps: On Call Compensation Report and Service Dependency Graph

This April, PagerDuty's MCP server expands with powerful new capabilities across Analytics & Reporting and Business Services. Teams can now surface aggregate incident data, service metrics, and team metrics, giving operators instant access to the operational insights that matter most. On the Business Services side, the release adds business service dependencies, subscriber management, impacted services analysis, and priority mapping. Rounding out the release are two new MCP Apps (on our experimental branch): a Service Dependency graph and an On-call Compensation report.

The Hidden Knowledge Crisis Behind Every Repeat Truck Roll in Field Service: Can AI Help?

The organization ran a farewell. Someone brought a cake. And on that same afternoon, roughly 22,000 undocumented decisions, like repair workarounds, asset-specific judgment calls, the kind of pattern recognition that only comes from two decades of showing up, quietly ceased to exist. No system captured them. No handover covered them. They left with the person. This is the operational risk that most field service leaders are misreading.

6 Ways Ops Teams Can Align AI With Business Impact

AI adoption is at an all-time high, with over 70 percent of organizations using AI in at least one core function. Despite the high rate of AI adoption, many operational teams continue to have difficulty answering the question 'Is AI actually benefiting our business?' The challenge lies in the gap between AI systems and actual business results. Bridging the gap requires aligning operational AI with revenue, customer, and growth metrics. Here are actionable steps to transform AI from a technical tool into a measurable business contributor.

How Autonomous Technologies Are Streamlining Financial Operations for Modern Businesses

Modern businesses are under constant pressure to move faster, reduce costs, and stay compliant in a shifting regulatory landscape. Financial operations sit at the center of that pressure. Tasks like invoicing, reconciliation, reporting, and forecasting have traditionally required heavy manual effort. That is starting to change. Autonomous technologies are stepping in to handle routine processes, reduce errors, and free teams to focus on higher value work.

The Edwin AI Agent Orchestrator: Coordinated Incident Investigation Across the Tools You Already Use

Edwin AI’s Agent Orchestrator keeps incident investigation, context, and response aligned as work moves across tools, eliminating the manual handoffs that slow resolution. Every major incident has two timelines running in parallel. The first is the incident itself—services degrading, users affected, business impact accumulating. The second is quieter and just as costly: engineers switching tabs, re-explaining context to new responders, moving notes from one tool to another by hand.

AI for Everything After Code: Ship Fast, Stay Safe

Recorded at @DevOpsLive. Most teams have “done DevOps” and “built a platform,” but still wrestle with the same core problems: platforms that developers dodge, AI that accelerates coding while quietly degrading delivery performance, security and compliance that can’t keep up, cloud bills that keep climbing, and incident response that hasn’t caught up with cloud‑native complexity.

PagerDuty Invests in the AI-First Operations and Resilience of Healthcare and Crisis Response Organizations

At PagerDuty, we believe operational excellence and social impact are inseparable. As AI rapidly transforms how nonprofits operate, our AI and agentic technology empower mission-driven teams to automate complexity and focus their limited resources on what matters most: delivering reliable services that create meaningful impact at scale.

Debugging multi-agent AI: When the failure is in the space between agents

I've been building a multi-agent research system. The idea is simple: give it a controversial technical topic like "Should we rewrite our Python backend in Rust?", and three agents work on it. An Advocate argues for it, a Skeptic argues against, and a Synthesizer reads both briefs blind and produces a balanced analysis. Each agent has its own model, its own tools, its own system prompt. It worked great in testing. Then I noticed the Synthesizer kept producing analyses that leaned heavily toward one side.

A Prototype's Worth 1,000 Minutes: How Claude Prototypes Accelerate The Product Planning Process

The relationship between product managers (PMs) and engineers is due for an upgrade. The division between these personas is responsible for a healthy, if laborious, collaboration when envisioning and building new products. A PM generates the vision; engineers translate it into an architectural approach, raising the technical questions that sharpen it along the way. This back-and-forth eventually produces tight alignment, a solid PRD, and functional code.

You're Running Agents. Your Tooling Is Still Catching Up.

Introducing GitKraken Desktop 12.0. At some point in the last year, the question shifted. It stopped being “should I use AI coding agents?” and became “how do I run more than one at a time without losing my mind?” If you’ve been there, you know what the management layer looks like. A terminal per agent. A worktree created by hand before each session.

Route OTel data from AI apps to ClickHouse and Datadog using Observability Pipelines

As organizations continue to heavily invest in AI and build more agentic workflows, their telemetry data volumes can surge quickly, and the associated costs can become unpredictable. To regain control of their data, many AI-forward teams are turning to high-throughput, low-latency pipelines to collect and route data to tools such as OpenTelemetry (OTel) and ClickHouse. But these self-hosted solutions come with drawbacks.

Auto-Generate Tests for Your Codebase with AI (CircleCI Chunk Tutorial)

AI coding tools help you ship features faster than ever, but test coverage often can't keep up. In this video, we show you how CircleCI's Chunk autonomous CI/CD agent finds untested code in your codebase, writes tests to cover it, and opens a pull request for your review. Chunk works directly inside your CI/CD pipeline, giving it access to your build history, test results, and coverage reports. That means smarter tests, not just more tests.

Sentry Built AI Dashboards: Monitor Your AI Agents End-to-End

Building AI applications? There's a lot more to monitor beyond errors. With tracing enabled, Sentry's built-in AI Dashboards give you deep visibility into how your agents are actually performing. This video walks through three key dashboard views. You'll also see how to drill from a dashboard widget straight into the trace explorer to pinpoint the root cause of errors, how to duplicate and customize dashboards to fit your needs, and how to set up monitors with alert thresholds, such as getting notified if your LLM calls exceed 20 seconds.

In the Age of AI, Taste Isn't About Aesthetics

AI can generate a UI in seconds. So what do designers actually bring to the table? Marcela, Principal Product Designer at Rootly and former Founding Designer at Ramp, has spent 20 years in design. Her answer: taste isn't about aesthetics or crafting pleasant interactions. It's about asking the uncomfortable questions, and choosing the right problem, not the easiest one.

What Parents Should Know About AI Essay Grader Tools

Artificial intelligence is showing up in more classrooms than ever before, and parents are right to have questions. One area that has grown quickly is AI-powered writing assessment. Schools and teachers are increasingly turning to automated tools to help manage the workload of grading student essays, and while this might sound like a behind-the-scenes administrative change, it directly affects how your child receives feedback on their writing. Understanding what these tools do, how they work, and what they cannot do will help you stay informed and involved in your child's education.
Sponsored Post

How to Set Up Raygun's Remote MCP Server in Cursor and Codex

After introducing Raygun's original MCP server and our new remote-first version, the most common question we hear is: "How do I actually set this up and start using it?" This guide covers exactly that, two short videos walking through setup and a real error being solved in both Cursor and Codex.

Building Agent-Friendly CLIs - What we learned at Checkly

Building Agent-Friendly CLIs: Why Your AI Agent Already Loves the Checkly CLI. Stefan explains why products, docs, and CLIs must be AI-ready as coding agents rapidly become primary users of the Checkly CLI. He outlines key CLI features for agent workflows, then demos how an agent initializes a project-tailored Checkly setup from scratch without any human intervention. He also shows how agents can entirely automate the incident lifecycle, from resolution to status page communication.

GitKraken Desktop 12.0 Release: Agent Sessions, Terminal Performance Boosts, and More!

If you're running Claude Code, Codex, or Gemini, managing multiple sessions means one terminal per agent, status checks by window-switching, and worktree setup from scratch every time. GitKraken Desktop 12.0 adds structure to that workflow. It works with Claude Code, Codex CLI, Copilot CLI, Gemini CLI, and OpenCode.

AppSignal MCP Now Supports OAuth - and GitHub Copilot

When we launched AppSignal MCP in beta, OAuth was on the roadmap but not yet shipped. We were issuing static bearer tokens — enough to connect Claude Desktop, Cursor, and Windsurf, but not the one-click install path in the MCP Registry, and not GitHub Copilot's recommended setup. That's fixed.

Introducing the CloudZero AI Prompt Catalog: 46 Ready-to-Use Prompts for Cost Intelligence

In early March, we launched the CloudZero AI Hub and the CloudZero Claude Code plugin, giving customers a direct line to their cloud and AI cost data through natural language. Early adopters and power users have already jumped in, using the plugin to investigate cost spikes, close commitment gaps, and get to cost-per-unit metrics that used to take days to pull together. What we’ve noticed over the past few weeks is pretty consistent (and predictable).

Webinar recap: Cost Intelligence for the AI Era

CloudZero’s Umesh Rao and Larry Advey showed what it actually looks like to connect AI to real cloud cost data, and the results are hard to unsee. On April 9, 2026, CloudZero hosted a live webinar, Cost Intelligence for the AI Era, featuring Umesh Rao, Director of Enablement, and Larry “Fred FinOps” Advey, Director of Cloud Platform & FinOps.

What Is an AI SRE? And Why Do They Need Live Runtime Evidence?

AI SREs are autonomous systems that handle incident triage, root cause analysis, and remediation by correlating logs, metrics, traces, and code signals. However, as they rely on pre-configured telemetry, the critical execution details of a specific failure, such as variable state and code paths, can often be missed. As a result, they either force users into manual redeploy loops or make inferences from partial data, diagnosing issues using probability rather than proof.

Beyond the Prompt: AI Agent Design Patterns and the New Governance Gap

If you are treating Large Language Models (LLMs) like simple question-and-answer machines, you are leaving their most transformative potential on the table. The industry has officially shifted from zero-shot prompting to structured AI agent design patterns and agentic workflows where AI iteratively reasons, uses external tools, and collaborates to solve complex engineering problems.

AI vs. Hype: Redefining Engineering Excellence with Ron Miller

In this episode of "ShipTalk: Engineering Excellence," host Thomas Dockstader sits down with Ron Miller, editor at Fast Forward, to discuss the real-world impact of AI on software development. They dive deep into the maturity of AI-driven code, the rise of the "citizen developer," and why traditional writing and communication skills are becoming the new must-have for modern engineers.

How AI-Powered Phishing Is Changing What 'Suspicious Email' Looks Like

For years, spotting a phishing email was almost a checklist exercise. Look for typos, watch for broken grammar, be suspicious of generic greetings like "Dear user," and check if the sender's address looks strange. That mental model worked because phishing emails actually looked bad. That is no longer true. With the rise of AI, attackers can generate emails that are grammatically perfect, context-aware, and indistinguishable from legitimate business communication. The obvious red flags are gone. What used to look suspicious now looks completely normal.

The Trust Layer: Why Enterprise AI Needs a Gateway Before It Needs More Models

Enterprise AI does not have a model problem. It has a trust problem. Before organizations invest in larger models or additional agents, they need a control layer that governs how those agents operate inside production systems. Without that layer, autonomy does not scale. If you talk to any enterprise leader right now, you’ll hear the same question.

The AI Zero-Day Wave Is Here. Is Your Logging Infrastructure Ready?

Last week, the cybersecurity industry received a signal it cannot afford to ignore. Anthropic announced Claude Mythos Preview: a general-purpose frontier AI model that, without any explicit training for the task, autonomously discovered and fully exploited zero-day vulnerabilities across every major operating system and web browser. Not theoretical capabilities.

User Feedback to Pull Request in Minutes with Cursor + Sentry

Cursor Automations + Sentry Triggers: go from user feedback to a pull request automatically. See how to set up an end-to-end workflow that turns feedback into code changes, posts the PR to Slack, and keeps your team in the loop. In this video, we walk through a real-world example using Sentry Docs. A user submits feedback through a widget on the docs site, it lands in Sentry as an issue, and when assigned, a Cursor Automation kicks off. The automation reads the feedback, validates it, generates a PR against the repo, and posts the link in the relevant Slack thread. No manual work required.

Offline evaluation for AI agents: Best practices

If you’re building LLM-powered applications and agents, you’ve probably asked yourself: “How do I know if my changes actually made things better?” You can tweak prompts, adjust temperature settings, or try different models, but it’s not always easy to validate whether version B’s response is better than version A’s. Most teams fly blind in preproduction and rely on user feedback to see how well their application works in the real world.
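One way to picture the "is version B better than version A?" question is a tiny offline harness that scores both versions against the same fixed dataset. This is only a sketch under stated assumptions: `evaluate`, `exact_match`, the dataset, and the two canned "versions" are hypothetical stand-ins, not part of any product mentioned here, and real evaluations usually need richer scorers than exact match.

```python
def exact_match(expected, actual):
    """Hypothetical scorer: 1.0 when answers match ignoring case and whitespace."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def evaluate(app, dataset, scorer=exact_match):
    """Run every test case through `app` and return the mean score."""
    scores = [scorer(case["expected"], app(case["input"])) for case in dataset]
    return sum(scores) / len(scores)


# Hypothetical fixed dataset, reused unchanged for every version under test.
DATASET = [
    {"input": "capital of France", "expected": "Paris"},
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of Japan", "expected": "Tokyo"},
]


def version_a(prompt):
    # Stand-in for an LLM call made with prompt/settings A (canned answers).
    canned = {"capital of France": "Paris", "2 + 2": "5", "capital of Japan": "Kyoto"}
    return canned[prompt]


def version_b(prompt):
    # Stand-in for an LLM call made with prompt/settings B (canned answers).
    canned = {"capital of France": "paris", "2 + 2": "4", "capital of Japan": "Kyoto"}
    return canned[prompt]


score_a = evaluate(version_a, DATASET)
score_b = evaluate(version_b, DATASET)
print(f"A: {score_a:.2f}  B: {score_b:.2f}")
```

Because both versions see identical inputs, any difference in the mean score comes from the change being tested rather than from the traffic mix.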

Stopping Kubernetes cloud waste: agentic automation for enterprise fleets

Agentic Kubernetes resource reclamation is the practice of using an autonomous control plane to continuously identify, suspend, and delete idle infrastructure across a multi-cloud Kubernetes fleet. It replaces manual cleanup and reactive autoscaling with intent-based policies that act on business state, eliminating the configuration drift and cloud waste typical of unmanaged fleets.
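As a rough illustration of an intent-based policy that acts on business state, the sketch below maps a workload to a keep, suspend, or delete decision. Everything in it is an assumption made for illustration: the `Workload` fields, the thresholds, and `reclamation_action` are hypothetical and do not describe any vendor's control plane or a Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    idle_days: int        # days without meaningful traffic (hypothetical signal)
    project_active: bool  # business state, e.g. whether the owning project is still open


def reclamation_action(w, suspend_after=7, delete_after=30):
    """Return 'keep', 'suspend', or 'delete' for one workload.

    Hypothetical policy: delete only when the project is closed AND the
    workload has been idle for a long time; otherwise suspend anything idle
    or orphaned; keep everything else.
    """
    if not w.project_active and w.idle_days >= delete_after:
        return "delete"
    if not w.project_active or w.idle_days >= suspend_after:
        return "suspend"
    return "keep"


fleet = [
    Workload("checkout-prod", idle_days=0, project_active=True),
    Workload("demo-env", idle_days=10, project_active=True),
    Workload("old-poc", idle_days=45, project_active=False),
]
decisions = {w.name: reclamation_action(w) for w in fleet}
print(decisions)
```

Keeping the decision in a pure function makes the policy easy to review and dry-run before anything is actually suspended or deleted.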

Building an agentic content production system with Claude Code

This post by an engineer explains how his team uses the .claude folder in Claude Code. The folder is the hidden directory where you store context files, behavioral rules, and automated workflows so Claude understands how to operate in a specific project. He’d set up coding conventions, tool configs, CI integrations. Very engineering-brained. The tool is called Claude Code, so fair enough. I run a web and content team. We write blog posts, tutorials, and technical guides for a living.

Top 6 AI SRE Tools and Why Runtime-Grounded Reliability Is the New Standard

AI SRE tools accelerate incident detection, root cause analysis, and remediation across distributed production systems. They ingest telemetry signals, including logs, metrics, traces, alerts, and deployment history, to correlate anomalies, narrow fault domains, and reduce manual triage. This guide breaks down the top AI SRE tools in 2026 and helps you choose the right one based on your team’s biggest bottleneck, whether that is faster triage, deeper root cause analysis, or runtime-level validation.

The quiet problem underneath modern software delivery: database change at scale

Application delivery has accelerated over the last decade. Modern CI/CD pipelines, automated testing, and cloud infrastructure have already raised the baseline. Now AI-assisted coding tools are compressing timelines further still - developers are writing and shipping code faster than ever.

(AusBiz) JFrog teams up with Nvidia to manage AI agents

AI agents are making real-time decisions inside enterprises right now: pulling code, accessing tools, executing tasks. But most businesses have zero visibility into what those agents are actually using. In this interview on @ausbizTV, Sunny Rao, SVP APAC at JFrog, explains why the governance gap is one of the biggest risks facing enterprises today, and how JFrog and NVIDIA are building the trust layer to fix it.

From Stack Trace to Probable Cause: AI Root Cause Analysis Is Here

You know the drill. An error fires, you get the stack trace, and then you spend the next 45 minutes tracing it backward through four services, two config files, and a deploy that happened three hours ago. You eventually find the root cause, but the path to get there was manual, slow, and entirely dependent on how well you already knew the codebase. We built AI-powered root cause analysis (RCA) for that kind of slog.

AI Factories Will Be Won on Efficiency: Why the Kubex + Rafay Partnership Matters

The early era for AI was defined by experimentation, standing up isolated environments, and finding the first practical use cases. Today, the conversation is different. Enterprises are no longer asking whether AI matters. They are asking how to scale it sustainably, securely, and economically. That shift is giving rise to the AI factory: a repeatable, governed, production-ready environment where data scientists, platform teams, and application teams can build, train, deploy, and operate AI at scale.

Optimizing the OpenTelemetry Python SDK for LLM Workloads

Agentic workloads thrive with precision tooling. Just like developers, they need the rich context, high cardinality, and fast feedback loops that allow them to ask exploratory open-ended questions of their code. But instrumentation is costly, and from the dawn of software, developers have tried to do the most possible with the least amount of resources.

Your AI Agents Are Only As Good As Your Data | Harness Blog

Every agent demo follows the same arc. The agent calls an API. A deployment triggers. A ticket gets created. The audience is impressed. Then someone asks a real question: "Which regions had the highest order failure rate this quarter, and are any of them linked to vendor SLA breaches?" That question crosses four entity types — orders, fulfillment records, vendors, SLA contracts.

Getting more out of Playwright CLI: a practical guide for QA and DevOps teams

If your team runs Playwright tests in CI, you already know the npx playwright test drill. It works fine until your suite crosses a few hundred tests. Then things get messy. Flaky reruns stack up. Debugging means downloading trace zip files and opening them on your laptop. Reports? Static HTML files that people stop checking after day 3.

Claude outage April 2026: what happened and how it was detected early

On April 9, 2026, Claude experienced a widespread but inconsistent outage that left many users unable to access or interact with the service. StatusGator detected the issue early and sent an Early Warning Signal 59 minutes before the provider officially acknowledged the outage. This incident highlights how early detection can provide critical lead time when official status pages lag behind real user impact.

From One Month to One Day: How CloudZero Builds Cloud Cost Connectors at the Speed of AI Adoption

Not long ago, adding a new cost connector to CloudZero was a serious undertaking. We’d task multiple engineers, build in extended review cycles, run a private preview period. But a single connector could take up to two months from kickoff to customer hands. For the major cloud providers, that timeline was acceptable. The size of the investment matched the scale of the integration. But the tools landscape has changed. Our customers’ teams don’t just run on AWS and Azure.

The Runbook Problem: How AURA Documents What Teams Don't Have Time to Write

Runbooks are rarely missing because teams don't value them. They're usually missing because incident response, follow-up, and platform work compete for the same limited time. By the time an issue is resolved, the knowledge is fresh, but the window to document it is already closing. That gap creates familiar failure modes: over-reliance on senior engineers, slower handoffs, and less confidence for whoever is on call next.

Unlocking Security Potential for AI: Introducing the Harness WAAP MCP Server | Harness Blog

Security teams face overwhelming amounts of data and complex interfaces, making it hard to access critical insights. AI tools promise solutions, but integration remains difficult as time ticks away and leadership wants the latest data to inform risk decisions. Most security platforms lack seamless integration, slowing access to important data and hindering AI-powered workflows.

Tech Talk | AI Agents in O11y Cloud

Transform reactive incident response with Splunk’s troubleshooting agents, designed to drastically reduce mean time to identify and resolve issues. This session demonstrates how a multi-agent approach empowers teams of all skill levels to pinpoint root causes, prioritize issues by business impact, and prevent future outages. Tech Talk sessions offer insightful and valuable deep-dives for any technical practitioner.

How Agentic AI Powers Hybrid and MultiCloud Operations

Hybrid and multi‑cloud environments didn’t break operations—they simply outpaced the human ability to manage them. Gartner predicts that 90% of organizations will adopt a hybrid cloud approach through 2027, confirming that multi-vendor estates are now the permanent operating model. Yet, as environments grow more distributed, a “Complexity Gap” has emerged.

In the Age of AI, Operational Memory Matters Most During Incidents

Artificial intelligence is making software easier to produce. That much is already obvious. Code that once took hours to scaffold can now be drafted in minutes. Boilerplate, integration logic, tests, refactors and small internal tools can be generated with startling speed. In some cases, even substantial pieces of implementation can be assembled quickly enough to make older assumptions about software effort look dated. It is tempting, then, to conclude that the hard part of software is receding.

The Real Path to AI Automation Starts With Less Fragmentation

Fragmentation limits AI automation because context is split across systems, forcing humans to bridge the gap. Most IT environments are fragmented by design. Observability data lives in one set of systems, investigation happens in another, and execution sits behind separate tools with their own ownership and controls. During an incident, context does not move with the work.

The History of AI in IT Operations: How We Got to Autonomous IT

Autonomous IT is the result of a long operational evolution, from static monitoring and rule-based automation to AIOps and now to systems that can increasingly diagnose, prioritize, and act within defined guardrails. Autonomous IT gets talked about like it appeared out of nowhere. As if someone flipped a switch and suddenly systems started managing themselves. The reality is far less dramatic and far more instructive. What we’re seeing today is the result of decades of incremental progress.

Why Your Website's FAQ Page Is Failing Visitors And How AI Search Can Fix It

Your FAQ page should be your hardest-working asset, but it's probably doing the opposite. Instead of guiding visitors, it's slowing them down. People land there with simple questions, face cluttered layouts or outdated answers, and leave without clarity. That frustration doesn't just hurt user experience; it quietly impacts conversions, trust, and even your search visibility. The good news? You don't need a full redesign to fix it. Most FAQ issues come down to relevance, structure, and how easily answers can be found. When those three things break, everything else follows.

Sample AI traces at 100% without sampling everything

A little while ago, when agents were telling me “You’re absolutely right!”, I was building webvitals.com. You put in a URL, it kicks off an API request to a Next.js API route that invokes an agent with a few tools to scan it and provide AI generated suggestions to improve your… you guessed it… Web Vitals. Do we even care about these anymore?

The Path to AI-Ready Operations Begins with Truth

Enterprises expect AI to improve how they operate, yet many underestimate the level of clarity required for intelligent systems to perform reliably. AI-assisted operations demand input signals that are accurate, consistent, and interpretable. They require a unified understanding of how services behave, how disruptions originate, and how decisions influence downstream outcomes. This level of coherence is impossible without operational truth.

Testing AI with AI: Why Deterministic Frameworks Fail at Chatbot Validation and What Actually Works | Harness Blog

Chatbots are becoming ubiquitous. Customer support, internal knowledge bases, developer tools, healthcare portals - if it has a user interface, someone is shipping a conversational AI layer on top of it. And the pace is only accelerating. But here's the problem nobody wants to talk about: we still don’t have a reliable way to test these chatbots at scale. Not because testing is new to us. We've been testing software for decades.

Why Connected Platforms Will Power the Next Generation of AI in Engineering | Harness Blog

AI is quickly becoming part of the engineering workflow. Teams are experimenting with assistants and agents that can answer questions, investigate incidents, suggest changes, and automate parts of software delivery. But there is a problem hiding underneath all of that momentum. Most engineering environments were not built to give AI the context it needs. In many organizations, the service catalog lives in one place. Deployment data lives in another. Incident history sits in a separate system.

Komodor Provides Autonomous AI SRE Troubleshooting for ClusterAPI

Cluster API (CAPI) is transforming how organizations deploy and manage fleets of Kubernetes clusters by introducing declarative, Kubernetes-style APIs to automate cluster provisioning and lifecycle management. While CAPI excels at creating consistent and repeatable cluster deployments across different infrastructure providers, operating it at a massive scale introduces unique day-to-day challenges.

Introducing OrionIQ: The End of Manual Observability

OrionIQ is Logz.io’s new agentic observability platform designed to move teams from detecting issues to resolving them automatically. As AI accelerates software development, operations remain manual: engineers still wake up at 2 a.m. to investigate alerts and rebuild context. OrionIQ uses AI agents to analyze real-time telemetry, investigate incidents, identify root causes, and take action across systems.

7 AI productivity lessons from the CTO of Superhuman

Most companies have built AI into their product by now, and many consider it the central feature of what they’re building. But plenty of those same companies are still figuring out how to get their own engineering teams to actually use AI tools day to day. When Loïc Houssier joined Superhuman as CTO in early 2025, his team was in that exact spot. The company had been shipping AI email features for years, but internal adoption of AI dev tools was still early.

AI Enablement for Dev Teams: The 6-Pillar Flywheel

AI adoption is already happening on your team, whether you have a strategy or not. Tracy Lee (CEO of This Dot Labs, Microsoft MVP, Google Developer Expert) breaks down the AI Enablement Flywheel — a 6-pillar framework used by successful engineering organizations to move from scattered experimentation to scalable, ROI-positive AI workflows.

Rovo Chat in Bitbucket now understands your Pipelines

Why did your build fail? Ask Rovo, get a clear answer, and even a way to fix it, from anywhere in Bitbucket. Pipeline debugging is one of the most common and most painful parts of the development workflow. In our Atlassian research, "AI adoption is rising, but friction persists," over 50% of developers reported losing more than 10 hours each week searching for information, onboarding to new code, or toggling between apps.

AI Didn't Change the Game, It Just Exposed Your Bottlenecks w/ Ganesh Datta (CTO, Cortex)

Every engineering org says they want to improve reliability — but most can't even agree on what "good" looks like. Ganesh Datta, Co-Founder and CTO of Cortex, has spent the better part of a decade helping companies confront that gap.

Every engineering org is taking an AI readiness test right now

Tamar Bercovici has been at Box for 15 years. She leads the core platform, the backend layer that storage, search, metadata, and AI capabilities all run on. When her systems go down, Box goes down. On a recent episode of the Braintrust podcast, she said the debate around AI-generated code tends to focus on whether the models will write clean code and/or introduce bugs. Tamar's focus is somewhere else entirely.

Top 5 Must-Have Integrations for Your Zendesk Suite in 2026

Modern customer support demands more than a basic ticketing system - it requires strategic Zendesk integrations that connect your support team with AI automation, real-time analytics, quality control, multilingual content, and unified customer data. In 2026, businesses that fail to build this integrated ecosystem will struggle to meet rising customer expectations for speed, personalization, and seamless self-service across channels.

Cracking the Code: How Undetectable AI Actually Works to Bypass Modern AI Detectors

In the rapidly evolving digital landscape of 2026, the tug-of-war between artificial intelligence and content authenticity has reached a fever pitch. As creators, marketers, and SEO specialists, we find ourselves in a constant cycle: we use AI to scale production, only to be met by increasingly sophisticated AI detectors designed to flag our work as "robotic."

Episode 9 - AI, Enterprises, and the Law

In this episode of The Intelligent Enterprise, host Tom Stoneman takes us inside the different ways that AI is being utilized in the practice of law. Tom is joined by Vintee Mishra, an attorney who’s currently part of the Commercial Contracting Organization at Navy Federal Credit Union, and has previously occupied supporting roles at Tata Consultancy Services, Cisco, First Technology Credit Union, and Moody’s Analytics.

Where Most Operational Waste Comes From-and How AI Automation Cuts It

Most operational waste comes from fragmented workflows rather than individual performance constraints. An incident begins long before any fix is applied. Alerts trigger, tickets open, and engineers start reconstructing context across systems that were never designed to operate as one. Logs, metrics, past incidents, and runbooks sit in separate tools, each requiring manual lookup, interpretation, and validation before any decision can be made.

Not All Agents Are Created Equal: Getting Agentic AI Right for IT

Three months ago, a CIO told me her organization had “already deployed agents.” Her endpoint team assumed she meant the telemetry clients on every managed laptop. Her service desk thought she meant AI chatbots. Meanwhile, her security architect heard “autonomous decision-making.” They were all right and all talking past each other. This is the agent confusion problem.

The Atlassian Rovo MCP Server now supports Bitbucket Cloud

The Atlassian Rovo Model Context Protocol (MCP) Server now supports Bitbucket Cloud. AI clients like Claude, ChatGPT, Cursor, and VS Code can now browse repositories, create commits, open pull requests, and check pipeline results, all through the same secure MCP connection that already works with Jira and Confluence.

Business metrics in Grafana Cloud: Get an AI assist to help securely analyze your data

For today's modern businesses, the data landscape demands security and flexibility. You need to connect your observability platform to rich, proprietary datasets that often reside in private networks without compromising security or managing complex network infrastructure. You may also face an extra layer of complexity in order to effectively query and visualize that data. Luckily, modern artificial intelligence tools have made these previously complicated processes much simpler.

The Hidden Warning Signs Before Hybrid IT Outages (And How AI Finds Them)

Hybrid IT environments are the reality for most organizations today. Unfortunately, they’re also one of the biggest reasons outages are now harder to prevent. Between on-prem infrastructure, cloud services, SaaS platforms, distributed networks, and modern applications, IT teams are managing an ecosystem of dependencies that changes constantly.

Responsible AI Writing: How Teams Use AI Tools Without Losing Authenticity

AI writing tools have made content creation significantly faster. Drafts that once required hours can now be produced in minutes, helping teams scale documentation, communication, and content production. However, speed alone does not guarantee quality. As AI-generated content becomes more common, many teams are finding that raw output often lacks clarity, consistency, or the tone required for professional use.

What Native Audio in AI Video Actually Means for the Future of Content

In 2026, the arrival of native audio has officially ended the silent film era of generative AI. For years, creators had to hunt for sound effects and manually align voiceovers in post-production, but the new standard is simultaneous generation. Native audio means the AI no longer simply adds sound to a finished clip. Instead, models like Seedance 2.0 on the Higgsfield platform generate audio and video together in a single mathematical pass. This shift from fragmented tools to a unified multimodal architecture is fundamentally changing how content is produced.

Why Autonomous AI Agents Can't Run on SaaS Infrastructure

The era of the “copilot” is ending. We are moving rapidly toward the era of the autonomous software factory, where autonomous agents don’t just autocomplete our code—they investigate, plan, test, and merge entire features while we sleep. But this shift has exposed a critical flaw in how we consume AI. For the past decade, the default motion for enterprise software has been SaaS. It’s easy, frictionless, and managed by someone else.

Deterministic by Design: How Harness Grounds AI Agents in Structured Data | Harness Blog

When AI agents operate across a multi-module platform like Harness (from CI/CD to DevSecOps to FinOps), the number one goal is to give you answers that are correct, consistent, and grounded in real data. Getting there requires a deliberate architectural choice: when a question can be answered from structured platform data, the agent should use a schema-driven Knowledge Graph rather than raw API calls via MCP. The principle is simple: if the data is modeled, retrieval should be deterministic.
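The "if the data is modeled, retrieval should be deterministic" principle can be sketched as a small router: questions that map onto a known schema go to a structured lookup, and everything else goes to a fallback. This is an illustrative assumption, not Harness's implementation; `SCHEMA`, `RECORDS`, `is_modeled`, `structured_lookup`, and `answer` are hypothetical names, and the in-memory dictionary stands in for a real knowledge graph.

```python
# Hypothetical schema: entity type -> fields that are modeled in structured data.
SCHEMA = {
    "pipeline": {"name", "status", "last_run"},
    "service": {"name", "owner", "environment"},
}

# Hypothetical in-memory stand-in for the structured store.
RECORDS = {
    "pipeline": [
        {"name": "build-web", "status": "passed", "last_run": "2026-04-02"},
        {"name": "build-api", "status": "failed", "last_run": "2026-04-01"},
    ],
    "service": [
        {"name": "checkout", "owner": "payments", "environment": "prod"},
    ],
}


def is_modeled(entity, filters):
    """True when the entity and every filter field exist in the schema."""
    return entity in SCHEMA and set(filters) <= SCHEMA[entity]


def structured_lookup(entity, filters):
    """Deterministic retrieval: same inputs always give the same ordered rows."""
    rows = [r for r in RECORDS[entity] if all(r[k] == v for k, v in filters.items())]
    return sorted(rows, key=lambda r: r["name"])


def answer(entity, filters, fallback):
    """Route to the structured path when possible, otherwise to `fallback`."""
    if is_modeled(entity, filters):
        return ("structured", structured_lookup(entity, filters))
    return ("fallback", fallback(entity, filters))


print(answer("pipeline", {"status": "failed"}, fallback=lambda e, f: []))
print(answer("ticket", {"priority": "high"}, fallback=lambda e, f: ["raw API result"]))
```

Returning the route alongside the result makes it easy to log which answers were grounded in modeled data and which were not.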

Kosli and Adaptavist Partner to Automate Governance for AI driven Software Delivery

Today, Kosli and Adaptavist announce a strategic partnership to help regulated enterprises automate governance for AI driven software delivery - making it automated, continuous, and evidence-driven rather than a manual checkpoint that sits apart from DevOps and CI/CD. Adaptavist brings deep enterprise DevOps transformation expertise: assessment and strategy, DevSecOps integration, developer experience, and implementation across Atlassian, GitLab, and AWS.

AI agent observability: The developer's guide to agent monitoring

Most "agent observability best practices" content reads like a compliance checklist from 2019 with "AI" pasted over "microservices." Implement comprehensive logging. Establish evaluation metrics. Create governance frameworks. Not a single line of code. No mention of what happens when your agent silently picks the wrong tool on turn 3 and you need to figure out why.

Operating agentic AI with Amazon Bedrock AgentCore and Datadog LLM Observability: Lessons from NTT DATA

This guest blog post is by Tohn Furutani, SRE Engineer at NTT DATA. Over the past year, the conversation around generative AI has shifted from single-shot use cases—such as summarization, Q&A, and chat interfaces—to agentic AI systems that can make decisions based on context, plan multistep actions, invoke tools, and adapt as conditions change.

The Next Phase of Agentic AI

The Enterprise AI Survey conducted by Digitate in collaboration with Sapio Research states that the journey of enterprise automation and AI adoption has evolved significantly. The initial waves focused primarily on improving accuracy, efficiency, and reducing costs. Now, the next phase, Agentic AI, is transforming this shift from mere automation to dynamic collaboration.

Practical AI-Enabled Observability for Agents and LLMs

You’re told to “go build agents” without clear guidance on what that actually means, how to do it well, or how to know if it is working. You are not a data scientist. You are a software engineer. In this talk, Datadog AI product leader Shri Subramanian breaks down what changes when you move from building applications to building AI agents, and why familiar approaches like traditional testing and linear delivery fall short. We will explore how agent development shifts the focus from code alone to data, prompts, and evaluation, and why functional reliability matters just as much as operational reliability.

How to Catch AI Code Mistakes Before They Reach Production

AI can write code fast, but it makes mistakes humans often don't. In this session, Ole Lensmar, CTO of Testkube, breaks down the real quality risks of AI-generated code and how engineering teams can build guardrails before those bugs hit production. What you'll learn: common mistakes LLMs make (and which ones are unique to AI). Whether you're a developer leaning on AI to ship faster or a QA lead trying to keep up with the pace of AI-generated code, this talk gives you a practical framework for staying ahead of quality issues.

LLM Cost Monitoring with OpenTelemetry

Teams running LLM applications in production face a cost problem that traditional APM tools were never designed to solve. CPU and memory costs are relatively predictable — a web service processing 1,000 requests per second costs roughly the same week over week. LLM API costs are not. A single user session can cost $0.01 or $5 depending on prompt length, model choice, conversation history, and how many retries happen inside your chain.
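To make that variance concrete, here is a minimal sketch of per-session cost arithmetic. The per-token prices, model names, and the `LLMCall`, `call_cost`, and `session_cost` helpers are hypothetical illustrations; they are not any provider's actual rates and not part of the OpenTelemetry API.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; real rates vary by provider and model.
PRICES = {
    "small-model": {"input": 0.0005, "output": 0.0015},
    "large-model": {"input": 0.01, "output": 0.03},
}


@dataclass
class LLMCall:
    model: str
    input_tokens: int   # grows with prompt length and conversation history
    output_tokens: int
    attempts: int = 1   # 1 means the call succeeded without retries


def call_cost(call):
    """Cost of one logical call, counting every retry attempt."""
    price = PRICES[call.model]
    per_attempt = (
        call.input_tokens / 1000 * price["input"]
        + call.output_tokens / 1000 * price["output"]
    )
    return per_attempt * call.attempts


def session_cost(calls):
    """Total cost of all calls made during one user session."""
    return sum(call_cost(c) for c in calls)


# A short question answered once by a small model.
cheap_session = [LLMCall("small-model", input_tokens=2000, output_tokens=500)]

# Five long-context turns on a large model, each retried once.
expensive_session = [
    LLMCall("large-model", input_tokens=40000, output_tokens=2000, attempts=2)
    for _ in range(5)
]

print(f"cheap: ${session_cost(cheap_session):.5f}")
print(f"expensive: ${session_cost(expensive_session):.2f}")
```

Under these assumed prices, the two sessions differ by more than three orders of magnitude, which is why attributing token counts, model, and retries to each request matters more here than for CPU-bound services.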

How AI Is Powering the Next Era of IT Operations

AI is redefining the future of IT. In this Nexus Live 2025 keynote, ScienceLogic CEO and Founder Dave Link shares the vision behind Skylar AI, why the industry is shifting toward autonomous operations, and how organizations can move faster, smarter, and more proactively than ever before. In this session you’ll see.

IREX Enhances FireTrack AI Module for Faster, More Accurate Fire Detection

WASHINGTON, DC - IREX, a global developer of ethical AI and intelligent video analytics, has announced a significant upgrade to its FireTrack fire and smoke detection module, expanding its capabilities across a wide range of environments. As outlined in an article on TNW, the updated solution is designed to work seamlessly with existing camera infrastructure, eliminating the need for additional hardware while extending its use to critical infrastructure, public institutions, residential and commercial properties, and natural environments such as parks and forests.

From AI Idea to Real System: What Changes Along the Way

Most companies don't struggle with the idea of AI. They struggle with what to do with it. The potential is clear: automation, predictions, better decisions. But translating that into something useful inside a business is where things become less obvious. That's usually when AI/ML consulting services start to make sense.

AI for GitOps: Tame your Argo Sprawl | Harness Blog

Innovation is moving faster than ever, but software delivery has become the ultimate chokepoint. While AI coding assistants have flooded our repositories with an unprecedented volume of code, the teams responsible for actually delivering that code, our Platform and DevOps engineers, are often left drowning in manual toil. If you’re managing Argo CD at an enterprise scale, you’re painfully familiar with the "Day 2" reality.

How to Prevent and Resolve Incidents Using Model Context Protocol (MCP)

The rapid pace of modern software development, fueled by AI-driven coding and accelerated deployment cycles, has resurfaced a challenge that many development teams already struggled with: the speed of incident response must now match the speed of change. Every day, teams ship code faster than ever, which inevitably increases the risk of a new issue making it to production. The traditional approach—where engineers waste time jumping between disconnected tools—is no longer sustainable.

How Will We Hold AI Accountable For Risky Investments?

The word “Trillion” never fails to set the tech world on fire. Foundation Capital’s Jaya Gupta and Ashu Garg are two of the most recent firestarters. Late in December, they co-wrote “AI’s trillion-dollar opportunity: Context graphs,” outlining how AI will transition from organizational knowledge to organizational comprehension.

AI Working for You: MCP, Canvas, and Agentic Workflows - Part 2

In our previous post in our series on observability for the agent era, we looked at how Honeycomb provides unique visibility into LLMs operating in your production environment. Now, let’s flip it around and explore how Honeycomb provides observability insights uniquely suited to helping your AI agents rapidly diagnose and fix production issues, and build production feedback into the next round of development.

The Fundamentals: Fast, Deep, and Ready for What Comes Next - Part 3

The previous two posts in this series have looked at some of the use cases Honeycomb customers are implementing to observe LLMs in production and power agentic observability workflows. In this third and final post, we’ll take it back to basics and look at how the fundamental capabilities and infrastructure of Honeycomb provide the comprehensive data and fast performance that makes these use cases work at production scale. AI capabilities built on a weak observability foundation fall apart fast.

AI Demos Are Easy. Enterprise AI Is Not.

Why 90% of AI prototypes never make it to production, and what to do about it. Every week, someone on my team shows me a demo that looks incredible. An agent that writes deployment pipelines. A chatbot that triages incidents. A copilot that generates test cases from Jira tickets. The demo takes 20 minutes. The audience claps. Everyone leaves convinced we're six weeks from shipping it. We're not.

From Data to Dollars: How AI-Driven Hyper-Personalization Is Reshaping Retail Revenue

Every retailer knows that personalization drives revenue. The evidence has been consistent for years: personalized experiences convert better, retain customers longer, and generate higher average order values. What has changed is the scale and sophistication at which personalization is now possible - and the gap it creates between brands that embrace AI-driven approaches and those still relying on manual rules and static segments.

Debugging the black box: why LLM hallucinations require production-state branching

The most frustrating sentence in modern engineering is no longer "it works on my machine." It is: "It worked in the playground." When an LLM-powered feature, such as a RAG-based search, an autonomous agent, or a dynamic prompt engine, fails in production, it doesn’t throw a standard stack trace. It returns "slop," hallucinations, or silent retrieval failures. Standard debugging workflows fail during triage because LLM hallucinations cannot be reproduced using static mocks or clean seed data.

KubeCon + CloudNativeCon EU 2026: What We Learned About AI, Observability, and Fast Feedback Loops

Honeycomb was excited to attend KubeCon + CloudNativeCon Europe, where one theme stood out across sessions: as AI reshapes how software is built and run, teams are being pushed to rethink how they understand their systems. Without strong observability and feedback loops, AI can accelerate confusion, misalignment, and operational risk.

The Business Case for AI-Driven Observability in Network Operations

Modern network operations generate an extraordinary amount of telemetry. Metrics, logs, events, topology data, cloud signals, and service context all contribute to a richer picture of system behavior. As environments expand across cloud, data center, edge, and SaaS, the opportunity for operations teams is clear: when that telemetry is unified and understood in context, it becomes a powerful source of resilience, efficiency, and business insight.

When we say "Observability AI Reckoning," what are we actually talking about?

We’ve spent the last decade collecting more telemetry. Now AI is analyzing it. Here’s the catch: AI needs the full dependency chain to reason correctly. If it sees spans but not storage contention… Services but not Kubernetes scheduling… Frontend metrics but not downstream providers… It will confidently optimize the wrong thing. AI doesn’t lower the need for observability. It raises the standard.

90% AI Adoption. Still Failing. DORA Explains Why.

AI adoption is nearly universal. So why are most teams still struggling? In this session from GitKon, Nathen Harvey, head of DORA at Google Cloud, shares findings from the 2025 DORA State of AI-Assisted Software Development report, drawing on data from nearly 5,000 developers worldwide. The answer isn't more AI. It's what surrounds it.

Understand session replays faster with AI summaries and smart chapters

Datadog Session Replay gives teams a video-like view of what real users experienced in their applications. Engineers rely on replays to connect errors and slowdowns to actual user behavior, while product managers use them to understand friction and improve critical flows. But finding the right replay and the right moment often means manually scanning long sessions without knowing whether they contain relevant signals.

How AI-Driven Automation Solves Patch Management Silos

"We see 10,000 critical vulnerabilities!" "We patched everything last week!" This conversation happens in enterprise IT departments every single day. Security teams present dashboards filled with red alerts. IT teams show deployment reports at 98% success. Both teams are looking at real data. Both are absolutely correct. And both are totally blind to what's actually happening across the endpoint environment. This isn't a people problem — your teams aren't incompetent.

AI Didn't Kill the SDLC. It Made It Harder to See

Whilst AI has compressed the visible stages of software delivery, the disciplines of requirements, validation, review and release have not disappeared. They have been pushed into automation, runtime and governance. The real risk is not that the lifecycle is dead, but that organisations start acting as if accountability died with it.

From Reactive to Proactive: AI-Driven Automation for Shopify Infrastructure Monitoring

Operations teams manage Shopify infrastructure with their eyes half-open most days. You're monitoring system health across multiple layers, responding to alerts when they fire, and hoping you catch problems before customers notice. The whole setup is reactive by design. Something breaks. You get paged. You investigate. You fix it. But here's what most ops leaders don't realize: your Shopify operation generates enough signals to predict problems hours (sometimes days) before they actually occur. The data's there. You're just not analyzing it at the right scale or speed.

How Implementing Medical AI Scribe Transforms Patient Care

Medical AI scribes are dramatically changing the landscape of healthcare delivery by reducing the administrative burden on clinicians and improving patient interactions. Recent studies suggest that AI scribes can decrease the time physicians spend on documentation by up to 50%, allowing more time for patient care. This technology not only enhances the quality of interactions between doctors and patients but also improves diagnostic accuracy. Below, we explore how AI-powered scribes are playing a pivotal role in modern healthcare environments.

Reality Bytes Is BACK: ft. Marc Petter on the Future of IT Jobs

Reality Bytes is back—and this time, we’re diving straight into the future of IT jobs. Tom, Oriana, and Dina are joined by Marc Petter (Senior Product Manager, Nexthink) to explore how AI is reshaping roles, workflows, and career paths. From automating repetitive tasks to the rise of AI agents handling entire processes, the conversation tackles what’s changing, what still requires a human touch, and how IT professionals can stay ahead. They unpack the difference between what can vs. should be automated, and what the new IT career ladder might look like in an AI-driven world.

What Metrics to Monitor in Your Vibe Coded App

These days, using a tool such as Cursor, GitHub Copilot, Zed, or Claude makes it easier than ever to develop and deploy applications. You express your requirements, receive the completed project back as output, and there you have it! You now have an application that is in production and functioning. However, the surprise comes after the app has been deployed. When your app breaks or behaves abnormally, it may not be immediately obvious what is wrong or how to fix it.

AI Is an Amplifier, Not a Shortcut

There’s a version of the AI story that engineering leaders want to hear. It goes like this: adopt AI coding tools, watch output multiply, ship faster, do more with less. Clean. Simple. Boardroom-ready. The data tells a different story. Not a worse one. Just a more honest one. We recently analyzed 2,172 developer-weeks of real coding activity across teams using GitHub Copilot, Cursor, and Claude Code. The headline numbers are striking: power users show 4-14x higher activity than non-users.

Defeating Context Rot: Mastering the Flow of AI Sessions

In Part 1, we argued that most dev teams start in the wrong place. They obsess over prompts, when the real problem is structural: agents are dropped into repositories that were never designed for them. The solution was to make the repository itself agent-native through a standardized instruction layer like AGENTS.md. But even after you fix the environment, something still breaks. The agent starts strong.

Secure and Compliant DevOps in an AI-Enabled World

Is Your DevOps Strategy Ready for the AI Era? AI is accelerating modern software delivery—but it’s also raising the stakes for security, compliance, and auditability. As AI-driven change increases, many organizations are discovering that incomplete DevOps practices are creating new risk. Based on insights from 800+ global IT professionals, the 2026 State of DevOps Report reveals why vendor‑backed, enterprise‑grade DevOps platforms are becoming critical for managing AI‑driven risk and meeting evolving regulatory demands.

Agno Monitoring & Observability with OpenTelemetry and SigNoz

Learn how to implement end-to-end monitoring and observability for Agno-based AI systems using OpenTelemetry and SigNoz. In this video, we walk through instrumenting your Agno workflows, collecting traces, metrics, and logs, and visualizing everything in SigNoz to gain real-time visibility into performance, failures, and bottlenecks. You'll see how to move from basic logging to production-grade observability—so you can debug faster, optimize latency, and confidently run AI systems at scale.

#055 - From Enterprise Java to Kubernetes and AI-Driven Infrastructure with Dan Hicks (Boomi)

Dan breaks down the fundamental similarities and stark differences between application development and platform engineering. He shares the unexpected hurdles he faced during his transition, from complex networking and CoreDNS latency to the harsh realities exposed by chaos testing in cloud environments.