
Cortex catalog data now flows into Rootly

Incident response is a context problem. The first minutes of any incident are spent reconstructing what the affected service is, what it depends on, and who owns it. That reconstruction happens during the worst possible window. The Cortex catalog already holds this data: services, teams, domains, and the relationships between them, maintained by the engineers who run those systems.

What is an AI software factory?

Ask a software engineer what they do and the answer, for years, has been some version of "I write code." That assumption is unwinding fast. AI agents can now write code, review pull requests, run tests, and ship to production, and they're taking on a fast-growing share of that work. As agents absorb more of the execution, the human role shifts.

How to run an operational excellence review for software engineering

Most engineering organizations already run something they call an operational review. It usually looks like a cousin of the quarterly business review: a deck assembled every few months, walked through team by team, anchored on whatever incidents happened to land in the previous quarter. By the time leadership sees the data, the systems it describes have moved on and the next set of risks is already accumulating in the gap.

Measuring engineering organizations in the age of AI

Engineering leadership is in the middle of a real transition, and most of the leaders I talk to know it. AI has reshaped how software gets built quickly enough that the operating models many of us spent a decade refining no longer fit cleanly, and there is a great deal of serious work happening across the industry to figure out how these models should evolve. The teams I find most impressive right now are the ones treating their operating model as an open question rather than a settled one.

How to land on the right side of the AI divide

AI changed how code gets written before it changed how code gets operated. Generation accelerated; the downstream controls that turn that output into reliable, secure software at a reasonable cost did not keep pace. The result is elevated risk, distributed unevenly across engineering organizations. A recent survey explains why the distribution is so uneven.

Should platform, SRE, and security merge into one function?

Platform, SRE, and security are three distinct functions in modern engineering orgs, each shaped by a different problem. SRE was the operations function's answer to scale: how to keep systems reliable when the systems get big. Platform answered a different problem: how to let developers ship without becoming infrastructure experts. Security drew the line on what could safely reach production.

Agent governance starts with the service catalog you already run

Last month, an AI agent running inside Cursor wiped PocketOS's entire production database, including its backups, in roughly nine seconds. The agent found an API token in an unrelated file, originally created for managing custom domains, and used that token to execute the deletion. The backups sat inside the same blast radius as the database the agent was operating against. Nine months earlier, a Replit AI agent had done the same thing to a SaaStr database during a designated code freeze.

The audit-ready engineering org

Two weeks before the audit, the Slack messages start. "Get me a screenshot of this." "Can you screenshot the CI/CD logs?" "Can you add the artifact names that were deployed to production and when, and when the incident happened?" Senior engineers stop shipping. A spreadsheet appears. The product roadmap goes on hold while four people chase down ownership data and evidence that should have existed all along. This fire drill is the symptom of an operating model problem.

Your platform team's name is holding it back

When you stood up your platform team, you probably spent more time on the org chart than on what to name it. Reporting lines, headcount, scope of the first charter: those felt like the real decisions. The name was administrative, something to put in Slack and the directory and forget about. Yet the name was the most consequential decision you made. The name you give a platform team isn't just branding. It's a scope declaration.

Context Engineering: How to Manage AI Context at Scale

Context engineering is the practice of managing the information an AI model sees (documents, tool outputs, memory, and structured metadata about the systems it reasons over) so it can make accurate decisions inside a real engineering organization. Most engineering teams have access to the same AI coding agents: Claude, GPT, Gemini, the major variants everyone is shipping. The model is no longer the differentiator.