Operations | Monitoring | ITSM | DevOps | Cloud

Take Control of Cloud Costs with Proactive Budget Alerts

Proactive budget alerts turn cloud cost optimization into an everyday operational practice. If you are responsible for managing cloud infrastructure, you already know the pattern. Costs creep up quietly, and by the time anyone notices, it is the end of the month and you are explaining instead of preventing overruns. According to Flexera’s 2026 State of the Cloud Report, 85% of their respondents say managing cloud costs is their number one priority for the year.

Why Your PromQL Availability Query Returns Nothing When Services Are Healthy

Your SLI query shows 100% availability as No Data. Here's why PromQL returns empty results instead of zero — and the label-preserving fix. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

How Recurring Instability Turns into Clinical Trial Delays

In pharma, reliability becomes an operational priority because research and trial work depend on systems performing consistently across different teams, locations, and conditions. Much of that work sits inside scientific workflows, remote sessions, and compute-heavy environments where behaviour can shift with configuration or load. When that consistency starts to break down, teams keep moving, but time is lost in small increments across the day.

How it feels to run an incident with AI SRE

We've been building the broader incident.io platform for several years now, and one thing we've learned is that UX matters more here than almost anywhere else. When an incident fires, there's no room for poorly designed interfaces or fumbling through features you haven't touched in a while. The product has to be ergonomic: easy to pick up, easy to navigate, with the right things at your fingertips at exactly the right moment. We've put a lot of effort into this over the last 5 years.

What does using AI for post-mortems actually mean?

Everyone is using AI to help with post-mortems now. The pitch is obvious: post-mortems are time-consuming, the blank page is brutal, and AI is very good at producing structured, confident-sounding documents quickly. We're not here to push back on that. We've built AI into our own post-mortem experience, pulling your Slack thread, timeline, PRs, and custom fields together and giving your team a meaningful starting point in seconds. We think that's genuinely valuable, and the teams using it agree.

When agents orchestrate agents, who's watching?

You used to monitor services. Then you started monitoring AI calls inside services. Now your AI agent is spinning up other AI agents to complete tasks. Your old monitoring instincts need to evolve. This isn't hypothetical. Agentic architectures are already in production. Coding agents are calling search agents; orchestrators are spawning specialized sub-agents for retrieval, planning, and execution. Teams are shipping these systems faster than they're figuring out how to watch them.

The job is not to write code. It's to produce business value.

Most engineers can tell you exactly how many PRs they merged last quarter. Far fewer can tell you what any of it did for the business. The best engineering leaders can. They draw a straight line from their team's work to ARR: which reliability investment protected revenue, which migration unblocked a strategic customer, which operational improvement reduced churn. They lead with outcomes, not story points.