Operations | Monitoring | ITSM | DevOps | Cloud

Top tips: When "sounds right" isn't right

Top Tips is a weekly column where we highlight what’s trending in the tech world today and list ways to explore these trends. This week, we’re looking at why convincing AI answers can still be wrong and how to catch them before they slip through. AI doesn’t fail the way it used to. It doesn’t give obviously wrong answers. It gives answers that are just right enough to trust. And that’s exactly why we stop questioning it. It fits into our workflow so easily.

Improved Microsoft 365 private status integration

Keeping track of your Microsoft 365 services just got easier. We’ve rolled out an update to the Microsoft 365 integration that removes manual setup and improves visibility. All services in your account can now automatically appear as components, so you can monitor them right away.

Reports just got smarter

We’ve upgraded the Reports page in StatusGator to give you more insight directly inside the StatusGator dashboard. Previously, reporting was limited to exports you could use to calculate your own uptime percentages and trends. Now, in addition to exported reports, you can view key reports and metrics without needing to download anything. We’ve also added a one-click download of the most commonly requested report: Uptime percentage by monitor.

Replay Real Customer API Sessions as Datadog Synthetics Tests

A customer pings support: “I tried to check out twice this morning and got a 500 each time, but it works fine for everyone else.” The session ID is in the email. You have full request/response capture in your environment, you have Datadog Synthetics already running browser checks against the same flow, and you still spend the next two hours grepping logs because none of those tools let you say “show me just this user’s requests, in order, and re-run them.”

Coralogix and Atlassian: Full-Stack Observability Inside the Incident Workflow

Incident response has a well-known efficiency problem. The tools teams use to detect and investigate issues are often disconnected from the tools they use to manage and resolve them. Engineers spend a significant portion of each incident switching between platforms, assembling context that should already be at hand. Even when the data is available, correlating signals across user, app, infrastructure, and security events to pinpoint a root cause remains manual and slow.

GitHub Outages 2025 - 2026: Reliability Analysis and Outage History

Hashicorp's co-founder Mitchell Hashimoto decided to pull out his Ghostty project from GitHub in April 2026 due to GitHub's reliability issues. He did this after 18 years of using GitHub, saying that GitHub "is no longer a place for serious work". GitHub has experienced a significant decline in reliability over the past 6 months, and Hashimoto is not alone in expressing this sentiment.

Rightsizing Nightmares: When Your Cloud Cost Tool Degrades Performance

This is what production teams see happening. A vertical pod autoscaler recommendation gets applied automatically. Resource requests come down a notch across a namespace. The cost dashboard registers a small cost savings win. A few minutes later, health checks start failing. Pods enter crash loops.

Your Team is Using Claude Code. Do You Know What It's Costing You?

The first two weeks of Claude Code are exciting. The third week is when you realize you don’t have visibility into what it’s doing or what it’s costing you. You would not run a production service without metrics, logs, and dashboards or deploy an API without knowing its latency, error rate, or cost per request.

Agentic ITOps is here. Here's what early movers are doing.

We recently brought together IT operations leaders from across financial services, healthcare, airlines, media, and other industries for BigPanda 26, our annual customer event. The theme that emerged above all others during the event’s conversations is that our industry is no longer debating whether AI belongs in ITOps. The debate now is about how quickly it can be implemented, how to measure it, and who’s accountable when it acts. Here are some key learnings from BigPanda 26.