
Your Root Cause Analysis is Flawed by Design

There’s a nagging feeling of déjà vu that haunts every network operations leader. You invest significant time and resources to resolve a major performance issue. Your best engineers isolate a culprit—a misbehaving load balancer, perhaps—and after a frantic effort, service is restored. You close the ticket, confident the problem is solved. Then, two weeks later, it’s back.

Whose Fault Is It When the Cloud Fails? Does It Matter?

On Monday, October 20th, a significant portion of the digital services we use every day became inaccessible. For hours, banking, communication, and entertainment applications were unavailable. The root cause was later identified as a major outage within Amazon Web Services (AWS), the infrastructure that powers a vast number of online services. The initial response for any business affected by such an event is a frantic effort to diagnose the problem. Is it our application? Is our network down?

Product Update - Turn Off Alerts, Use Microsoft Teams, and Custom Domains

Over the last few months, IncidentHub has added several new features to make it easier to fine-tune your alerts. IncidentHub now also integrates with Microsoft Teams and supports custom domains for your public status pages. Let's take a comprehensive look at what's new.

Jira Service Management (JSM) Review for Alerting (2025)

Atlassian is shutting down OpsGenie. New sales stopped on June 4, 2025, and the platform will be completely offline by April 5, 2027. As an OpsGenie user, you now face a critical decision: Migrate to Jira Service Management (JSM), Atlassian’s recommended path, or choose a different solution. And if you’re not sure JSM is the right fit for your team’s alerting needs, this review will help you decide. I signed up for JSM and put it through real-world testing.

Sliding Through Log-Time Space

This post kicks off a new series written by the Graylog Development Team. In these updates, we’ll highlight the features and fixes that make daily work in Graylog smoother. We want to show the work we care so much about and present the challenges we faced and overcame. Today, we’re starting with one of those minor but functional enhancements: Graylog time-range stepping.

Azure Cost Optimization: Best Practices for Cloud Solution Providers

In this episode, we explore practical Azure cost management strategies tailored for Cloud Solution Providers (CSPs). The conversation dives into cost visibility, optimization techniques, and billing transparency, helping CSPs improve margins and deliver more value to their customers. Featuring experts from West Coast, a leading CSP, including James Reed (Azure Sales Manager) and Mitchell G. (Azure Sales Specialist), along with Mike Stevenson, the discussion highlights real-world insights from the partner ecosystem.

Find and Fix Fastify Slowdowns with AppSignal for Node.js

In part one of this series, we set up basic performance monitoring for our Fastify application using AppSignal and explored key performance indicators. Now that we have our monitoring foundation in place, it's time to leverage these insights to actively improve application performance. You'll learn how to detect performance regressions, find optimization opportunities, and implement custom instrumentation with OpenTelemetry.

CEO Diaries: Not All AI Talent Is Alike

If Meta’s (now halted) nine-figure AI talent poaching scheme was any indication, the AI talent market is pretty frothy. The number of AI-related job postings has roughly tripled since 2019, and the average salary has more than doubled (Bain). The race is on for companies to find the fastest, most sustainable routes to AI-driven business value; all companies, but especially software companies, are hotly pursuing racers. But despite what Zuckerberg & Co.