A Notification List Is Not a Team

In the previous post, we looked at how alert noise is rarely accidental. It’s usually the result of sensible decisions layered over time, until responsibility becomes diffuse and response slows. One of the most persistent assumptions behind this pattern is simple. If enough people are notified, someone will take responsibility. After more than fourteen years of working with engineering teams of every size and shape, we’ve seen this assumption fail repeatedly.

Alert Noise Isn't an Accident - It's a Design Decision

In a previous post, The Incident Checklist: Reducing Cognitive Load When It Matters Most, we explored how incidents stop being purely technical problems and become human ones. These are moments where decision-making under pressure and cognitive load matter more than perfect root cause analysis. When systems don’t support people clearly in those moments, teams compensate. They add process. They add people. They add noise. Alerting is one of the most visible places where this shows up.

The Incident Checklist: Reducing Cognitive Load When It Matters Most

In the previous post, we looked at what happens after detection, when incidents stop being purely technical problems and become human ones, with cognitive load as the real constraint. This post assumes that context. The question here is simpler and more practical. What actually helps teams think clearly and act well once things are already going wrong? One answer, used quietly but consistently by high-performing teams, is the checklist.

When Things Go Wrong, Systems Should Help Humans - Not Fight Them

In the previous post, we explored how AI accelerates delivery and compresses the time between change and user impact. As velocity increases, knowing that something has gone wrong before users do becomes a critical capability. But detection is only the beginning. Once alerts fire and dashboards light up, humans still have to interpret what’s happening, make decisions under pressure, and act.

When AI Speeds Up Change, Knowing First Becomes the Constraint

In a recent post, I argued that AI doesn’t fix weak engineering processes; it amplifies them. Strong review practices, clear ownership, and solid fundamentals still matter just as much when code is AI-assisted as when it’s not. That post sparked a follow-up question in the comments that’s worth sitting with: With AI speeding things up, how do teams realise something’s gone wrong before users do? It’s the right question to ask next.

Make Your Engineering Processes Resilient. Not Your Opinions About AI

Why strong reviews, accountability, and monitoring matter more in an AI-assisted world

Artificial intelligence has become the latest fault line in software development. For some teams, it’s an obvious productivity multiplier. For others, it’s viewed with suspicion: a source of low-quality code, unreviewable pull requests, and latent production risk. One concern we hear frequently goes something like this: It’s an understandable fear, and also the wrong conclusion.

How to monitor IPFS assets with StatusCake

IPFS stands for “InterPlanetary File System,” and it’s built on the founding principle that the web should be decentralised, resilient, and content-addressable, allowing data to be stored and shared in a way that is not reliant on centralised servers. IPFS is considered a part of “Web3”. Use cases are varied, and some examples include: Decentralised Content Hosting: Hosting websites, blogs, or documents without relying on traditional web servers.

Google's outage on the UK's hottest day of the year

We’ve all heard the jokes about how we Brits can’t handle hot weather, but when the UK hit record highs in July this year, we have to admit that we really did struggle. None more so than our friends over at Google. Google is no stranger to the occasional outage and website downtime, having seen Google Maps go down in May this year. But this time, the outage was apparently due to the soaring temperatures we were experiencing.

Websites that have suffered downtime in July

You might have heard us say it before, but downtime really does happen to any website, anywhere. Website downtime doesn’t discriminate; it doesn’t matter if you’re a huge multi-billion dollar company or a start-up finding your feet in the online world. Downtime happens to the best of us. So to really drive this point home, we’ve put together the websites that have suffered downtime this June and how they dealt with the issue.