Monthly Archive

Error Budget in SRE: The Complete Guide (2026)

May 20, 2026 By Nuno Tomas In isDown

An error budget is the acceptable amount of unreliability permitted by your SLO over a defined time window. It is not a target. It is not a stretch goal. It is a hard ceiling that, when breached, should trigger a pre-agreed organizational response: feature freezes, postmortems, or infrastructure investment. The formula is blunt:

Error Budget = 1 - SLO Target
Error Budget (time) = (1 - SLO Target) × Window Duration

For a 30-day window: That last number should make you uncomfortable.
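To make the time-based formula concrete, here is a minimal Python sketch that applies it to a 30-day window. The function names and the three example SLO targets are illustrative assumptions, not values taken from the post.

```python
def error_budget_minutes(slo_target: float, window_days: float = 30.0) -> float:
    """Allowed unreliability in minutes: (1 - SLO target) x window duration."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return (1.0 - slo_target) * window_minutes


# Hypothetical example targets, chosen only for illustration.
EXAMPLE_TARGETS = (0.99, 0.999, 0.9999)


def budget_table(targets=EXAMPLE_TARGETS, window_days: float = 30.0) -> dict:
    """Map each SLO target to its error budget in minutes, rounded to 2 places."""
    return {t: round(error_budget_minutes(t, window_days), 2) for t in targets}


if __name__ == "__main__":
    for target, minutes in budget_table().items():
        print(f"{target:.2%} SLO -> {minutes:.2f} minutes of budget per 30 days")
```

Under these assumptions, a 99.9% target leaves roughly 43.2 minutes of budget in 30 days, and each additional nine cuts that by a factor of ten.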

Cloud Outage History: Six Years of Recurring Failures

May 13, 2026 By Nuno Tomas In isDown

Cloud infrastructure has never been more reliable in theory. In practice, the last six years of cloud outage history have delivered some of the most disruptive incidents on record. Not because cloud providers got worse, but because the systems built on top of them got larger, more interconnected, and more brittle in ways that don't show up until everything breaks at once.

How to Reduce MTTR When Third-Party Services Go Down

May 7, 2026 By Nuno Tomas In isDown

Most MTTR guides assume the problem is in your infra. For modern apps, it's often not: it's Stripe, AWS, Auth0, or another vendor. Vendor status pages lie by omission. The lag between impact and acknowledgment can stretch to an hour or more. You need two runbooks, proactive vendor monitoring, and graceful degradation baked in before the 3 AM page hits. This post shows you exactly how.

April 2026: IsDown Users Saved 16.5 Hours with Early Outage Detection

May 3, 2026 By Nuno Tomas In isDown

In April 2026, IsDown's early detection system gave users a 3.6-hour head start on a major outage — plenty of time to implement workarounds before the vendor even acknowledged the problem. Across 45 early detections, our users saved a collective 16.5 hours by knowing about outages an average of 22 minutes before official status pages were updated.
