
uptime

What Does 99.9% Uptime Mean?

An old adage about choosing a hosting provider says that everyone promises 99.9% uptime, so you need to test a site’s uptime yourself to get the real picture. Or you can scour the forums for reviews and judge for yourself how reliable a provider is. That works too. What the saying is really getting at is the need for some kind of indicator that uptime has not fallen below expectations, because you can’t just trust the word of the provider when your business is at stake.
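To put a number on that promise, here is a minimal sketch that converts an uptime percentage into the downtime it permits. The function name and the choice of periods (a day, a 30-day month, a 365-day year) are illustrative assumptions, not terms from any provider’s SLA.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted in a period at a given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    # The permitted downtime is simply the share of the period not covered by the promise.
    return period_minutes * (100 - uptime_percent) / 100


# Hypothetical periods chosen for illustration only.
for label, days in [("day", 1), ("30-day month", 30), ("365-day year", 365)]:
    minutes = allowed_downtime_minutes(99.9, days)
    print(f"99.9% over a {label}: {minutes:.1f} minutes")
```

At 99.9%, a 30-day month leaves roughly 43 minutes of permitted downtime, which is why independent measurement matters: a single unnoticed outage can use up the whole allowance.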

March 2020 Outage Report

It’s pretty safe to say that March was the month when everything changed for most of us. By now, enough has been said on coronavirus and we need not add to the pile. Our concern remains continuous uptime, and reporting on outages as teachable moments. During this time of heightened tensions, let’s take a few moments to do some postmortem work and see what we can learn from March’s outages.

The Uptime.com Report for 2019

Unplanned downtime can drive significant losses in the form of unrealized revenue. Teams may be caught off guard, or may face an outage outside their control, extending downtime hours unnecessarily. Without automated monitoring and alerting, teams face undetected outages that silently threaten SLA fulfillment. The recommendations in this report are best used as a guide on what trends may drive Site Reliability Engineering in the near term.

IT and DevOps Resources for COVID-19

We’re all wrestling with less-than-ideal circumstances during the COVID-19 pandemic. Whether you’re sheltering in place or simply practicing social distancing, it’s safe to say we’re all adjusting to a temporary new normal. One commonality is the need for connectivity. If infrastructure fails, business will screech to a halt and we will find ourselves in a new kind of mess altogether.

What Makes SSL Fail, and What Can SREs Do About It?

TLS (and the previously used SSL) protocols make the web go round. They are fundamental when establishing a link between two computers, creating a very special mathematical relationship signified by the all-encompassing gesture of friendship: the handshake. So fundamental, in fact, that we probably take them for granted when we shouldn’t. The user relies on TLS encryption every day to protect data and the integrity of a session.

February 2020 Downtime Report

February kicked 2020 off with a terrifying glimpse into what happens when the Internet of Things stops Internetting things. If we consider our central question this year of uptime in the age of always-connected, then we start to see the impact of hidden failures. All the stuff we don’t know we know impacts the end-user. Someone forgets to renew a TLS certificate, and half the business world can’t collaborate. Someone else flubs an update?

How to Improve Downtime Response: Error Budgeting and Unplanned Downtime

Every one of us reading this blog has seen a fire spring up and quietly walked away from the impending chaos. And every one of us has managed to live this long because we understand when to react to a fire. A real fire affects our Service Level Objectives (SLO), and affects the user base. You need to figure out where it is, what started it, and what your team will do about it, and you need to do that now.
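One rough way to tell a real fire from a small one is to ask how much of the error budget it burns. The sketch below assumes a request-based SLO, which is only one of several ways to define one; the function name and the sample figures are hypothetical.

```python
def error_budget_remaining(slo_percent: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once it is overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates over the window.
    allowed_failures = total_requests * (100 - slo_percent) / 100
    if allowed_failures == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1 - failed_requests / allowed_failures


# Hypothetical window: one million requests against a 99.9% SLO.
remaining = error_budget_remaining(99.9, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.0%}")
```

A result near zero, or below it, suggests the incident is affecting the SLO and deserves an immediate response; a small dent in the budget may be able to wait.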

Why Your Status Page Matters and How to Use It

When an outage hits your service, everybody starts talking. Your engineers are talking about what caused the problem, and how to fix it; your management is asking about when it’ll be fixed; and your customers are telling the world that they’re not happy. But there’s an even more important conversation you should be having: communicating with your users about the issue.

January 2020 Outage Report

Welcome to 2020, where Google Drive can fail for some of you but not others, you can’t access your passwords, and you can’t withdraw cash on vacation. This stranded-on-a-desert-isle dream was reality in the month of January, which saw drama in the financial services and internet infrastructure sectors. January’s downtime reinforces just how connected we have become, and how reliant we are on infrastructure that can seemingly fail on a whim.

Transaction Monitoring | Upgrades and Use Cases in 2020

Synthetic monitoring takes care of all of the small interactions on our website that QA can’t catch. If you’re building an application for the web, a transaction check is an integral part of proactive downtime resolution. What we call transaction monitoring, or a transaction check, is a set of instructions that a probe server follows.
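To make "a set of instructions that a probe server follows" concrete, here is a toy sketch of the idea. It is not Uptime.com’s step format or API: the step names (`goto`, `fill`, `click`, `expect_text`), the `FakePage` class, and the login flow are all hypothetical, and the page is simulated in memory where a real probe would drive an actual browser.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    """In-memory stand-in for a browser session (hypothetical, for illustration)."""
    url: str = ""
    fields: dict = field(default_factory=dict)
    text: str = ""

    def goto(self, url):
        self.url = url
        self.text = "Sign in"

    def fill(self, name, value):
        self.fields[name] = value

    def click(self, name):
        # The simulated login succeeds only if a user name was filled in first.
        if name == "submit" and self.fields.get("user"):
            self.text = f"Welcome, {self.fields['user']}"


def run_check(page, steps):
    """Run steps in order; return (True, None) or (False, index_of_failed_step)."""
    for index, (action, *args) in enumerate(steps):
        if action == "goto":
            page.goto(*args)
        elif action == "fill":
            page.fill(*args)
        elif action == "click":
            page.click(*args)
        elif action == "expect_text":
            if args[0] not in page.text:
                return False, index
        else:
            raise ValueError(f"unknown step: {action}")
    return True, None


# A hypothetical login transaction expressed as an ordered list of instructions.
login_check = [
    ("goto", "https://example.com/login"),
    ("fill", "user", "probe"),
    ("click", "submit"),
    ("expect_text", "Welcome"),
]
print(run_check(FakePage(), login_check))
```

Reporting the index of the failed step is a design choice worth keeping in a real check: it tells you which interaction broke, not just that the transaction failed.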