Operations | Monitoring | ITSM | DevOps | Cloud

Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

The Cost of IT Downtime: An Overview

As the adoption of cloud computing continues to encourage innovation across industries, high-performing and resilient systems have become a necessity in order to keep pace with the competition and meet internal/external SLAs (service level agreements). In terms of customer expectations, a minute of downtime can mean thousands of dollars in lost opportunity and a soiled customer relationship. So what exactly is downtime?

Having On-call Nightmares? Runbooks can Help you Wake Up.

You aren't sure how long you've been here, but the view outside the window sure is soothing. Before you can fully take in your surroundings, a siren rips you back into the conscious world. Slowly, you begin to piece together that you exist, and you are on call. The ringing, much louder now, pierces through your skull as you begin to open your bleary eyes. You turn over your pillow, grab your phone, and click through the PagerDuty notification.

Using Remote Actions to Create ServiceNow Incidents

Recently we have received a lot of requests for Enterprise Alert to not only alert on critical situations but to also take a proactive approach to initiate, record and track those situations through ITSM tools such as ServiceNow and BMC Remedy. This post will center around what happens when critical systems fail and tickets are not being created in ServiceNow due to a break in the workflow.

Key Learnings from the Facebook Status Page

Yesterday April 8th 2021 at around 22:00 UTC, Facebook experienced a major outage where Facebook, Messenger, WhatsApp web and Instagram were down, lasting for as much as 3 hours. This was reported at Facebook’s status page, which was a good example of how to communicate and incident.

Monthly Moo Update | March 2021

Here we are a full quarter into 2021, a year that took off in a huge way for us, and the momentum continues to grow strong. March was a monumental month, and now it’s a wrap. We released significant updates across the board in almost all areas of Moogsoft, including pushing innovation to newfound levels when it comes to the ease of integrating your metric and event data.

Reduce Toil with Better Alerting Systems

If not tackled early, increasing toil can affect the morale and productivity of your SRE team. In this blog we look at some of the ways you can counter toil with the help of better alerting systems in place. Are you an SRE or On-call engineer struggling to manage toil? Toil is any repetitive or monotonous activity that can lead to frustration within an incident management team. Also at the business level, toil doesn't add any functional value towards growth and productivity.