
May 2020

Learning from Failures: Better Crash Reporting for Better Incident Response

Crashes are among the more serious problems that can occur when operating a service. A crashing component often causes cascading failures and service outages. Visibility into crash events is critical for gauging the extent of the damage and helping to prevent recurrences. Unfortunately, crashes are also among the harder problems to debug: the state of a crashed process is often compromised, so the process can’t be trusted to collect debugging information on its own.
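One common way to work around an untrustworthy crashed process is to gather the evidence from outside it. The following is a minimal sketch of that idea in Python, not the approach of any particular crash reporter: a parent process launches the service as a child and records how it ended. The names `CrashReport` and `run_supervised` are hypothetical, and the signal decoding assumes a POSIX system, where a negative return code indicates termination by a signal.

```python
import signal
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CrashReport:
    """Facts about an abnormal exit, collected by the parent process (hypothetical type)."""
    command: Sequence[str]
    exit_code: int
    signal_name: Optional[str]  # set only when the child was killed by a signal
    stderr_tail: str            # last bytes the child wrote to stderr


def run_supervised(command: Sequence[str], tail_bytes: int = 2000) -> Optional[CrashReport]:
    """Run `command` as a child; return a CrashReport if it exited abnormally, else None."""
    result = subprocess.run(command, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
    if result.returncode == 0:
        return None  # clean exit, nothing to report

    signal_name = None
    if result.returncode < 0:
        # On POSIX, a return code of -N means the child was terminated by signal N.
        try:
            signal_name = signal.Signals(-result.returncode).name
        except ValueError:
            signal_name = f"signal {-result.returncode}"

    # Keep only the tail of stderr so the report stays small.
    stderr_tail = result.stderr[-tail_bytes:].decode("utf-8", errors="replace")
    return CrashReport(list(command), result.returncode, signal_name, stderr_tail)
```

Calling `run_supervised([sys.executable, "-c", "import os; os.abort()"])` runs a child that aborts itself; the report is built entirely by the parent, so it does not depend on the crashed process still being in a usable state. A production setup would typically also capture core dumps or stack traces, which this sketch leaves out.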

Five Signs Your Monitoring Solution is Failing You

In a recent post, I talked about the strain that the current surge in demand for online services, driven by the COVID-19 pandemic, is placing on IT infrastructure. I also talked about how this sudden shift online has exposed weaknesses in monitoring practices and, in some cases, a total lack of adequate monitoring. Unfortunately, many online sites have experienced degraded service, poor customer experiences, and even complete outages.

COVID-19 is Placing Tremendous Strain on Online Services, Making Analytics More Important than Ever in Driving Business Success

COVID-19 is affecting nearly every company around the world. While the pandemic is affecting companies in different ways and to different degrees, many share one experience: the coronavirus is forcing much of our daily commerce online. As I wrote in a recent post, literally overnight we’ve had to find new ways of working, meeting, shopping, managing healthcare, and even staying entertained.