Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Failover Conf Wrapup

Failover Conf was held on April 21, 2020, online. The folks at Gremlin came up with the idea of a virtual conference about reliability after many in-person conferences started being postponed or canceled due to COVID-19. The conference was a lot of fun to attend. I’ll be sharing some of my thoughts on the event and the talks I was able to catch. The videos for the talks haven’t been posted yet, but I’ll update this post with links to them when they are.

How PagerDuty and Partner Rundeck Enable Business Continuity for Digital Operations

At times like these when the world has been forced to adapt and go almost entirely digital, it’s imperative that our systems and platforms stay up and operational—all the times. We are going to great lengths to make sure that the hardware and software in our application stacks are reliable and responsive. Hardware is set up to have redundant backups and new code is tested and reviewed to make sure it doesn’t introduce any bugs into the system.

Darwin Was Right: Change Will Separate the Strong from the Weak

“It is not the strongest or the most intelligent who will survive, but those who can best manage change” said Charles Darwin over 150 years ago – and probably every IT Ops engineer out there these days would agree with him. According to Gartner (and probably your experience as well), over 80% of service disruptions these days are caused by changes in infrastructure and software.

Virtualize the NOC: Futureproof Your IT Investment with AIOps

By abruptly forcing most people to work from home, and by triggering an economic crisis, the global pandemic has upended business operations. Not only must business leaders facilitate remote work among their employees, but they must also accommodate new ways of interacting with suppliers, partners and customers. Meanwhile, businesses’ digital channels and infrastructure, already critical prior to the crisis, have become even more essential, and yet harder to monitor and manage.

What's New: Related Incidents, Business Response, Mobile Status Dashboard, & New Integrations

An always-on world requires a proactive and preventative approach to managing your digital operations. PagerDuty is proud to announce our latest release, which helps streamline remote remediation by providing an at-a-glance overview of your system’s health. While we’re known for on-call management and incident response, PagerDuty does much more, including providing visibility into the business impact of an incident.

Extracting Insights from Metrics with AIOps for Better Observability

In this second installment of this blog series, we’ll discuss the importance of analyzing metrics, and how AIOps helps you with this fundamental pillar of observability. Without proper metrics analysis, you’re left blind to potential outages, or possibly worse — inundated with false positive anomalies, leading to alert fatigue and ultimately business impacts. Automated discovery and analysis can’t be achieved with legacy tools nor will it scale with humans.

IT Teams Under "High Stress" Resolving Faster Than Ever Before

Seemingly simple digital moments, like checking into a flight, trigger a complex technical flow of events under the IT covers. A simple swipe or click relies on a complex IT ecosystem made up of millions of lines of code, spanning multiple software applications, hybrid and multi-cloud technologies, state-of-the-art IT infrastructure, security apps, and more.