The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Why SREs Need to Embrace Chaos Engineering

Reliability and chaos might seem like opposite ideas. But, as Netflix learned in 2010, introducing a bit of chaos—and carefully measuring the results of that chaos—can be a great recipe for reliability. Although most software is created in a tightly controlled environment and carefully tested before release, the production environment is harsher and much less controlled.

Episode 5: Mooving to... Practical Postmortems

Episode 5, Mooving to… Practical Postmortems, covers how to leverage postmortems to learn effectively from failure. Postmortems are commonplace and are now considered a best practice in most modern engineering teams. However, there’s still a lot of confusion about what postmortems should be and, more importantly, what they should NOT be. Thom Duran, Senior Manager of Productivity from Panther, walks us through all that and more in the latest Mooving to… episode!

Top Incident Response Metrics & How to Use Them

Two areas a software organization should always strive to improve are the efficiency of incident management and overall application quality. Data analysis is one way that your organization can improve both. However, the questions remain: which metrics should be collected, and how can analysis of these metrics facilitate these improvements? Read on to learn about five key metrics essential to incident response.

Our fully-redesigned incident response experience delivers a more intuitive workflow

Today we’re releasing fully redesigned Slack and Command Center experiences for FireHydrant so anyone on your team can intuitively navigate the incident response process — in the app or on the web. There are many things you can do ahead of an incident to help things run smoothly: design and document your process, automate predictable steps, train the team, and run drills.

Don't Let Outages Ruin Your Reputation - Prevent Them With AIOps

The world is increasingly digital. The U.S. Census Bureau estimates e-commerce grew 14.2% from 2020 to 2021, for a total of $870.8 billion in sales. And just look at the trends in remote work. According to a FlexJobs and Global Workplace Analytics report, remote work has grown 44% over the last five years and an astonishing 159% over the last 12 years. Indeed, much of America relies on a slew of digital apps and services to get business done every day. So what does this mean for businesses?

SecOps tools - SecOps & incident management for 2022.

Importance of SecOps tools – Threats in the cyber world are becoming more complicated and sophisticated with each passing day, while the rapid expansion of digital operations, with more nodes, networks, and servers, has resulted in more vulnerabilities. This situation demands efficient SecOps teams and practices so that threats are thwarted and networks and data are always protected. What is SecOps, and what are the best SecOps tools?

AWS outage? A better way to monitor outages in Amazon Web Services

Amazon Web Services (AWS) needs no introduction. It's one of the most popular services in the world. In fact, it is the most popular cloud infrastructure provider (34%), according to this study. As with any other service, there are outages. If you run your infrastructure there, there's a good chance that outages have impacted your business in the past. And the reality for AWS (or any other service) is that there's a good chance it will happen again.

A deeper dive into the Rogers outage

Beginning at 8:44 UTC (4:44am EDT) on July 8, 2022, Canadian telecommunications giant Rogers Communications suffered a catastrophic outage taking down nearly all services for its 11 million customers in what is arguably the largest internet outage in Canadian history. Internet services began to return after 15 hours of downtime and were still being restored throughout the following day.