
AIOps

The latest news and information on AIOps, alerting in complex systems, and related technologies.

Why is Causation Important in AIOps?

Modern IT environments have become much more complex to manage, thanks to hybrid infrastructures and comprehensive instrumentation that constantly generate metrics, alerts, and event data. ITOps (IT Operations) and SRE (Site Reliability Engineering) teams are tasked with delivering superior performance and user experience across numerous applications while keeping budgets under control.

Episode 3: Mooving to… Stability: The Role of Catastrophic Failure in Software Design

In this episode of Mooving to… Stability: The Role of Catastrophic Failure in Software Design, we had the opportunity to chat with Jeff Atwood (yes, that Jeff Atwood of Coding Horror, Stack Overflow, and Discourse, where he is Chief Happiness Officer). Jeff started out writing 911 software for a small company in Boulder, Colorado, which was a crash course in writing code with real consequences. With this unique and deep perspective, B.J.

What Is Government Digital Transformation?

The U.S. federal government knows it has not kept pace with technology innovation. Recent legislation and a $1 billion modernization fund aim to bring the federal government up to date. What does government digital transformation mean, and what are federal IT leaders doing to modernize their agencies' IT?

How Many Tools Do ITOps Teams Need to Observe?

In recent years, virtually every enterprise has had to deal with an outage, leading to war rooms where ITOps teams are put on the spot. While these teams carry the burden of ensuring 100% uptime, it is often the tools they employ that don't live up to their promises. Especially in the wake of the pandemic, with working norms being redefined, ITOps teams have been under even greater pressure to deliver. Despite their efforts to be efficient and their reliance on cutting-edge technology, uptime often remains elusive.

Mooving to… Stability

Join seasoned veteran Jeff Atwood (yes, that Jeff Atwood of Stack Overflow and Discourse) as he discusses the role of catastrophic failure in software design. Users of modern apps expect as close to 100% uptime as possible, and they also expect fast results. When these expectations aren't met, we need to learn from those failures to create better designs. But what if your fault-tolerance design ends up being the cause of your issues? Sean Molloy and BJ Maldonado talk with Jeff about how you can learn from failure to improve your software.

AIOps in 2022 and Beyond: A Conversation with Gartner

Modern digital businesses adopt AIOps tools to gain continuous insights across the IT stack. These insights tell the full story of what's happening behind their systems, allowing IT teams to achieve the operational efficiency and high availability that lead to customer satisfaction. Older, siloed monitoring disciplines each provide data specific to only one area: the performance of the digital experience, the IT infrastructure, an application, or the network.