The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.
Yesterday, April 8th 2021, at around 22:00 UTC, Facebook experienced a major outage in which Facebook, Messenger, WhatsApp web and Instagram were down for as much as 3 hours. This was reported on Facebook’s status page, which was a good example of how to communicate an incident.
Here we are a full quarter into 2021, a year that took off in a huge way for us, and the momentum continues to grow strong. March was a monumental month, and now it’s a wrap. We released significant updates across the board in almost all areas of Moogsoft, including pushing innovation to newfound levels when it comes to the ease of integrating your metric and event data.
This is the fourth in a series of blog posts exploring the role that intelligent observability plays in the day-to-day life of smart teams. In this post, Sarah and company discover how AIOps gives them "the time to save time!"
We won an award! We're excited to share that we were named the Major Incident Software Innovation of the Year 2020 at the MIM Awards. Our CEO, Robert Ross (better known as Bobby), accepted over video on our behalf (watch the video below). A lot happened for us in 2020 -- not only winning new business, but also growing as a team and maturing our product. We're excited that MIM felt the same way about us and we're honoured to receive this award!
The Suez Canal has been big news over the last couple of weeks. We wondered how a Site Reliability Engineer (SRE) might conduct a postmortem on what happened with the Ever Given, and what that might mean if a comparable incident occurred at a modern tech company.
When you start researching how to improve the reliability of your software, you will soon run into terms like SLOs and SLAs. It can sound intimidating, but it's quite straightforward to understand. In this post, we will introduce these terms, the differences between them and how to start using them to make your systems more reliable.
Financial services institutions have been facing pressure to modernize their operations for years. But legacy architecture and processes—along with compliance regulations—have made rapid innovation difficult to achieve. Adding to this pressure are new, digital-first competitors who accelerate the need for financial services to deliver better digital customer experiences both more consistently and at scale.
Event and alert filtering matters because alert fatigue is one of the most crucial issues in alerting and alert management. SIGNL4 implements a lightweight and effective way of filtering events. The overall process is based on alert categories. Alert categories are applied using a keyword search across the entire payload of incoming third-party events. But assigning alert categories, e.g. for alert augmentation, is not filtering.
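As a rough illustration of the keyword-based categorization described above, here is a minimal Python sketch. It is not SIGNL4's implementation: the category names, keywords, and the functions `assign_categories` and `should_alert` are all hypothetical, and the filtering rule shown (drop any event that matches no category) is an assumption made only to separate the two steps.

```python
import json

# Hypothetical category definitions (name -> keywords); not SIGNL4's real configuration.
CATEGORIES = {
    "database": ["postgres", "deadlock"],
    "network": ["packet loss", "dns"],
}


def assign_categories(payload, categories=CATEGORIES):
    """Return the names of all categories whose keywords occur anywhere in the payload."""
    # Serialize the whole event so the search covers every key and value.
    text = json.dumps(payload, ensure_ascii=False).lower()
    return [
        name
        for name, keywords in categories.items()
        if any(keyword.lower() in text for keyword in keywords)
    ]


def should_alert(payload, categories=CATEGORIES):
    """Assumed filtering rule: only events that match at least one category raise an alert."""
    return bool(assign_categories(payload, categories))
```

Keeping the two functions separate mirrors the distinction in the text: `assign_categories` only labels an event, while the decision to suppress it is a separate step that could follow a different rule.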