In the first post of this series, we covered the general idea and benefits of model-driven observability with Juju. In the second post, we dove into the Juju topology and its benefits with respect to entity stability and metrics continuity. In this post, we discuss how the Juju topology enables the grouping and management of alerts, how it helps prevent alert storms, and how that relates to SRE practices.
These days, I keep getting questions from customers about call handling. With the ongoing shift to working from home, it is becoming more and more important to make on-call staff easier to reach. Often the already overloaded service desk is used for this purpose. Of course, this leads to a) a deterioration in the quality of the service desk and b) delays between the receipt of a problem and the start of its resolution.
At Spike.sh, our mission is to help dev teams understand and resolve production issues faster. At the core of this is our Alert Reliability Engine, whose job is to make sure that a team member always gets an alert on their preferred channel. Currently, we support 7 channels: phone call, SMS, mobile push notifications, email, Slack, Microsoft Teams, and Discord. We wanted to give you a peek into how we achieve high deliverability across these channels.
HR people have a saying: right person, right place, right time, meaning that the right resources can make all the difference when it counts. The same goes for incident management and response, where very often the wrong person, place, or time can contribute to a mounting catastrophe. As systems grow, the right person really can make the difference during an outage, simply through their command or knowledge of the system.
A summary of our third Moogsoft engineering Twitch stream, in which we chat about all things DevOps.
Have you ever been a frustrated customer at the end of a service line, waiting for a resolution to your problem? After all the waiting, you hear a voice giving you a standard response: your request will be addressed and resolved soon. An incident need not be a harrowing experience; it can be turned into a positive customer experience by using customizable, publicly accessible status pages for timely incident communication.