Operations | Monitoring | ITSM | DevOps | Cloud

December 2019

Making on-call superheros

Building a world-class service is as much about maintaining software as it is about developing it. On-call engineers are typically responsible for ensuring the reliability and availability of your service i,e your reputation, and source of revenue. Robust on-call schedules ensure that the right people are ready-to-go during times of crisis. Organizations continue to depend on on-call schedules and incident response processes that are a source of stress/anxiety or panic to employees.

Incident Response 2.0 - The Zenduty Incident Command System(ICS)

We are super excited today to introduce our latest Zenduty integration with Slack, which we are calling the Zenduty Slack Incident Command System(Slack-ICS). This was many months in the making and went through multiple iterations and it is something we believe will redefine proactive incident management and response.

Incident Alert Routing - reducing noise and getting woken up only by alerts that matter

Site reliability engineers have one of, if not the, toughest roles in any organization. While dealing with incidents is one part of the job, the other is to build reliable systems. Google’s SRE book sums this approach nicely. One of the most important challenges for an SRE when it comes to balancing work between firefighting and toil reduction is the issue of alert noise.