Bangalore, India
Jul 2, 2020   |  By Ankur Rawal
Over the years, there have been many great talks on site reliability and incident response. Below are a few that we thought stood out (in no specific order) and are definitely worth a peek.
Jun 29, 2020   |  By Vishwa Krishnakumar
For folks who’ve made postmortems more meaningful at your company, it is important that you spread that learning around. Many companies have teams that do postmortems really well, and many engineering managers (EMs) want to spread the practice organically. But writing and following postmortems is the kind of practice that a lot of devs just don’t think or care about, and it can be extremely hard to enforce, especially without support from upper management.
Jun 7, 2020   |  By Ankur Rawal
In the following example, we will be creating a histogram in Grafana. Our data source is Prometheus’s cumulative histogram. I have captured the metrics using Micrometer’s distribution summary.
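Because Prometheus histogram buckets are cumulative (each `le` bucket counts every observation at or below its upper bound), it can help to see how per-bucket counts are recovered from them. The snippet below is a minimal sketch of that arithmetic only; the function name and the sample values are made up for illustration and do not come from the original post.

```python
import math


def per_bucket_counts(cumulative):
    """Turn cumulative {upper_bound: count} buckets into per-bucket counts.

    `cumulative` maps each bucket's upper bound (the `le` label) to the
    number of observations less than or equal to that bound.
    """
    ordered = sorted(cumulative.items(), key=lambda item: item[0])
    counts = []
    previous = 0
    for upper_bound, count in ordered:
        # Observations that fall in this bucket but not in any smaller one.
        counts.append((upper_bound, count - previous))
        previous = count
    return counts


# Hypothetical sample: 15 observations in total, 3 of them above 1.0.
sample = {0.1: 3, 0.5: 7, 1.0: 12, math.inf: 15}
print(per_bucket_counts(sample))
```

Each resulting pair is a bucket's upper bound and the number of observations that landed in that bucket alone, which is the shape a bar-style histogram needs.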
Jun 7, 2020   |  By Ankur Rawal
One of the most popular cloud disaster recovery models in the industry today is the “pilot light” model, where critical applications and data are already in place so that they can be quickly retrieved if needed. A simple question one must ask before adopting this model is what thought has been given to whether the AWS/GCP/Azure APIs will work and whether the requisite capacity will be available in the alternate region.
Jun 6, 2020   |  By Ankur Rawal
This tip is for those who are using Prometheus federation to monitor multiple clusters. How should Alertmanager be configured for multiple clusters? Say that an issue on cluster A should only send an alert for cluster A. In such cases, every alert should be routed to the proper team based on labels (if there is a problem with application A on cluster B, the responsible team should be notified). In the above case, two alerts are triggered by the same rule.
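The idea of routing on labels can be illustrated with a small sketch. Alertmanager itself is configured through its own routing tree rather than through code like this; the function, the route table, and the team names below are all hypothetical and only show the first-match principle of sending an alert to a receiver based on its `cluster` and `app` labels.

```python
def route_alert(labels, routes, default_receiver):
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for matchers, receiver in routes:
        if all(labels.get(name) == value for name, value in matchers.items()):
            return receiver
    # No route matched, so fall back to the catch-all receiver.
    return default_receiver


# Hypothetical routes: more specific matchers are listed first.
ROUTES = [
    ({"cluster": "B", "app": "A"}, "team-app-a"),
    ({"cluster": "A"}, "team-cluster-a"),
]

alert = {"alertname": "HighLatency", "cluster": "B", "app": "A"}
print(route_alert(alert, ROUTES, "default-team"))
```

With this ordering, a problem with application A on cluster B reaches the team that owns application A, while anything else on cluster A goes to the cluster A team.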
Jun 8, 2020   |  By Zenduty
Microsoft Dynamics is a line of enterprise resource planning and customer relationship management software applications. Microsoft markets Dynamics applications through a network of reselling partners who provide specialized services. Microsoft Dynamics forms part of "Microsoft Business Solutions". The Zenduty-Dynamics integration helps you escalate critical cases/incidents to the right team, proactively alert them about SLA violations, and bring SMEs and stakeholders into high-priority cases. To know more about the Integration,
May 29, 2020   |  By Zenduty
Incident Priorities and SLAs in Zenduty
Incident SLAs let you set acknowledgement and resolution SLAs for your incidents. SLAs allow your teams to prioritize incidents as well as increase transparency amongst incident stakeholders: support, account managers and management. Incident priority is the sequence in which an incident or problem needs to be resolved, based on impact and urgency. Priority also defines the response and resolution targets associated with Service Level Agreements. Each team in Zenduty can define its own priorities, such as P0/P1/P2/P3 or L0/L4/L16 etc.
Dec 16, 2019   |  By Zenduty
Watch the Zenduty Incident Command System in action!