Latest Posts

Making on-call superheroes

Dec 27, 2019 By Amrit Balraj In Zenduty

Building a world-class service is as much about maintaining software as it is about developing it. On-call engineers are typically responsible for ensuring the reliability and availability of your service, i.e., your reputation and source of revenue. Robust on-call schedules ensure that the right people are ready to go during times of crisis. Yet organizations continue to depend on on-call schedules and incident response processes that are a source of stress, anxiety, or panic for employees.

Incident Response 2.0 - The Zenduty Incident Command System (ICS)

Dec 15, 2019 By Vishwa Krishnakumar In Zenduty

We are super excited today to introduce our latest Zenduty integration with Slack, which we are calling the Zenduty Slack Incident Command System (Slack-ICS). It was many months in the making and went through multiple iterations, and we believe it will redefine proactive incident management and response.

Incident Alert Routing - reducing noise and getting woken up only by alerts that matter

Dec 10, 2019 By Vishwa Krishnakumar In Zenduty

Site reliability engineers have one of the toughest roles in any organization, if not the toughest. While dealing with incidents is one part of the job, the other is to build reliable systems. Google’s SRE book sums up this approach nicely. One of the most important challenges for an SRE when it comes to balancing work between firefighting and toil reduction is the issue of alert noise.

On-call doesn't have to be stressful

Nov 29, 2019 By Amrit Balraj In Zenduty

Being on-call is a critical duty that many operations and engineering teams must undertake to keep their services reliable and available. However, there are several pitfalls in the organization of on-call rotations and responsibilities that can lead to serious consequences for the services and the teams if not avoided.

The importance of GameDays

Nov 18, 2019 By Amrit Balraj In Zenduty

The term GameDay was coined by Amazon’s “Master of Disaster” Jesse Robbins, who created GameDays to increase reliability by purposefully creating major failures on pre-planned dates. GameDays help facilitate the values of chaos engineering. Chaos engineering is the disciplined practice of injecting failure into healthy systems. With modern IT services becoming increasingly sophisticated, continuously changing systems, outages are inevitable.

Site Reliability Engineering - Why you should adopt SRE

Nov 11, 2019 By Amrit Balraj In Zenduty

Site reliability engineering is a term coined by Google engineer Benjamin Treynor in 2003, when he was tasked with making sure that Google services were reliable, secure and functional. He and his team eventually wrote the book on SRE, which is available online for free for anyone interested in researching and implementing SRE best practices.

Relationships between Operations and Development Teams

Oct 16, 2019 By Amrit Balraj In Zenduty

Modern businesses are evolving rapidly with the advent of cloud, CI/CD and microservices. However, there still exists an extensive and obvious divide between principal business stakeholders and development teams. Development teams are often unaware of the challenges faced by operations teams and vice versa. This is where the need to adopt DevOps principles comes into the picture. DevOps came into existence as the natural successor to Agile practices in software development.

ChatOps - The future of collaboration

Oct 7, 2019 By Amrit Balraj In Zenduty

ChatOps is the implementation of chatbots to unify communication and collaboration. Through ChatOps, every single member of a team will be aware of what the other members are working on. It is the logical next step in the evolution of communication among teams after email and IM. Projects of today are developed at a global scale with millions of people as potential users; this means that teams are larger and often work in shifts or even remotely.

Post Mortems - Bringing clarity to incident reviews

Oct 3, 2019 By Amrit Balraj In Zenduty

An incident post mortem is known by many names: incident review, root cause analysis (RCA), learning review. But what does it entail? A post mortem is a post-incident activity to help organizations understand how the incident happened and to learn from it. Service incidents are an unavoidable hurdle for any company. When they do happen, the teams involved will be wholly focused on restoring service as quickly as possible.

The importance of Incident Roles

Sep 30, 2019 By Amrit Balraj In Zenduty

Modern technology organizations are required to be adaptive in their approach to incident management. A single project will have multiple teams working on different branches of integrated systems. Even if all the members have unified communication channels, there’s bound to be chaos when an interruption occurs in the service. The frontline response team will have to be on their toes to get to the root issues at the first signs of trouble.
