Latest Posts

Making on-call superheroes

Building a world-class service is as much about maintaining software as it is about developing it. On-call engineers are typically responsible for ensuring the reliability and availability of your service, i.e., your reputation and source of revenue. Robust on-call schedules ensure that the right people are ready to go during times of crisis. Yet organizations continue to depend on on-call schedules and incident response processes that are a source of stress, anxiety, or panic for employees.

Incident Response 2.0 - The Zenduty Incident Command System (ICS)

We are super excited today to introduce our latest Zenduty integration with Slack, which we are calling the Zenduty Slack Incident Command System (Slack-ICS). It was many months in the making and went through multiple iterations, and we believe it will redefine proactive incident management and response.

Incident Alert Routing - reducing noise and getting woken up only by alerts that matter

Site reliability engineers have one of the toughest roles, if not the toughest, in any organization. While dealing with incidents is one part of the job, the other is building reliable systems. Google’s SRE book sums up this approach nicely. One of the most important challenges for an SRE in balancing work between firefighting and toil reduction is alert noise.

On-call doesn't have to be stressful

Being on-call is a critical duty that many operations and engineering teams must undertake to keep their services reliable and available. However, there are several pitfalls in the organization of on-call rotations and responsibilities that, if not avoided, can lead to serious consequences for the services and the teams.

The importance of GameDays

The term GameDay was coined by Amazon’s “Master of Disaster” Jesse Robbins, who created GameDays to increase reliability by purposefully causing major failures on pre-planned dates. GameDays help put the values of chaos engineering into practice. Chaos engineering is the disciplined practice of injecting failure into healthy systems. With modern IT services becoming increasingly sophisticated, continuously changing systems, outages are inevitable.

Site Reliability Engineering - Why you should adopt SRE

Site reliability engineering is a term coined by Google engineer Benjamin Treynor in 2003, when he was tasked with making sure that Google services were reliable, secure and functional. He and his team eventually wrote the book on SRE, which is available online for free to anyone interested in researching and implementing SRE best practices.

Relationships between Operations and Development Teams

Modern businesses are evolving rapidly with the advent of cloud, CI/CD and microservices. However, there still exists an extensive and obvious divide between principal business stakeholders and development teams. Development teams are often unaware of the challenges faced by operations teams and vice versa. This is where the need to adopt DevOps principles comes into the picture. DevOps came into existence as the natural successor to Agile practices in software development.

ChatOps - The future of collaboration

ChatOps is the use of chatbots to unify communication and collaboration. Through ChatOps, every member of a team can be aware of what the other members are working on. It is the logical next step in the evolution of team communication after email and IM. Today's projects are developed at a global scale with millions of people as potential users, which means that teams are larger and often work in shifts or even remotely.

Post Mortems - Bringing clarity to incident reviews

An incident post mortem is known by many names: incident review, root cause analysis (RCA), learning review. But what does it entail? A post mortem is a post-incident activity that helps organizations understand how the incident happened and learn from it. Service incidents are an unavoidable hurdle for any company. When they do happen, the teams involved will be wholly focussed on restoring service as quickly as possible.

The importance of Incident Roles

Modern technology organizations are required to be adaptive in their approach to incident management. A single project will have multiple teams working on different branches of integrated systems. Even if all the members have unified communication channels, when an interruption occurs in the service, there’s bound to be chaos. The frontline response team will have to be on their toes to get to the root issues at the first signs of trouble.