The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

We can't all be Shaq: why it's time for the SRE hero to pass the ball and how to get there

At a going-away party for a job I was leaving a few years back, my VP of engineering told a story I didn’t even remember but that I know subconsciously shaped how I viewed my role on that team: toward the end of my very first day at the company, there was some internal system issue, and with pretty much zero context, I pulled out my laptop, figured out what was going on, and helped fix the issue.

Tracking On-Call Health

If you have an on-call rotation, you want it to be a healthy one. But this is sort of hard to measure because it has very abstract qualities to it. For example, are you feeling burnt out? Does it feel like you’re supported properly? Is there a sense of impending doom? Do you think everything is under control? Is it clashing with your own private life? Do you feel adequately equipped to deal with the challenges you may be asked to meet? Is there enough room given to recover after incidents?

4 Best Practices for Root Cause Analysis

Failures are a common part of any system’s lifecycle, so what should Root Cause Analysis look like for this type of problem? If you build and deploy a system, chances are high that you'll have to deal with a failure in the near future. What matters is how you handle such failures. As an organization, you need pre-formulated strategies to handle failures as and when they occur.

List of Potential Incident Management Issues

Incident management is the process that IT service management follows to respond to a service disruption and restore the service to normal as quickly as possible, minimizing the negative impact on the business. An incident is a single unplanned event that generates a service disruption, whereas a problem is a cause or potential cause of one or more incidents, as defined by ITIL incident management guidelines.
Sponsored Post

Major Incident Process Is at the Heart of Effectiveness

Read the new white paper on major incident management. Businesses need to be prepared for minor and major incidents to happen to their technologies, be it an integration disconnecting or an entire system being taken offline. Preparation ensures not only that losses can be minimized, but also that businesses can protect themselves and potentially their clients from risky impacts.

Making waves in IT Ops

It feels a bit surreal stepping into the Regional Vice President of Sales position here at BigPanda just a few months after the company achieved Unicorn status. In more than 15 years of managing enterprise software sales, this is the first time I knew I was going to play a critical role in facilitating a company’s ascension to the top of its sector. Even in college, I knew this was what I wanted.

How StatusCast makes managing incidents smarter in Slack

These days, more and more IT teams spend much of their workday in Slack. It’s essentially a second virtual home. For employees who use Slack as their main channel of communication, it stands to reason that they need access to tools, bots, apps, and more directly within the Slack environment. You shouldn’t have to leave your home to get your work done, and you shouldn’t have to leave Slack to communicate with and update your team and your clients.