The latest news and information on incident management, on-call, incident response, and related technologies.
Sometimes, two concepts overlap so much that it’s hard to view them in isolation. Today, incident management and problem management fit this description to a tee. This wasn’t always the case. For a long time, these two ITIL concepts were seen as distinct, with specialized roles overseeing each. Incident management existed in one corner and problem management in the other. Then came the DevOps movement, and the lines suddenly blurred. So where do they stand today?
Incident management is a critical aspect of IT service management (ITSM) that revolves around restoring normal service operations as swiftly as possible after an unplanned interruption or reduction in quality. These interruptions, referred to as “incidents,” can range from a minor issue, such as a single user being unable to access a service, to a significant problem, such as a server crash or network outage affecting many users.
Alexander is a Senior SRE at Prezi, a video and visual communications software company. As a team, the Prezi SREs provide multiple services within the company. One of those is the observability stack, for which Prezi relies heavily on Grafana. Companies are always evolving to run more smoothly, serve their customers better, and operate cost-effectively.
As Site Reliability Engineering (SRE) continues to grow in popularity, many professionals are looking for ways to advance from junior to senior roles. While there is no one-size-fits-all approach, the transition from junior to senior SRE is marked by a gradual increase in experience and the development of a set of key skills. In this blog post, we will explore the insights and strategies shared by experienced SREs.
In today's digital landscape, most people understand that no system is perfect and data is never 100% safe. Incidents are bound to happen. How people learn about those incidents often influences their reactions. Mishandled incident communication can have drastic consequences for your company. For starters, it can drag out the incident response and harm your bottom line.