Incident Management

The latest news and information on Incident Management, On-Call, Incident Response and related technologies.

How we achieved pixel-perfect polish during our Status Pages launch

A few months ago, we released Status Pages. This project was quite different from anything we’ve approached before, and our goals were a departure from ones we had set in the past. With this in mind, we worked closely with our designer throughout the process of building Status Pages. Here is how we approached it and a few lessons we learned along the way!

Catalog vs. Thanos: Who came out on top?

Catalog is really, really powerful. To prove it, our latest product went up against the almighty Thanos and won decisively. Don’t believe us? Just look at how unscathed Catalog was once the dust settled. All jokes aside, we spent months building out what we think is one of the most capable products on the market today. Designed to be a map of everything that exists in your organization, Catalog can meaningfully help you level up your incident response.

Powering ConnectWise PSA With a New Alerting Workflow

In our previous blog from the ConnectWise series titled “OnPage-ConnectWise Incident Alert Management Workflows,” we discussed how customers are optimizing their investments in ConnectWise PSA. Now, we’re excited to present a new and powerful workflow specifically designed for after-hours that addresses the evolving needs of IT and Managed IT clients.

Understanding Chaos Engineering and its Benefits

In today's fast-paced technological landscape, ensuring the resilience and dependability of systems is crucial. This is where Chaos Engineering comes in, transforming how organizations approach system testing and fortification. Chaos Engineering helps find vulnerabilities that could go undetected under normal circumstances by purposefully introducing controlled interruptions and failures.

MTTR vs. MTBF vs. MTTF: Understanding Failure Metrics

In the dynamic landscape of software and web applications, failures can have severe consequences, impacting user experience, business continuity, and overall performance. To proactively address these challenges, organizations rely on robust monitoring practices supported by failure metrics. Failure metrics, specifically tailored to software and web application monitoring, provide crucial insights into system health, reliability, and optimization opportunities.
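As a rough illustration of the three metrics named in the headline, here is a minimal sketch using their standard textbook definitions. The `Outage` record, the function names, and the sample numbers are hypothetical and are not taken from the linked article.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One outage, in hours since the observation window began (hypothetical record)."""
    start_hour: float
    end_hour: float


def mttr(outages):
    """Mean time to repair: total downtime divided by the number of outages."""
    return sum(o.end_hour - o.start_hour for o in outages) / len(outages)


def mtbf(outages, window_hours):
    """Mean time between failures (repairable systems): total uptime / number of failures."""
    downtime = sum(o.end_hour - o.start_hour for o in outages)
    return (window_hours - downtime) / len(outages)


def mttf(lifetimes_hours):
    """Mean time to failure (non-repairable units): average lifetime before failure."""
    return sum(lifetimes_hours) / len(lifetimes_hours)


# Made-up sample data: three outages in a 720-hour (30-day) window.
outages = [Outage(100, 102), Outage(400, 401), Outage(650, 653)]
window_hours = 720

print(mttr(outages))                # 2.0 hours of repair per outage
print(mtbf(outages, window_hours))  # 238.0 hours of uptime per failure
print(mttf([1000, 1500, 2000]))     # 1500.0 hours average unit lifetime
```

MTTR and MTBF describe a system that is repaired and returned to service, while MTTF is usually reserved for components that are replaced rather than repaired.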

The Importance of Log Monitoring for Incident Response

In the face of growing security threats and incidents, businesses must prioritize their ability to detect, investigate, and respond effectively. Timely incident response is crucial for maintaining the security and integrity of systems and data. Among the essential tools in the incident response arsenal, log monitoring stands out as a critical component. By closely analyzing logs, organizations gain valuable insights into system events, user activities, and network traffic.

26 DevOps Automation Tools that SaaS Loves in 2023 | Blameless

DevOps is a term combining “development” and “operations”. It involves the use of tools and processes to minimize the time and effort spent on software creation and maintenance. Many DevOps technologies use automation to reduce manual tasks. Some of these automation tools use AI-based technology to replace manual operations, while others rely on simpler scripting and processing. Either way, automation speeds up feedback and improves performance between development and operations departments.

SIGNL4 Onboarding: Alert Notifications & Handling

The SIGNL4 Onboarding series walks users through the processes of SIGNL4, from Signup to Alerts to Settings. Today's video focuses on receiving alerts and all of the options available within your SIGNL4 alerts. This video is packed with helpful tips to help you get the most out of your account.

The Unplanned Show, Episode 4: Sriram Subramanian on Responsible Generative AI

Generative AI is a rapidly evolving ecosystem that is attracting a lot of attention. In this episode, Dormain Drewitz asks Sriram Subramanian about the main challenges of implementing generative AI responsibly, including content that’s harmful, inaccurate or violates privacy or security standards. Sriram discusses Microsoft’s six tenets of responsible generative AI, as well as the notion of shared responsibility between platform providers and foundational LLMs and the developers and data engineers building on top. Sriram also answers questions about where to get started safely with generative AI and shares his framework for identifying opportunities to add value.