Monthly Archive

5 Reasons to Switch from PagerDuty to a More Effective Alternative

Jul 31, 2024 By Vishal Padghan In Squadcast

When it comes to Incident Management, having the right tool can make all the difference between a swift resolution and prolonged downtime. While PagerDuty has long been a staple in the industry, many teams are finding more effective alternatives that better align with their needs and offer significant advantages. Here, we explore five compelling reasons to consider switching from PagerDuty to more efficient alternatives.

Reducing Coordination Costs in Incident Response

Jul 31, 2024 By Mandi Walls In PagerDuty

Incidents can happen anywhere at any time. They can be small, well-defined, and easily contained. They can be large, messy, and complex, like the major outage we saw recently. Or they can be somewhere in between. When incidents occur, mobilizing and coordinating responders is crucial to restoring service, protecting the customer experience, and mitigating business risks.

Redefining incident management: the power and pitfalls of AI

Jul 31, 2024 By Incident.io In Incident.io

Like it or not, AI is having a monumental impact on our lives. Most of the products we engage with today have AI features and functionality, aimed at assisting or completely replacing the actions normally taken by humans. When it comes to incidents, we’re firm believers in accelerating human actions, and believe the risk of over-automation far outweighs the benefits. In this live event we’ll dig a little deeper into why, as we cover the power and pitfalls of AI.

The Best SRE Tools To Improve Reliability and Streamline Operations

Jul 31, 2024 By Iryna Iurchenko In Rootly

For better or worse, most companies—including their execs and developers—see SREs as superheroes who’ll save them from the evils of downtime and service degradation with their boundless superpowers. SREs are expected to constantly perform dangerous stunts like production debugging or communicating highly technical issues to angry VPs. They must also be able to manage infrastructure, networks, databases, pipelines, operating systems and much more.

PagerDuty Expands Generative AI Solutions with PagerDuty Advance to Mitigate Risk of Operational Outages

Jul 30, 2024 By PagerDuty In PagerDuty

With AI-powered capabilities, enterprises can accelerate strategic roadmap initiatives, build more resilient operations and drive digital transformation initiatives.

Integrating Incident Management with Your Existing Systems: A Step-by-Step Guide

Jul 30, 2024 By Vishal Padghan In Squadcast

Streamline IT operations by integrating your incident management platform with your existing systems. Boost response times, enhance collaboration, and ensure reliability with our step-by-step guide.

Automated incident response in ITOps

Jul 30, 2024 By Amy Brennen In BigPanda

Most IT leaders realize that automating repetitive, low-level incident response actions brings multiple benefits. In IT, incident response refers to addressing any event that disrupts normal service, application, security operation, or performance. Using AI and machine learning, automation addresses incident analysis, detection, investigation, triage, and response. The question is often where to start or which approach is best.

Understanding Mean Time to Resolve

Jul 30, 2024 By Pablo Sencio In InvGate

Back in the day, IT teams often spent countless business hours manually sifting through logs, diagnosing issues, and identifying the root cause of a system failure. This painstaking process frequently led to prolonged downtimes and frustrated users. Today, organizations can’t afford such inefficiencies. Keeping systems running smoothly is key, and that’s where critical metrics like Mean Time to Resolve (MTTR) come into play.

Mitigate the Risk of Operational Failure with PagerDuty Advance, GenAI for Every Step of the Incident Lifecycle

Jul 30, 2024 By Débora Cambé In PagerDuty

As organizations increasingly rely on complex digital infrastructure, they must be ready to move rapidly when major incidents occur. The recent global outage has shown just how fragile IT systems can be. With mounting pressure to deliver seamless customer experiences, GenAI and automation present an opportunity to manage risk more effectively, by ensuring responders have the right information to restore services quickly.

PagerDuty Advance | Generative AI for PagerDuty Operations Cloud

Jul 30, 2024 By PagerDuty In PagerDuty

Introducing PagerDuty Advance: GenAI for critical operations work. For every step of the incident lifecycle. For scaling your teams. For sustaining customer experiences. For moving business forward – faster. Work more efficiently. Protect more revenue. Build greater operational resilience. PagerDuty Advance helps operations teams manage business-impacting issues in seconds, not hours. From event to resolution, PagerDuty Copilot’s automations help you resolve issues faster, reduce risk, and control costs.

Drive Operational Excellence featuring PagerDuty Advance

Jul 30, 2024 By PagerDuty In PagerDuty

Build operational excellence with PagerDuty. Watch this demo to see how the latest innovations for the PagerDuty Operations Cloud come together to help a team tackle a major incident related to a database upgrade. You’ll see how PagerDuty Advance capabilities work in concert with new functionality built for modernizing operations centers, standardizing automation at scale, and transforming incident management. The result? Improved innovation velocity, reduced operating costs, and better customer experiences.

Microsoft Outage MO842351: Understanding Impact & Scope Saves You From Raising Unnecessary Alarm Bells

Jul 30, 2024 By Amanda Griebeler In Martello Technologies

Just ten days after the last major Microsoft 365 outage, Microsoft reported another incident at 8:48 am on July 30, 2024. The message on X was vague, offering limited details about the scope and impact of the problem. This left many IT teams preparing for what they anticipated would be another rocky day.

Optimizing Incident Management: Effective Stakeholder Communication with Squadcast

Jul 29, 2024 By Spandan Pal In Squadcast

When a critical system goes down, every minute counts. Amid the chaos, it's easy to overlook a crucial aspect of Incident Management: keeping stakeholders informed. However, neglecting stakeholder communication can have disastrous consequences, including misinformation, delayed decisions, and frustration. Effective stakeholder communication is essential for ensuring a coordinated, efficient, and transparent response to incidents.

Where does the time go after you resolve an incident?

Jul 29, 2024 By Eryn Carman In Incident.io

We were curious: once an incident is over, how long does it take companies to document, review, create learnings, finish clean-up items, and complete any other follow-up action items? We work with a wide variety of companies, from small start-ups to Enterprises with thousands of engineers. But we wanted to know: where is their time spent after they resolve an incident? Here’s what we found!

25 Best Incident Management Software and Communication Platforms 2024

Jul 29, 2024 By Colin Bartlett In StatusGator

In 2024, only 45% of companies have an incident response plan in place. If your organization is among the 55% without one, it’s crucial to change that. Service outages are inevitable. Cyberattacks and information security threats are more prevalent than ever. So having the right incident management software can be a game-changer for your organization, helping you respond swiftly and effectively when issues arise. The challenge, however, lies in selecting the right incident management solution.

Enhancing Transparency in Incident Management with SIGNL4

Jul 29, 2024 By SIGNL4 In SIGNL4

Effective incident management is crucial for businesses to maintain smooth operations and customer satisfaction. However, ensuring transparency throughout the incident resolution process can be challenging. This is where SIGNL4 steps in, offering a comprehensive solution that enhances transparency at every step of incident handling.

Incidents are lessons, not failures

Jul 26, 2024 By Eduardo Crespo, VP of EMEA In PagerDuty

Delivering digital operations excellence - DevOps, incident management, and keeping organisations running - is a constant challenge. As customer digital expectations rise, so do the complexities of the tech stack and cloud services integrations. But to insist on 100% uptime and rush through incident management without taking learnings into account creates a poor culture that can damage the ability of the DevOps team. This is not how a business creates resilient infrastructure and high-performing teams.

Creating Schedules + Escalation Policies with Rootly On Call

Jul 26, 2024 By Rootly In Rootly

Ashley walks you through how to create a schedule and escalation policy using Rootly On-Call, a modern on-call and incident management solution. rootly.com/on-call.

Rootly On-Call: On-Call Shadowing Feature

Jul 26, 2024 By Rootly In Rootly

Shadowing experienced responders is one of the most effective ways for folks who are new to on-call to gain the confidence and knowledge to handle incidents independently. Traditionally, shadow rotations are cumbersome to set up, involving duplicating and editing an existing schedule. For Rootly On-Call users, setting up shadow rotations couldn’t be easier with our new native Shadowing feature. Here are a few highlights.

NYSE uses AIOps to identify problems faster and focus on innovation

Jul 26, 2024 By BigPanda In BigPanda

The New York Stock Exchange relies on AIOps to extract crucial incident insights, allowing IT teams to focus on innovation instead of manually investigating alert data. Chuck Adkins, CIO, shares how an AIOps tool helps the NYSE save time and resolve problems instead of searching through alerts to find them.

Reduce IT Noise Up to 98% with Alert Correlation

Jul 26, 2024 By BigPanda In BigPanda

Use a powerful AI/ML-driven alert correlation engine to group related alerts into meaningful, actionable incidents in real time. BigPanda provides default correlation patterns and gives you the option to tailor patterns to your organization.

Enable ilert Intelligent Alert Grouping

Jul 26, 2024 By iLert In iLert

Intelligent alert grouping is a new feature of ilert. It is powered by ilert AI and designed to prevent alert fatigue. The feature combines alerts into groups based on their content. Our video explains how to enable alert grouping for your alert source and how to adjust the accuracy of the grouping. The feature is a part of the new powerful ilert add-on and is currently available at no extra cost during the Beta phase.

Leveraging AI for Efficient On-call Scheduling

Jul 26, 2024 By Sirine Karray In iLert

Regardless of industry specifications, creating and maintaining a highly functional incident management process is crucial for organizations of all sizes. The various potential applications of Generative AI in this process can significantly enhance the efficiency, accuracy, and speed of incident detection, analysis, and resolution. GenAI can be utilized across all stages of the incident management process, including preparation, response, communication, and learning.

How our data team handles incidents

Jul 26, 2024 By Navo Das In Incident.io

Historically, data teams have not been closely involved in the incident management process (at least, not in the traditional “get woken up at 2AM by a SEV0” sense). But with a growing involvement of data (and therefore data teams) in core business processes, decision making, and user-facing products, data-related incidents are increasingly common, and more important than ever.

A Guide to Implementing Effective Incident Response Strategies

Jul 25, 2024 By OpsMatters In OpsMatters

It's not a matter of if an incident will happen. It's a matter of when. Check out this guide to incident response strategies before the next threat strikes.

Network topology: Definition and role in observability

Jul 25, 2024 By BigPanda In BigPanda

Network topology describes how a network’s nodes, connections, and devices are physically arranged and interconnected, as well as how they communicate. The arrangement or configuration of a network’s components plays a crucial role in ensuring smooth ITOps with minimum downtime. Any issues in the network can disrupt operations, leading to potentially dire consequences. To prevent this, you need to understand your network functionality and structure.

Demo Roundups! Scale Support Teams with PagerDuty's CX Operations

Jul 25, 2024 By PagerDuty In PagerDuty

PagerDuty’s Solutions Consulting Team Lead Michael Aravopoulos presents an exclusive live demo showcasing PagerDuty's Customer Service Operations capabilities. Identify and address issues before they affect your customers. Automate incident discovery and response to deliver streamlined digital experiences. Facilitate communication and coordination between customer service and technical teams.

Effective Slack on-call protocols for engineers

Jul 25, 2024 By Falit Jain In Pagerly

Talks about being on call are usually met with complaints. Here's how to alter the narrative and develop a stronger, more compassionate process. A few years ago, I took oversight of a significant portion of our infrastructure. It was a complex undertaking that, if not managed and regulated properly, could have resulted in major disruptions and economic consequences over a large area.

Evaluating Opsgenie Alternatives in 2024

Jul 24, 2024 By Ritika Bramhe In OnPage

In today’s digital age, customer expectations are at an all-time high, with demands for instant support, flawless user experiences, and constant service availability. This environment of heightened expectations pushes organizations to innovate and streamline their operations continuously. Ensuring seamless service delivery hinges on the ability to detect and resolve issues swiftly, whether they are server crashes, software bugs, or unexpected outages.

The Debrief: Debriefing on the Crowdstrike incident

Jul 24, 2024 By Incident.io In Incident.io

In this episode, Norberto (VP of Engineering) and Lawrence (Product Engineer) delve into the recent CrowdStrike incident that began on July 19th. Rather than focus on technical specifics, they provide a thoughtful exploration of key aspects that matter to us at incident.io, such as effective communication, overall response strategies, and proactive problem-solving during crises.

Beyond MTTR: 7 incident metrics that matter and 3 that don't

Jul 24, 2024 By Ashley Sawatsky In Rootly

Pets.com was an online pet supply retailer founded in 1998, during the dot-com craze. In February 2000, it raised $83 million to go public based mainly on metrics like user acquisition, website traffic, and brand recognition. However, the profit margins were minimal and the marketing costs exorbitant, which led Pets.com to file for bankruptcy nine months after its IPO. The industry now recognizes these metrics as vanity metrics.

Execution Incident management on Slack

Jul 24, 2024 By Falit Jain In Pagerly

The article discusses streamlining on-call and incident management, focusing on the implementation of a new workflow. One key issue highlighted is the complexity of integrating various tools and platforms used for incident response, which can lead to fragmented communication and delayed resolutions. Another challenge is ensuring the efficiency of escalation protocols, where delays or missteps can impact response times.

Transfer to the on-call using Slack

Jul 24, 2024 By Falit Jain In Pagerly

Handover for on-call schedules in this workflow can be problematic due to inconsistent communication and lack of clear documentation. Misunderstandings can occur when shifts change, leading to missed alerts or incomplete information being passed along. Relying solely on Slack can result in important details being buried in message threads, making it hard to track ongoing issues.

Controlling vacation and paid time off with Slack

Jul 24, 2024 By Falit Jain In Pagerly

Managing PTO and vacation time in on-call workflows can lead to coverage issues, particularly when team sizes are small. Ensuring adequate coverage during local and global holidays can be complex, often requiring shifts to be swapped, which can disrupt team balance. Handling on-call duties during these periods may strain the available staff, potentially leading to fatigue and decreased effectiveness. Coordination and planning become crucial to maintain service reliability and avoid burnout.

Change the arrangement with Slack

Jul 24, 2024 By Falit Jain In Pagerly

Managing PTO and vacation time in on-call workflows faces several issues. Scheduling conflicts can arise when PTO requests overlap with critical on-call periods, leading to inadequate coverage. Automated systems may not always account for last-minute changes, causing potential gaps in availability. Coordination between HR, calendar systems, and on-call schedules can be complex, often resulting in miscommunication.

Ticket management (Pagerduty, Jira, Slack, JSM) on Slack

Jul 24, 2024 By Falit Jain In Pagerly

The article addresses the integration of ticket administration across platforms like Jira, Slack, JSM (Jira Service Management), and PagerDuty to streamline on-call and incident management. However, a potential challenge with such integrations lies in maintaining consistency and synchronization across these disparate systems. Issues may arise from delays or discrepancies in updating ticket statuses between platforms, leading to confusion or duplication of efforts among teams.

Alerts using Teams and Slack

Jul 24, 2024 By Falit Jain In Pagerly

Using Slack and Teams for alerts can lead to several issues. The sheer volume of notifications can overwhelm team members, causing critical alerts to be missed or ignored. Time zone differences can further complicate timely responses. Integrating alerts from multiple systems into these platforms may cause confusion and delay in identifying and addressing incidents.

Protocols for Transfer while using Slack

Jul 24, 2024 By Falit Jain In Pagerly

This article likely addresses challenges and considerations in implementing transfer protocols within an on-call and incident management workflow. Transfer protocols are crucial for ensuring the seamless handover of responsibilities and information between on-call personnel during shift changes or the escalation of incidents. Ensuring that all relevant details and context are effectively passed on helps prevent misunderstandings and delays in resolving critical issues.

Steps to AIOps maturity: Establish actionable incidents

Jul 24, 2024 By Katie Petrillo In BigPanda

Lack of communication between IT operations and ITSM teams results in data silos. And data silos make it challenging, if not impossible, to solve problems efficiently. One-third of ITOps professionals say that gathering business context is the biggest challenge to effective incident response and management, according to EMA Research.

Enhancing Incident Collaboration: Jira Notes Now Integrated with Squadcast

Jul 23, 2024 By Rahul Jagdish In Squadcast

We're excited to share a significant improvement to our Jira integration aimed at enhancing your incident management workflow. With our latest update, you can now seamlessly sync notes between Jira tickets and Squadcast incidents. This bidirectional sync ensures that any comment added in one platform automatically appears in the other.

What's happening with ITSM in 2024?

Jul 23, 2024 By Conor Castronovo In BigPanda

The lines between IT service management (ITSM) and AIOps are blurring. The Gartner Hype Cycle for ITSM, 2024 discusses this exciting convergence. Traditionally, ITSM has focused on structured processes and best practices. AIOps brings valuable new capabilities to service management, including automation, correlation, machine learning, and real-time insights. This convergence augments established ITSM frameworks and processes rather than replacing them.

BYO Payload: Custom event sources for Signals have landed

Jul 23, 2024 By Robert Ross In FireHydrant

Automated event payloads come in many shapes and sizes. These infinitely different event structures pose a problem for users who want to send them all to the same place to page on-call staff. Unless that on-call solution supports the schema directly, you’re out of luck. While we’re proud of the number of integrations we support today for event sources into on-call, we also think the best number that we should support is infinity.

Evaluating PagerDuty Alternatives in 2024 (Updated)

Jul 23, 2024 By Ritika Bramhe In OnPage

We live in times of instant gratification, where customers expect same-day delivery, round-the-clock tech support, and seamless browsing experiences. Disruptive technologies and continuous innovation have raised expectations for faster and uninterrupted delivery of services. This shift is compelling organizations to adapt their operations to meet these new demands and stay competitive.

xMatters Wonder Boy Release

Jul 23, 2024 By xMatters In xMatters

You’ll be resolving incidents faster than a caveman on a skateboard with our latest release, Wonder Boy! Since we don’t have any prehistoric monsters to deal with these days, we’re focusing on the future of incident management with a ton of exciting new features and updates packed into this release!

Learning from Major Incidents: The Opportunities We're Missing

Jul 22, 2024 By Nora Jones In PagerDuty

Incidents are untimely, stressful, and likely to highlight communication breakdowns within an organization, but they can also be a powerful tool for learning and growth. When an incident with a large impact occurs, which feels like something we read about in the news on a weekly basis, the focus is often on two things: stabilizing the situation and controlling the narrative. Organizations often miss the opportunity incidents present: learning.

The Microsoft-CrowdStrike Outage: An In-Depth Analysis

Jul 22, 2024 By Raja Shekar Mulpuri In HEAL Software

On July 19, 2024, a significant global outage caused widespread disruptions across various industries. This outage was primarily linked to a faulty update from CrowdStrike’s Falcon Sensor, which led to severe issues on Windows systems. CrowdStrike is a leading cybersecurity company that specializes in protecting businesses from online threats.

ilert Call Flow: Turn Text into Speech Using AI

Jul 22, 2024 By iLert In iLert

A great new update to the ilert call flow: Turn text into speech using AI! Choose from various voice options to make your automatic responses sound natural and human-like. This feature is available for Voicemail, Audio messages, IVR menus, and PIN code nodes.

Microsoft 365 Outage, MO821132: Users may be unable to access various Microsoft 365 apps and services

Jul 20, 2024 By Simon Dion In Exoprise

Thursday evening, Microsoft 365 identified a global outage affecting users accessing various Microsoft 365 applications and services. Impacted users suffered from login issues, Azure hosted virtual machines not being available, and constant loading screens in Microsoft 365 services, just to name some of the issues.

The IT Scramble is On with a Microsoft Outage: Incident MO821132 - July 18, 2024

Jul 19, 2024 By Sara Purdon In Martello Technologies

On July 18, 2024, at 6:38 pm ET, Vantage DX, Martello’s Microsoft 365 and Teams performance management solution, started to see indicators of a likely Microsoft outage impacting users’ ability to access various Microsoft 365 apps and services. About an hour later, at 7:41 pm ET, Microsoft issued a statement on X.

Global Microsoft Outage and Preventing Future Vulnerabilities

Jul 19, 2024 By Mishal Alam In uptime

In a recent unexpected turn of events, a faulty component in the latest CrowdStrike Falcon update led to widespread outages, crashing Windows systems globally. The repercussions were felt across various sectors, including airports, TV stations, hospitals, and even emergency services in the U.S. and Canada. The glitch, affecting both Windows workstations and servers, resulted in massive outages, bringing entire companies to a standstill and crashing fleets of hundreds of thousands of computers.

July 19th global IT outage reminds us of digital complexity

Jul 19, 2024 By Dritan Suljoti In Catchpoint

As we write, on Friday July 19th, a massive global cyber outage is continuing to take down critical services around the world dependent on Microsoft-based computers.

Beyond the Headlines: The Unsung Art of Software Outage Management

Jul 19, 2024 By Robert Ross In FireHydrant

Today, the entire world is feeling the pain of a major software outage. While we know a lot about these occurrences—our entire business is built on helping companies manage incidents and outages effectively—we’re not here to share our opinion on it. Instead, we’d like to help those unfamiliar with the incident lifecycle understand what happens when an outage like this occurs, who is responsible for what, and what companies ultimately do to get things working again.

Learning Moment: Effective Customer Communication During Incidents - Enhance Visibility & Response with Uptime.com

Jul 19, 2024 By Jonathan Franconi In uptime

The recent global outage caused by an operating system update reminded me of how vulnerable we are today and, most importantly, how close we always are to global-scale incidents, with millions of interconnected dependencies. When the base of the house collapses, everything built on top is impacted. Those of us in IT Operations, Monitoring, Observability (insert the current acronym), etc., know this risk firsthand; we face it every day.

Integration Spotlight: PagerDuty and Robusta

Jul 19, 2024 By PagerDuty In PagerDuty

Bring powerful AI troubleshooting and cause analysis to your incident response with Robusta's integration with PagerDuty. Join us to learn more from CEO Natan Yellin on how your team can improve your k8s reliability.

A tough day for incident responders: lessons from the CrowdStrike update

Jul 19, 2024 By Stephen Whitworth In Incident.io

Today marks a particularly challenging day for incident responders across the globe. As many of you may have noticed, a recent update from CrowdStrike has triggered widespread disruptions, causing chaos in various sectors. The ripple effects have been far-reaching and severe. While the technical specifics of the issue might not be the focus here (and indeed, there are experts better suited to dissect the cause), what's crucial is understanding the impact on those who manage such crises.

Read Post

Incident.io

Read more about A tough day for incident responders: lessons from the CrowdStrike update

Nexthink Stops MS Outage From Hurting a Leading Consumer Goods Company

Jul 19, 2024 By Gaurang Ganatra In Nexthink

While individual blue screen errors are frustrating, the recent global system crashes caused by a CrowdStrike update incompatible with Microsoft Windows have wreaked havoc across entire industries since early Friday morning. Companies across the airline, media, and banking industries have been facing significant disruptions, with thousands of customer-facing devices showing blue screens, causing widespread travel delays and chaos.

Read Post

Nexthink

Read more about Nexthink Stops MS Outage From Hurting a Leading Consumer Goods Company

UptimeRobot Alerts Spike 5x Due to Microsoft/CrowdStrike Global Issues

Jul 19, 2024 By Tomas Koprusak In Uptime Robot

Given recent global events, UptimeRobot is experiencing an increased number of downtime notifications. We are currently sending out five times more notifications than usual due to a widespread power outage impacting several critical services worldwide. Here’s a brief overview of the situation and how it affects our monitoring services.

Read Post

Uptime Robot

Read more about UptimeRobot Alerts Spike 5x Due to Microsoft/CrowdStrike Global Issues

Incident vs Problem: What's the Difference?

Jul 18, 2024 By Ekaterina Glozshtein In Alloy Software

For the rest of the world, these are just two synonyms. But in ITIL, the main IT service management framework, the distinction is crucial. Let’s find out.

Read Post

Alloy Software

Read more about Incident vs Problem: What's the Difference?

Time, timezones, and scheduling

Jul 18, 2024 By Henry Course In Incident.io

Our On-call product has been in the wild for a few months now, and in this post I want to talk about building a time-sensitive system and what we did to handle some of the challenges. I’ll cover what our scheduler is responsible for, the basics of working with time, and talk a bit about how we tested our system.

Read Post

Incident.io

Read more about Time, timezones, and scheduling

What is ServiceOps?

Jul 17, 2024 By Sam Osborn In BigPanda

Service operations (ServiceOps) is a technology-enabled approach that unifies IT operations and IT service (ITSM) teams and facilitates frictionless collaboration for more effective incident management. ServiceOps combines people, processes, and technology to improve visibility, workflows, and collaboration between otherwise siloed departments. Organizations of all sizes and industries worldwide have adopted ServiceOps.

Read Post

BigPanda

Read more about What is ServiceOps?

Steps to AIOps maturity: Reducing alert noise

Jul 17, 2024 By Nathan Bao In BigPanda

In a previous post, we discussed initiating your AIOps journey. Now, we’re exploring best practices for progressing through each phase of the maturity journey. In these posts, we’ll focus on how to get started, proven techniques, key participants, and how to measure success.

Read Post

BigPanda

Read more about Steps to AIOps maturity: Reducing alert noise

The Impact of On-Call on Mental Health

Jul 17, 2024 By Zoe Collins In OnPage

Lately, I have been thinking about the mental health effects that stem from working in the cybersecurity industry. And in my research, I came across an Afternoon Cyber Tea podcast that sparked my interest. During their talk, host Ann Johnson and Dr. Ryan Louie, MD, PhD, dissect parallels between those who work in cybersecurity and those who work in healthcare, and uncover how these types of jobs affect mental health.

Read Post

OnPage

Read more about The Impact of On-Call on Mental Health

Automating SLO Management: Boost Efficiency, Accuracy, and Reliability

Jul 16, 2024 By Vishal Padghan In Squadcast

82% of organizations plan to increase their use of Service Level Objectives (SLOs), with 95% reporting that SLO adoption drives better business decisions, according to the Nobl9 2023 State of SLOs report. The traditional manual management of SLOs often results in inefficiencies and human errors, hindering productivity. Automating SLO management transforms these processes, enhancing accuracy and operational efficiency.

Read Post

Squadcast

Read more about Automating SLO Management: Boost Efficiency, Accuracy, and Reliability

The complexity of phone networks

Jul 16, 2024 By Leo Sjöberg In Incident.io

Arguably the most important part of an on-call product is knowing that you will be notified when things break, wherever you are. When it comes to SMS and phone call notifications, we have to leave the familiar realm of the internet and JSON responses, and deal with systems that provide limited observability and insight into what’s gone wrong.

Read Post

Incident.io

Read more about The complexity of phone networks

Squadcast leads the IT Alerting and Incident Management Landscape in G2's Summer 2024 Report

Jul 15, 2024 By Squadcast Community In Squadcast

Squadcast shines bright this summer, securing an impressive 38 badges across 95 reports, showcasing our IT Alerting and Incident Management leadership.

Read Post

Squadcast

Read more about Squadcast leads the IT Alerting and Incident Management Landscape in G2's Summer 2024 Report

What are event intelligence solutions?

Jul 15, 2024 By Conor Castronovo In BigPanda

As technology evolves, so does the language we use to describe it. Not surprisingly, IT operations have evolved dramatically since 2016. Given these changes and enhancements in artificial intelligence, the industry is overdue for an updated definition of AIOps platforms. AIOps isn’t going away, but we are changing some ways we talk about it. In the Gartner Hype Cycle for ITSM, 2024, Gartner announced new phrasing to describe the technology used in event management.

Read Post

BigPanda

Read more about What are event intelligence solutions?

Building a multi-platform on-call mobile app

Jul 15, 2024 By Rory Bain In Incident.io

A significant part of being on-call is the ability to respond to pages and handle escalations on the go. In the early stages of developing incident.io On-call, we considered whether a Minimum Viable Product (MVP) could rely solely on SMS and phone calls. However, we quickly realized that a fully featured mobile app was going to be essential to the on-call experience. This led us to the question: how should we build this mobile app?

Read Post

Incident.io

Read more about Building a multi-platform on-call mobile app

#7 Virtual Meetup EMEA Rundeck by PagerDuty OSS Community

Jul 15, 2024 By PagerDuty In PagerDuty

Automation is at full speed and back with #7 EMEA Rundeck by PagerDuty Meetup! Justyn Roberts, Sr. Solutions Consultant at PagerDuty, presents Learnings from the Last 3 Years: 10 Tips to Get the Best from Rundeck by PagerDuty. Follow us on Social.

View Video

PagerDuty

Read more about #7 Virtual Meetup EMEA Rundeck by PagerDuty OSS Community

Rootly Retrospectives Demo

Jul 15, 2024 By Rootly In Rootly

Post-incident learning made effortless. Rootly automates the retrospective process with customizable templates based on industry best practices.

View Video

Rootly

Read more about Rootly Retrospectives Demo

Short lived teams, sweating the details and zero bugs. How Linear raises the bar

Jul 13, 2024 By Incident.io In Incident.io

View Video

Incident.io

Incident Management

Read more about Short lived teams, sweating the details and zero bugs. How Linear raises the bar

Dear Customers, we couldn't have done it without you. With love, incident.io

Jul 12, 2024 By incident.io In Incident.io

We’re excited and honored (and might even be blushing a little) to share our Summer 2024 accolades from G2, including being ranked #1 in G2’s Relationship Index! Several factors go into determining this ranking. While all of these awards are special to us, Best Relationship means a lot because, well, our customers mean a lot.

Read Post

Incident.io

Read more about Dear Customers, we couldn't have done it without you. With love, incident.io

In a Flash, Create a Page or Incidents with Slack

Jul 10, 2024 By Falit Jain In Pagerly

In Slack, creating a page or incident is simple. To rapidly create an issue, hover over a message in Slack or use the /page command.

Read Post

Pagerly

Read more about In a Flash, Create a Page or Incidents with Slack

Receive Reminders about Events with Slack

Jul 10, 2024 By Falit Jain In Pagerly

You can use Pagerly to set up reminders if you would like to be informed of an upcoming rotation change or on-call handover. When a schedule change is about to occur, Pagerly will automatically recognise it and send a notification to the team's Pagerly channel.

Read Post

Pagerly

Read more about Receive Reminders about Events with Slack

Mention the on-call situation in Slack channels

Jul 10, 2024 By Falit Jain In Pagerly

Get each team's current on-call responder automatically and tag them in any Slack topic. To bring up the current on-call responder in any channel, topic, or conversation, use @Pagerly. You may also use this to automate responses.

Read Post

Pagerly

Read more about Mention the on-call situation in Slack channels

Decoding Severity: A Guide to Differentiating Major vs Critical Incidents

Jul 9, 2024 By Spandan Pal In Squadcast

Recognizing the difference between major and critical incidents is essential for IT operations, as downtime can result in significant financial losses for businesses. Gartner highlights that effective incident management can cut downtime by as much as 40%. Major incidents disrupt business operations but are typically confined to specific systems or processes.

Read Post

Squadcast

Read more about Decoding Severity: A Guide to Differentiating Major vs Critical Incidents

Behind the scenes: Launching On-call

Jul 9, 2024 By Henry Course In Incident.io

March 5th was a big day for incident.io as we released our on-call product to the world. Nine months of listening to our customers, coding, fixing, testing, and polishing came together for our biggest product launch to date. Releasing On-call was a huge milestone and represented the next step in our journey as a company.

Read Post

Incident.io

Read more about Behind the scenes: Launching On-call

Align ServiceOps with incident context to meet ITOps goals

Jul 9, 2024 By Sam Osborn In BigPanda

ServiceOps is a technology-enabled approach that unifies IT operations and IT service management (ITSM) teams to improve incident management. In a recent survey of more than 400 global IT leaders by Enterprise Management Associates (EMA), 96% of respondents reported positive results from implementing the approach. Adoption rates are high: 75% have either an active effort or a formal initiative to streamline collaboration between ITSM and ITOps teams.

Read Post

BigPanda

Read more about Align ServiceOps with incident context to meet ITOps goals

Round Robin escalation policies: do's and don'ts

Jul 9, 2024 By Ashley Sawatsky In Rootly

The concept of Round Robin comes from sports. It has nothing to do with anyone called Robin; it comes from the French word ruban (ribbon). In a Round Robin tournament, all participants face each other by taking turns. When applied to on-call schedules, a Round Robin escalation policy means that responders assigned to a level will take turns responding to alerts. When is this strategy useful, and when isn’t it?

Read Post

Rootly

Read more about Round Robin escalation policies: do's and don'ts

Part I: #3 Virtual Meetup Rundeck by PagerDuty Asia Pacific OSS Community.

Jul 8, 2024 By PagerDuty In PagerDuty

Part I: #3 Virtual Meetup Rundeck by PagerDuty Asia Pacific OSS Community. Customer Success Story: Samuel Kanagaraj (SRE Lead @ Telstra). Automate with Rundeck by PagerDuty! Explore the transformative power of automation through real-world success stories and expert insights. Hear firsthand from Samuel Kanagaraj, SRE Lead at Telstra, as he shares how automation has revolutionised their operations.

View Video

PagerDuty

Read more about Part I: #3 Virtual Meetup Rundeck by PagerDuty Asia Pacific OSS Community.

Part II: #3 Virtual Meetup Rundeck by PagerDuty Asia Pacific OSS Community.

Jul 8, 2024 By PagerDuty In PagerDuty

Part II: #3 Virtual Meetup Rundeck by PagerDuty Asia Pacific OSS Community. Customer Success Story: Jared Vern & Christopher Gadd (Automation Engineers @ One New Zealand). Automate with Rundeck by PagerDuty! Explore the transformative power of automation through real-world success stories and expert insights. Jared Vern and Christopher Gadd, Automation Engineers at One NZ, discuss their experiences and the impact of automation on their workflows.

View Video

PagerDuty

Read more about Part II: #3 Virtual Meetup Rundeck by PagerDuty Asia Pacific OSS Community.

Onboarding yourself as an engineer at incident.io

Jul 5, 2024 By Pip Taylor In Incident.io

At incident.io we use infrastructure as code for configuring everything we can, and we feel that there’s no reason we should exclude our own product from that. As well as configuring things like Google Cloud Platform, Sentry and Spacelift via our infrastructure repo, we also configure incident.io. On your first day as an engineer here, the first PR that you make is to our infrastructure repo.

Read Post

Incident.io

Read more about Onboarding yourself as an engineer at incident.io

Runbooks vs Playbooks: Differences & How to Choose

Jul 4, 2024 By Lauren Craigie In Cortex

Are you documenting your incident response process, and are unsure which you should be writing—a runbook or a playbook? Could these be two names for the same kind of document? Read on to learn about two different and complementary structures: playbooks and runbooks. The two are used in tandem, and because the terms are sometimes used interchangeably, they can be mistaken for one another.

Read Post

Cortex

Read more about Runbooks vs Playbooks: Differences & How to Choose

Live Call Routing with Squadcast: Helping Teams Achieve Faster Resolutions

Jul 4, 2024 By Squadcast In Squadcast

This is a recording of our webinar on how Squadcast's Live Call Routing is revolutionizing incident response for teams. In this informative session, you'll learn: the hidden costs of traditional incident reporting methods; how a dedicated phone line streamlines incident communication; Squadcast's easy-to-use, no-code setup for Live Call Routing; and real-world case studies showing how companies have drastically improved their MTTR.

View Video

Squadcast

Read more about Live Call Routing with Squadcast: Helping Teams Achieve Faster Resolutions

Sync the Slack Usergroup with Oncall Schedule

Jul 4, 2024 By Falit Jain In Pagerly

Automatically update the users in the usergroup based on on-call status and begin tagging with @sre-oncall. This can be integrated with your own custom rotation or a PagerDuty/OpsGenie schedule.

Read Post

Pagerly

Read more about Sync the Slack Usergroup with Oncall Schedule

On-Call Life: Setting Expectations

Jul 3, 2024 By Ritika Bramhe In OnPage

Imagine this: You’ve just been offered a new job in tech. Maybe it’s your first job right out of college, and you’ve only heard of being on-call in passing conversations up until this point. Or, perhaps you’ve been in tech your whole life but never had to be on-call until today. Or, maybe you’re contemplating whether on-call is for you because your company is dangling some extra cash (because, who doesn’t like extra money!).

Read Post

OnPage

Read more about On-Call Life: Setting Expectations

Two-way synchronisation Slack, JSM, and Jira

Jul 3, 2024 By Falit Jain In Pagerly

Synchronizing Jira, Jira Service Management (JSM), and Slack bidirectionally is complex due to differing data structures, permissions, and update frequencies, affecting real-time responsiveness and data consistency. Robust API integration and meticulous permission management are crucial for ensuring reliable synchronisation and secure data exchange, essential for effective cross-platform collaboration and efficiency.

Read Post

Pagerly

Read more about Two-way synchronisation Slack, JSM, and Jira

What is a Runbook? Improve Efficiency and Incident Response

Jul 2, 2024 By Cortex In Cortex

IT complexity expands every year, nudging up the possibility of unforeseen consequences, such as outages and other disruptions to operations. Runbooks are a great way to ensure your team knows exactly what to do when incidents arise.

Read Post

Cortex

Read more about What is a Runbook? Improve Efficiency and Incident Response

How Meta and Google use AI to improve incident response

Jul 2, 2024 By JJ Tang In Rootly

The world population in 2024 is approximately 8.12 billion people. Of these, 4.3 billion people use Google regularly, while 3.74 billion are active users on Meta's platforms. Any disturbance involving these tech giants will surely make headlines, as seen in Google's recent UniSuper incident. The scale of these tech companies brings fascinating challenges in every aspect of their operations, including incident response.

Read Post

Rootly

Read more about How Meta and Google use AI to improve incident response
