March 2024

Enterprise Incident Management: Guide & Best Practices

Mar 29, 2024 By Squadcast In Squadcast

In today's rapidly evolving technological landscape, incident management has become a critical discipline for enterprises to ensure uninterrupted operations and an optimal customer experience. Effective incident management involves a systematic approach to promptly detecting, responding to, and resolving incidents.

What are Blameless Retrospectives? How Do You Run Them?

Mar 29, 2024 By Lee Atchison In Blameless

In most engineering organizations, everyone agrees that failure is inevitable in complex systems. It’s possible to prevent certain incidents from recurring, reduce their impact, or shorten the time to resolution, but it’s impossible to avoid them altogether. In the past, we assumed failures were the result of people’s mistakes. It was all about “the bad apple theory,” which focused on finding the “guilty party” and removing them to prevent future failures.

Incident Response Team | Roles & Responsibilities Defined

Mar 29, 2024 By Lee Atchison In Blameless

When your organization faces outages, errors, security breaches, and other incidents, you need a plan in place to take appropriate action. You also need a capable team of experts filling critical roles and responsibilities to execute that plan and collaborate effectively to resolve issues quickly. An incident response team should therefore be built in a way that avoids gaps in expertise.

Incident Management Automation - What You Should Know

Mar 29, 2024 By Lee Atchison In Blameless

Automated incident management is the practice of automating incident response so that critical events are detected and addressed as efficiently and consistently as possible. In incident management, time is of the essence, and the primary benefit of automation is speed: time-consuming tasks get done much more quickly. This shortens incident response time and lets the team focus on matters that require their expertise.

Uptime.com Webinar Series | Episode 3 | Why Every SRE and DevOps Beginner Needs A Status Page

Mar 27, 2024 By Uptime Website Monitoring In uptime

In this video, we discuss why every SRE and DevOps beginner needs to use a status page.

Giving Power Back To The Engineers: A Fireside Chat with MyFitnessPal

Mar 26, 2024 By Blameless In Blameless

The real secret to mastering engineering operations is putting engineers in the driver's seat. On March 26th at 10 am, Chris Karper, Sr. Director of Engineering at MyFitnessPal, joins Chief Reliability Officer Lee Atchison to discuss how MyFitnessPal is overcoming incidents by giving power back to its engineers. They'll explore how Chris has guided MyFitnessPal through its technological advancements, the growth of its team, and the maturing of its incident management program.

Creating an Efficient IT Incident Management Plan: A Guide to Templates and Best Practices

Mar 22, 2024 By Vishal Padghan In Squadcast

In today's digitally driven landscape, businesses rely heavily on their IT infrastructure to keep operations running smoothly. With this reliance, however, comes the inevitability of disruptions such as server outages, security breaches, or software malfunctions. Left unchecked, these incidents can hurt productivity and revenue. This is where a well-designed incident management plan becomes indispensable.

SLOs and Customer Experience: Uniting Engineering Excellence with Customer Satisfaction

Mar 21, 2024 By Vishal Padghan In Squadcast

In today's fast-paced IT and digital services landscape, where every click, tap, or swipe represents a potential interaction with a customer, the importance of optimizing the customer experience cannot be overstated. Service Level Objectives (SLOs) sit at the intersection of engineering excellence and customer satisfaction, serving as the guiding principles that drive the delivery of exceptional digital experiences.

Everything in software monitoring is dead, apparently

Mar 19, 2024 By Aniket Rao In Last9

Chasing shiny new toys, as always ;)

Boost Your Productivity with AI-Powered Tools

Mar 19, 2024 By Tiffany Cox In Rootly

Explore how AI-powered tools like Slack, Salesforce, Canva, and Rootly are revolutionizing the way we work.

Amplify Your Response Team's Impact: Introducing Squadcast's Additional Responders

Mar 18, 2024 By Vishal Padghan In Squadcast

At Squadcast, we're continually striving to empower our users with the tools they need to handle incidents swiftly and effectively. Today, we're thrilled to announce the launch of our latest feature: Additional Responders. This feature marks a significant step forward in enhancing collaboration and coordination during incident response.

Optimizing On-Call for Incident Management: Preventing Team Burnout with Rootly On-Call

Mar 18, 2024 By Tiffany Cox In Rootly

Rootly On-Call streamlines incident management with automated scheduling, noise reduction, and centralized documentation. It mitigates on-call fatigue with features like flexible overrides, shift visibility, and shadow rotations, enhancing team well-being and preventing burnout.

Bob Lee - Lead DevOps Engineer at Twingate

Mar 15, 2024 By Shubham Srivastava In Zenduty

I was out in sunny Austin this February, speaking at Civo Navigate 2024. The event was jam-packed with amazing talks, and it was great meeting so many people with long and fascinating careers in engineering and site reliability. I had the privilege of meeting Bob Lee, who currently leads DevOps at Twingate, a cloud-based service that provides secure remote access and is poised to replace VPNs.

Strategies for Scaling Systems Reliably by Bob Lee

Mar 15, 2024 By Shubham Srivastava In Zenduty

ROI Demystified: A Deep Dive into What ROI Truly Means for Your Business

Mar 14, 2024 By Vishal Padghan In Squadcast

The term ROI (Return on Investment) often gets thrown around without a thorough understanding of its implications. Many see it merely as a financial metric, but in reality, ROI encompasses much more than monetary gains. In this comprehensive exploration, we delve into the true essence of ROI, its multifaceted nature, and how it impacts every aspect of your business strategy.

The Role of the SRE in the Incident Management Process

Mar 14, 2024 By Lee Atchison In Blameless

In modern businesses, where IT systems play a major role in every type of organization, the Site Reliability Engineer (SRE) has become central to managing the effectiveness and reliability of the entire business. SREs bridge the rapid deployment of software and systems with the stable operation of those systems in production. They ensure that reliability and performance criteria are defined and met.

From Deploy to Commit: Building the Ultimate Development Pipeline - A Comprehensive Guide

Mar 13, 2024 By Chitra Bisht In Squadcast

‘Manual deployment is (should be) a sin.’ Well, calling manual deployment a sin may sound strong, but consider this: building the ultimate development pipeline demands a focus on automation. Although the selection of a deployment method depends on the specific needs and requirements of a project or environment, can you really deny the power of automated deployment? There's a better way.

How Squadcast's Snooze Incidents Promotes Focussed On Call Shifts

Mar 12, 2024 By Chitra Bisht In Squadcast

Dealing with a flood of incidents, each with varying degrees of urgency, can be a daily struggle for Incident Response teams. Suppose a low-priority alert pings while you're tackling a critical incident. This pulls your focus away from the urgent issue. This constant alert bombardment can: How do engineers ensure that high-severity issues take precedence? Don't they want to avoid being bothered or bombarded with notifications while addressing critical matters? They sure do.

Introducing six Rootly AI features: focus on the incident, leave the paperwork to us

Mar 12, 2024 By JJ Tang In Rootly

Say hello to smarter incident management with smart summaries, mitigation message suggestions, and our new conversational assistant! 🚀✨

IT Incidents and the Role of Incident Response Teams (IRTs)

Mar 11, 2024 By Anjali Udasi In Zenduty

The digital world comes with both advantages and inherent risks. Among those risks are IT incidents, which can encompass cyberattacks, system outages, and data breaches, and which can have a devastating impact. Beyond financial losses, IT incidents disrupt business operations, damage reputations, and erode customer trust. During an outage, a well-prepared Incident Response Team (IRT) is essential for reducing downtime and improving response times.

Software Monitoring - Stuck in the 00s

Mar 8, 2024 By Piyush Verma In Last9

A short history of software monitoring, from the 00s. What has changed? Why are things so arcane?

Next-Gen Incident Management: Blueprints for High-Powered Incident Response

Mar 8, 2024 By Blameless In Blameless

Join us for an exclusive webinar for IT Operations leaders, SREs, and DevOps and software engineering leaders, featuring Jim Gochee, CEO of Blameless; Ken Gavranovic, COO of Blameless; and Nick Mason, Principal Sales Engineer at Blameless. Uncover the technical scaffolding needed to move your incident management strategy forward faster, dive deep into the core technical components of a robust incident response framework, and see firsthand how generative AI can save your team hours during critical incidents.

5 Easy Ways to Reduce Work-Related Stress for SRE Professionals

Mar 6, 2024 By Tiffany Cox In Rootly

It's completely normal to feel a little overwhelmed and stressed out at work these days. Technology has collaboration moving at the speed of light, and time away from screens is at an all-time low, blurring the lines between work and personal time. Plus, it's hard to ignore the multitude of tech outages that have been making headlines lately, leaving teams anxiously on edge. When you are a professional with on-call cycles, the potential of outages adds another level of complexity to the mix.

Our Schedules Just Got Better for You!

Mar 6, 2024 By Menahi Shayan In Zenduty

We’ve come a long way from the first On-call schedule editor we built, and this new release takes the user experience for Zenduty Schedules a hundred steps further! (…103 steps further precisely, says our internal changelog)

The Role of APM in DevOps and SRE Practices

Mar 5, 2024 By Keren Feldsher In Coralogix

As software development accelerates, enterprises must adapt to customer demands by deploying their applications more frequently. They often rely on DevOps and Site Reliability Engineering (SRE) methodologies to achieve this. These approaches maintain high system availability amid frequent deployments and prioritize a seamless user experience.

Trade-off Between Reliability and Feature Velocity

Mar 1, 2024 By Anjali Udasi In Zenduty

The pressure to constantly innovate and release new features can often clash with the need for a stable and reliable product. While there might be some temporary cutbacks in testing time to achieve high feature velocity, ensuring reliability doesn't have to be an afterthought. We reached out to industry experts to gather their insights on ensuring reliability during phases that demand high feature velocity. Here's what they had to say.

How do you build resilient systems to manage the IPL with 30+ million concurrent users?

Mar 1, 2024 By Last9 In Last9

The Indian Premier League is a unique sporting event for a dozen reasons, but for engineers in India, it’s one of a kind. Very few companies can boast of managing 30+ million concurrent users, and the number grows every year: last year, we saw ~60 million concurrent users.
