SRE

The latest news and information on Site Reliability Engineering and related technologies.

What are Blameless Retrospectives? How Do You Run Them?

In most engineering organizations, everyone agrees that in complex systems, failure is inevitable. It’s possible to prevent the recurrence of certain incidents, reduce their impact, or shorten the time to resolution. However, it’s impossible to avoid them altogether. In the past, we asserted that failures were the result of people’s mistakes. It was all about “the bad apple theory,” which focused on finding the “guilty party” and removing them to prevent future failures.

Incident Response Team | Roles & Responsibilities Defined

When your organization faces outages, errors, security breaches, and other incidents, you need a plan in place to take appropriate action. However, you also need a capable team of experts filling critical roles and responsibilities to execute those actions and collaborate effectively to resolve issues quickly. An incident response team should therefore be built in a way that avoids gaps in expertise.

Incident Management Automation - What You Should Know

Automated incident management is the process of automating incident response to ensure that critical events are detected and addressed in the most efficient and consistent manner. In incident management, time is of the essence, and the primary benefit of automation is speed. With automation, you can complete time-consuming tasks much more quickly. This brings down the incident response time and allows the team to focus their attention on matters that require their expertise.

Giving Power Back To The Engineers: A Fireside Chat with MyFitnessPal

The real secret to mastering engineering operations is putting engineers in the driver's seat. On March 26th at 10 am, Chris Karper, Sr. Director of Engineering at MyFitnessPal, joins Chief Reliability Officer Lee Atchison to discuss how MyFitnessPal is overcoming incidents by giving power back to the engineers. They'll explore how Chris has guided MyFitnessPal through its technological advancements, the growth of its team, and the maturing of its incident management program.

Creating an Efficient IT Incident Management Plan: A Guide to Templates and Best Practices

In today's digitally driven landscape, businesses rely heavily on their IT infrastructure to keep operations running smoothly. However, with this reliance comes the inevitability of disruptions such as server outages, security breaches, or software malfunctions. Left unchecked, these incidents can have detrimental effects on productivity and revenue. This is where a well-designed incident management plan becomes indispensable.

SLOs and Customer Experience: Uniting Engineering Excellence with Customer Satisfaction

In the contemporary landscape of fast-paced IT and digital services, where every click, tap, or swipe represents a potential interaction with a customer, the importance of optimizing the customer experience cannot be overstated. Service Level Objectives (SLOs) stand at the intersection of engineering excellence and customer satisfaction, serving as the guiding principles that drive the delivery of exceptional digital experiences.

Amplify Your Response Team's Impact: Introducing Squadcast's Additional Responders

At Squadcast, we're continually striving to empower our users with the tools they need to handle incidents swiftly and effectively. Today, we're thrilled to announce the launch of our latest feature: Additional Responders. This feature marks a significant step forward in enhancing collaboration and coordination during incident response.