SRE

The latest news and information on Site Reliability Engineering and related technologies.

What is clinical troubleshooting? #incidentmanagement #incidentresponse #sitereliabilityengineering

May 8, 2024 By Incident.io In Incident.io

In this clip, Dan Slimmons explains what the clinical troubleshooting framework entails. It’s no secret that teamwork, when done right, can make a world of difference. So when responding to a particularly complicated incident, it can be best to bring a team together to figure out what’s going on and work toward a fix. But it’s not enough to jam a bunch of folks into a room and hope for the best. You need a framework in place to ensure that everyone stays focused, diagnoses the issue, and resolves it as quickly as possible.

Learning is an iterative process #incidentmanagement #incidentresponse #sitereliabilityengineering

May 8, 2024 By Incident.io In Incident.io

In this clip, Viktor Stanchev explains why it's important to remember that learning is an iterative process. Whether you’re a seasoned vet when it comes to incident response or just getting started, it can be easy to fall into the trap of doing too much all at once. And that makes sense. Incident response doesn’t have a single, perfect formula, so teams can be left doing a little bit of everything in an effort to get it right.

It's better to declare incidents early #incidentmanagement #sitereliabilityengineering

May 8, 2024 By Incident.io In Incident.io

In this clip, Viktor Stanchev explains why it's better to declare incidents early rather than late. Whether you’re a seasoned vet when it comes to incident response or just getting started, it can be easy to fall into the trap of doing too much all at once. And that makes sense. Incident response doesn’t have a single, perfect formula, so teams can be left doing a little bit of everything in an effort to get it right.

Elastic's RAG-based AI Assistant: Analyze application issues with LLMs and private GitHub issues

May 8, 2024 By Bahubali Shetti In Elastic

For an SRE, analyzing applications is more complex than ever. Not only do you have to ensure the application is running optimally to deliver great customer experiences, but in some cases you must also understand its inner workings to help troubleshoot. Analyzing issues in a production service is a team sport. It takes SRE, DevOps, development, and support teams to get to the root cause and potentially remediate. If the issue is impacting customers, it's even worse because there is a race against time.

Advanced Incident Management Strategies for Engineers

May 7, 2024 By Chitra Bisht In Squadcast

The business world is in constant flux, and the way we handle Incident Management (IM) needs to evolve alongside it. Incidents come in all priorities and urgencies, and while some can be addressed with planning, others are simply unpredictable. That's why businesses can't afford to be caught off guard. The potential consequences of such incidents have never been greater. A single event can disrupt operations, damage reputations, and result in significant financial losses.

Remote Team Rotations: On-Call Across Timezones

May 3, 2024 By Jorge Lainfiesta In Rootly

Use the different timezones and varied needs of your team to schedule on-call rotations that make everyone happy.

Automation Triumphs: Real-World DevOps Automation Implementations

Apr 30, 2024 By Chitra Bisht In Squadcast

Remember the pre-automation days in DevOps? Endless server configurations, manual deployments that took hours (or days!), and a constant feeling of being buried in repetitive tasks. Yeah, those were the times... Thankfully, those days are fading fast. The magic of automation has swept through the DevOps landscape, transforming tedious workflows into streamlined processes.

Reinventing Deployments: From Docker to Dagger -- Incidentally Reliable with Solomon Hykes

Apr 30, 2024 By Zenduty In Zenduty

Catch Solomon Hykes (Co-founder of @Docker and @Dagger) as he shares stories from the early days of Docker, the rollercoaster journey leading to 20 million active developers worldwide, the heavy crown of a tech leader, and his vision to revolutionize CI/CD with Dagger today. Exclusively on The Incidentally Reliable podcast, made by SREs for SREs and hosted by Zenduty.

Elevating Engineering Excellence: The Imperative of Site Reliability for Every Engineer

Apr 29, 2024 By Vishal Padghan In Squadcast

In the ever-evolving landscape of technology, engineers are the architects of the digital world. Their expertise shapes the platforms, applications, and services that define our daily interactions with technology. Yet, in the pursuit of innovation and functionality, there's one crucial aspect that often takes a backseat—site reliability. Site reliability engineering (SRE) has emerged as a critical discipline in the realm of software development and operations.
