SRE

The latest news and information on Site Reliability Engineering and related technologies.

Datadog on Site Reliability Engineering #shorts #datadog #observability

Apr 3, 2024 By Datadog In Datadog

There are many different ways to implement Site Reliability Engineering (SRE). From team structures to roles and responsibilities to planning and prioritization flows, there’s no golden path for how to organize things. As Datadog has shifted from a startup to a fast-growing public company, we’ve seen our own SRE practice evolve. With over 22,000 customers sending trillions of data points each day, keeping Datadog reliable is critical to our business.

An SRE's Most Important Skill? Communication

Apr 3, 2024 By Nočnica Mellifera In Checkly

I wish someone had told me that I shouldn’t hop between frameworks. Just like learning four programming languages in your first year, in my experience, spending time context switching as a beginner is wasted effort. If I’d spent a solid year learning how to deploy services on AWS, then when it was time to learn Azure, I’d see more similarities than differences and find it a lot easier to pick up a second public cloud.

How Incidents Foster Leadership

Apr 3, 2024 By Zhuang (Strong) Liang In Rootly

To become battle-tested, you need to go through battles, not just read books or mentor newcomers. Both are helpful, but the stakes are low. On the other hand, high-stakes jobs, such as running a big project or managing a team, are hard to get when you lack experience. So how can we solve this dilemma? Enter incident response.

2024 SRE Report Insights: The Critical Role of Third-Party Monitoring in SRE

Apr 2, 2024 By Denton Chikura In Catchpoint

The 2024 SRE Report highlights a pivotal shift in how organizations approach the reliability and monitoring of their services, especially those that extend beyond their direct control. According to the report, 64% of organizations now recognize the importance of monitoring productivity- or experience-disrupting endpoints, even those beyond their physical control.

Unleashing the Change Maker Within Webinar Preview

Apr 2, 2024 By Blameless In Blameless

Join us on April 16th at 10 a.m. PT for Unleashing the Change Maker Within, a 60-minute live webinar where we'll discuss the secrets to driving change in your organization. We'll tackle two of reliability's biggest issues: getting budget and garnering support. We'll show you how to empower yourself to drive organizational change. Discover the secrets to selling your boss on the tools you need to automate your workflow and streamline your processes. We'll equip you with the strategies and insights to turn your great ideas into actionable plans.

Why and how to use site reliability golden signals

Apr 1, 2024 By Cortex In Cortex

Software complexity makes it harder for teams to rapidly identify and resolve issues. IT service management has evolved from an afterthought to a central part of DevOps. Microservices architectures are prone to delayed or missed identification of such issues. Monitoring mechanisms need to keep up with these complex infrastructures. Maintaining reliability and performance while harnessing this complexity requires a considered, data-driven approach.

Future-Proofing IT Operations: Charter's Journey to Enhanced Reliability with Squadcast

Apr 1, 2024 By Squadcast In Squadcast

Discover the transformative journey of Charter, a leader in global IT services, towards achieving unmatched operational reliability through the strategic use of Squadcast in this insightful webinar recording. Chris Ardagh from Charter shares valuable insights and experiences, highlighting how advanced incident management practices with Squadcast have allowed the organization to redefine benchmarks in reliability engineering.

Enterprise Incident Management: Guide & Best Practices

Mar 29, 2024 By Squadcast In Squadcast

In today's rapidly evolving technological landscape, incident management has become a critical discipline for enterprises to ensure uninterrupted operations and an optimal customer experience. Effective incident management involves a systematic approach to promptly detecting, responding to, and resolving incidents.

What are Blameless Retrospectives? How Do You Run Them?

Mar 29, 2024 By Lee Atchison In Blameless

In most engineering organizations, everyone agrees that in complex systems, failure is inevitable. It’s possible to prevent the recurrence of certain incidents, reduce their impact, or shorten the time to resolution. However, it’s impossible to avoid them altogether. In the past, we assumed that failures were the result of people’s mistakes. It was all about “the bad apple theory,” focused on finding the “guilty party” and removing them to prevent future failures.

Incident Response Team | Roles & Responsibilities Defined

Mar 29, 2024 By Lee Atchison In Blameless

When your organization faces outages, errors, security breaches, and other incidents, you need to have a plan in place to take appropriate actions as needed. However, you also need a capable team of experts filling critical roles and responsibilities to execute those actions and effectively collaborate to resolve issues quickly. An incident response team should therefore be built in a way that avoids gaps in expertise.
