July 2021

The Unique Reliability Engineering Requirements of Microservices

Jul 30, 2021 By JJ Tang In Rootly

Although the fundamental concepts of site reliability engineering are the same in any environment, SREs must adapt practices to different technologies, like microservices.


What are the Four Golden Signals?

Jul 29, 2021 By Blameless In Blameless

SRE’s Golden Signals are four key metrics used to monitor the health of your service and underlying systems. We will explain what they are, and how they can help you improve service performance.


Most frequently asked questions surrounding Google's Cloud Operations Sandbox

Jul 29, 2021 By Nir Sharma In Squadcast

Cloud Operations Sandbox serves as a simulation tool for budding SREs to learn best practices from Google and apply them to real cloud services. In this post, we have compiled a list of FAQs about using Google's Cloud Operations Sandbox. The Google SRE sandbox provides an easy way to get started with the core skills you need to become an SRE.


Reliability Matters. Blameless is Growing with Series B $30M Funding

Jul 27, 2021 By Lyon Wong In Blameless

When Blameless started in 2018, the team set out on a mission to help all engineers achieve reliability with less toil and risk. Three years in, that mission has become more important than ever. What has changed is the rate of SRE adoption: SRE is now the fastest-growing team and practice inside engineering. This represents a clear recognition of the many upsides that an SRE practice brings with its combination of continuous learning, velocity, and resilience.


How to Notify Your Team of Errors: Email vs. Slack vs. PagerDuty

Jul 26, 2021 By LogDNA In Mezmo

Site Reliability Engineering (SRE) and Operations (Ops) teams heavily rely on notifications. We use them to know what’s going on with application workloads and how applications are performing. Notifications are critical to ensuring SREs and Ops teams can resolve errors and reduce downtime. They’re also crucial when monitoring environments — not only when running in production but also during the dev-test or staging phase.


What's the Difference between Observability and Monitoring?

Jul 21, 2021 By Blameless In Blameless

Wondering what the difference is between observability and monitoring? In this post, we explain how they are related, why they are important, and some suggested tools that can help. The difference between observability and monitoring is that observability is the ability to understand a system’s state from its outputs, often referred to as understanding the “unknown unknowns”.


When You Do DevSecOps, Don't Forget the SREs

Jul 21, 2021 By Quentin Rousseau In Rootly

It's time to break down the silos separating SREs from security engineers.


SRE's Guide to Chaos & Observability

Jul 20, 2021 By Gremlin In Gremlin

Today’s distributed, cloud-based environments are incredibly complex. Not only does each component depend on many others, but modern systems are also highly dynamic—changing frequently as teams push new code or make updates to infrastructure. Taming this complexity to ensure reliability requires end-to-end observability to understand how components depend on each other. Additionally, proactive Chaos Engineering combined with AI-driven observability lets you uncover “unknown unknowns” that impact how your system will respond to different failure scenarios.


Upcoming trends in DevOps and SRE

Jul 15, 2021 By Biju Chacko In Squadcast

DevOps and SRE are domains with rapid growth and frequent innovation. In this post, you can explore the latest trends in DevOps and SRE and stay ahead of the curve. The past decade has seen widespread adoption of DevOps methodologies in software development. Unsurprisingly, as the needs of users change, DevOps techniques have evolved as well. Here we look at the trends that are most likely to have a significant impact in the coming years.


De-Siloing Incident Management: How to Make Reliability Engineering Everyone's Job

Jul 15, 2021 By JJ Tang In Rootly

Four best practices for breaking down silos and establishing a culture of shared responsibility for reliability.


Pragmatic Incident Response: 3 Lessons Learned from Failures

Jul 15, 2021 By Robert Ross In FireHydrant

In my past experience as an SRE, I’ve learned some valuable lessons about how to respond to and learn from incidents. First, declare and run retros for the small incidents: it's less stressful, and action items become much more actionable. Second, decrease the time it takes to analyze an incident: you'll remember more and learn more from it. Third, alert on pain felt by people, not computers: the only reason we declare incidents at all is the people on the other side of them.


What is a Blameless Postmortem?

Jul 13, 2021 By Noor-ul-Anam Ruqayya In Blameless

Do blameless retrospectives (or postmortems) help your team? We will explain what they are, whether they really work, and how to do them right. A blameless postmortem (or retrospective) is a post-incident document that helps teams figure out why an incident happened and brainstorm how to improve the process to prevent similar incidents from happening again. In most engineering organizations, everyone agrees that in complex systems, failure is inevitable.


Error Budgets That Work for You. Plus Support for New Relic Metrics and NR Query Language

Jul 8, 2021 By Blameless Community In Blameless

Did you know that error budget policy is the key to making SLOs actionable? In fact, Twitter’s engineering team did not successfully adopt SLOs until they introduced error budgets. SLOs enable teams to quantify customer happiness, and error budgets enable teams to make data-backed tradeoffs between reliability and feature velocity. We believe that teams optimizing for reliability must adopt both.


Rootly Announces $3.2 Million in Seed Funding from XYZ Venture Capital, 8VC, & Y Combinator

Jul 8, 2021 By Quentin Rousseau In Rootly

Rootly is on a mission to create a world where maintaining reliability is frictionless, delightful, and accessible to anyone, making resolving and learning from incidents every organization's superpower.


The Incident Review: 4 Incidents in Outer Space

Jul 6, 2021 By JJ Tang In Rootly

From network problems to computer failures, a variety of incidents can disrupt operations for systems in outer space.


Elephant in the Blameless War Room: Accountability

Jul 1, 2021 By Christina Tan In Blameless

We’ve always advocated that every company can benefit from a blameless culture. Fostering a blameless culture can profoundly boost your organization in powerful ways, from employee retention to developer velocity and innovation. However, there’s an elephant in the room when we talk about blamelessness with executives: accountability. When things go wrong, people still need to get fired, right?

