Latest Posts

What is Reliability Management?

Oct 20, 2022 By Andre Newman In Gremlin

Measuring and improving the reliability of technical systems has always been challenging. As an industry, we've developed several practices to address reliability concerns, such as incident response, observability, and Chaos Engineering. This led SREs and service owners to measure reliability in a handful of ways.

Setting better SLOs using Google's Golden Signals

Oct 11, 2022 By Andre Newman In Gremlin

To many engineers, the idea that you can accurately and comprehensively track your application's user experience using just a few simple metrics might sound far-fetched. Believe it or not, there are four metrics that aim to do just that. They're called the four Golden Signals and should be a core part of your observability and reliability practices.

What is a "service" in a microservices architecture?

Sep 2, 2022 By Andre Newman In Gremlin

The past ten years marked a significant change in how software teams build and deploy applications. We moved away from bulky, slow, monolithic applications toward lightweight, scalable, distributed service-based applications. Meanwhile, tools like Docker, Kubernetes, and other container platforms helped accelerate this process. Despite this sudden growth, a fundamental question remains: what exactly is a service, and how does it fit into a microservice architecture?

What are the four Golden Signals?

Sep 2, 2022 By Andre Newman In Gremlin

When it comes to building reliable and scalable software, few organizations have as much authority and expertise as Google. Their Site Reliability Engineering book, first published in 2016, details the practices they used to maintain reliability as Google scaled. But when you have over a million servers running thousands of services across more than twenty data centers, how do you monitor them in a consistent, logical, and relevant way?

Four tests to measure and improve reliability: what matters and how it works

Sep 2, 2022 By Andre Newman In Gremlin

Legendary race car driver Carroll Smith once said, "until we have established reliability, there is no sense at all in wasting time trying to make the thing go faster." Even though he was referring to cars, the same goes for technology: no amount of code optimization or new features can replace stable systems. Unfortunately, much like race cars, it's hard to know that a system is unreliable until it blows a tire, the brakes stop working, or the steering wheel comes off the column.

How to define and measure the reliability of a service

Jul 14, 2022 By Andre Newman In Gremlin

More and more teams are moving away from monolithic applications and towards microservice-based architectures. As part of this transition, development teams are taking more direct ownership over their applications, including their deployment and operation in production. A major challenge these teams face isn't in getting their code into production (we have containers to thank for that), but in making sure their services are reliable.

How Gremlin's reliability score works

Jul 14, 2022 By Andre Newman In Gremlin

To make reliability improvements tangible, there needs to be a meaningful way to quantify and track the reliability of systems and services. This "reliability score" should indicate at a glance how likely a service is to withstand real-world causes of failure, without having to wait for an incident to happen first. Gremlin's upcoming feature allows you to do just that.

Chaos Engineering Tools: Build vs Buy

Jul 8, 2022 By Gremlin In Gremlin

Chaos Engineering, where engineers intentionally inject failure to test the reliability of their systems, is becoming a regular practice for companies that value uptime and availability. As cloud-based systems have grown more complex, Chaos Engineering has become a critical part of the software testing and release process: it helps teams uncover surprise dependencies, fix problems before they become 3 a.m. outages, and bake reliability into every feature.

Podcast: Break Things on Purpose | Developer Advocacy and Innersource with Aaron Clark

Jun 14, 2022 By Jason Yee In Gremlin

In this episode, Jason chats with Aaron Clark, Director of Developer Advocacy at the Royal Bank of Canada. Aaron shares what it was like starting out as a developer at RBC and working in early cloud development, and then transitioning to his role as a developer advocate. Jason and Aaron talk about the value of applying open source principles within organizations, or “innersource.” The episode ends with a discussion on continuing education and how to keep learning.

Gartner: tips for improving reliability

Jun 6, 2022 By Andre Newman In Gremlin

In their report titled “IT Resilience — 7 Tips for Improving Reliability, Tolerability and Disaster Recovery”, Gartner presents seven strategies for improving the resilience posture of your critical systems. These recommendations range from how to get started, to identifying IT hazards and risks to reliability, to capturing metrics and translating them into business value. In this blog, we’ll take a high-level look at the report and summarize some of its key findings.

