Latest Posts

SRE: From Theory to Practice | What's difficult about incident command

A few weeks ago we released episode two of our ongoing webinar series, SRE: From Theory to Practice. In this series, we break down a challenge facing SREs through an open and honest discussion. Our topic this episode was “what’s difficult about incident command?” When things go wrong, who is in charge? And what does it feel like to do that role?

A Chat with Lex Neva of SRE Weekly

Since 2015, Lex Neva has been publishing SRE Weekly. If you’re interested enough in reading about SRE to have found this post, you’re probably familiar with it. If not, there are a lot of great articles to catch up on! Lex selects around 10 entries from across the internet for each issue, focusing on everything from SRE best practices to the socio- side of systems to major outages in the news. I had always figured Lex must be among the most well-read people in SRE, and likely #1.

How The Experts Build Reliable Cloud Apps

We live in the cloud era, where your services don’t live in machines in your garage, but are spread across huge data centers around the world. Cloud providers can help meet increasing demands for reliability – for example, they offer dynamic resource allocation that can handle usage spikes. At the same time, going cloud native means not having a physical server onsite that you can fiddle with, introducing its own unique challenges.

How to Achieve Measurable Reliability Results

Reliability is more important than ever. As users depend on services more and more, and competition in every sector grows, a great digital experience becomes the baseline for expectations, not the ceiling. It’s crucial to invest in making your software reliable enough to keep customers happy. But what does investing in reliability look like?