The latest News and Information on DevOps, CI/CD, Automation and related technologies.

Proactively monitor service performance with SLO alerts

Service level objectives (SLOs) state your team’s goals for maintaining the reliability of your services. Adopting SLOs is an SRE best practice because it can help you ensure that your services perform well and consistently deliver value to users. But to gain the greatest benefit from your SLOs, you need ongoing visibility into how well your services are performing relative to your objectives.

What I learned from leading my first incident

A few weeks ago we had a major incident. We were releasing our Practical Guide to Incident Management, and after we posted about it online, an incident.io employee noticed that the page wasn’t loading. Just to set the scene, I’ve been at incident.io for 3 months and didn’t have any experience of incidents in my previous role. When the team got paged, I expected this to be one of those “follow along and learn how the wizards work their magic” exercises.

A CFO's Guide To Evaluating Cloud Spend

We have a term we like to use when we meet CFOs who have just gotten their biggest AWS bill ever: bill shock. Bill shock is when finance suddenly rings the alarm that the bill is “too high” and gets everyone scrambling to explain what they’re spending money on. It often happens when the bill reaches a new milestone (the first million, ten million, or hundred million) or growth trajectory (it doubled in a quarter!?). The problem with bill shock is that it can be highly disruptive.

Change Failure Rate explained

This post is the third in a series of deeper dive articles discussing DORA metrics; previous articles covered two of the other metrics. The third metric we’ll examine, Change Failure Rate, is a lagging indicator that helps teams and organizations understand the quality of software that has been shipped, providing guidance on what the team can do to improve in the future.
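As a rough illustration, Change Failure Rate is commonly computed as the share of production deployments that led to a failure. The sketch below assumes that definition; the `Deployment` record and its `caused_failure` flag are hypothetical names chosen for the example, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    name: str
    caused_failure: bool  # assumed flag: True if the change needed a fix or rollback


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Return failed deployments divided by total deployments (0.0 if none)."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_failure)
    return failed / len(deployments)


history = [
    Deployment("release-1", caused_failure=False),
    Deployment("release-2", caused_failure=True),
    Deployment("release-3", caused_failure=False),
    Deployment("release-4", caused_failure=False),
]
print(f"Change Failure Rate: {change_failure_rate(history):.0%}")
```

What counts as a "failure" (rollback, hotfix, incident) varies between teams, so the flag would need to be defined before numbers are compared.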

Insurance Provider Reduces Software Licensing Costs, Saving Millions

A large U.S.-based insurance provider was experiencing rising database software licensing costs. To reduce these costs, the organization needed to complete a comprehensive infrastructure analysis of over 200 physical servers. Seventy-five percent of these physical servers supported a single software application, the organization’s database solution. Additionally, the software routinely used only two to four cores, even though each server had 24.
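The figures quoted above imply a low core utilization, which a few lines of arithmetic make concrete. This is only a sketch: it treats "over 200" as exactly 200 servers (an assumption, so the server count is a lower bound), and the constant and function names are made up for the example.

```python
# Figures taken from the excerpt; 200 is a lower bound ("over 200" servers).
TOTAL_SERVERS = 200
DATABASE_SHARE = 0.75       # share of servers supporting the database
CORES_PER_SERVER = 24
CORES_USED_RANGE = (2, 4)   # cores the software routinely used


def core_utilization(cores_used: int, cores_total: int = CORES_PER_SERVER) -> float:
    """Fraction of a server's cores that are actually in use."""
    return cores_used / cores_total


database_servers = int(TOTAL_SERVERS * DATABASE_SHARE)
low, high = (core_utilization(c) for c in CORES_USED_RANGE)

print(f"Database servers: at least {database_servers}")
print(f"Core utilization per server: {low:.1%} to {high:.1%}")
```

How unused cores translate into licensing cost depends on the vendor's licensing terms, which the excerpt does not state.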

Classifying Severity Levels for Your Organization

Major outages are bound to occur in even the most well-maintained infrastructure and systems. Being able to quickly classify the severity level allows your on-call team to respond more effectively. Imagine a scenario where your on-call team is getting critical alerts every 15 minutes, user complaints are piling up on social media, and, since your platform is inoperative, revenue losses are mounting every minute. How do you go about getting your application back on track?

Speedscale Traffic Replay is now v1.0

Nate Lee here, and I’m one of the founders of Speedscale. The founding team has worked at several observability and testing companies like New Relic, Observe Inc, and iTKO over the last decade. Speedscale traffic replay was born out of frustration with reacting to problems (even minor ones) that could have been prevented with better testing.