SRE

The latest news and information on Site Reliability Engineering and related technologies.

Lightstep Notebooks helps speed troubleshooting for SREs and developers

Digital business is an imperative for 21st-century companies. Increasingly, organizations are directing investments toward technologies that deliver outcomes fast and enable more resilient digital business models. In this landscape, incidents such as software bugs, power outages, or downed networks have major consequences that affect both revenue and customer loyalty.

How To Prepare for a Site Reliability Engineer (SRE) Interview

Site reliability engineering continues to gain traction in software development and IT. SRE is at the crossroads of software development and IT operations. In Ben Treynor’s words, SRE is “what happens when you ask a software engineer to design an operations function.” Site reliability engineering is a way for developers to actively build services and functions to improve the resilience of people, processes and technical systems.

Webinar Recap: How to Avoid Being On Call With Under-Instrumented Tools

“It’s too expensive!” “Do we really need another tool?” “Our APM works just fine.” With strapped tech budgets and an abundance of tooling, it can be hard to justify a new expense, or something new for engineers to learn, especially when they feel their current tool does the job adequately. But does it?

Product Roundup: New Blameless Features in June 2022

Summer means things are heating up. And things are definitely heating up at Blameless! We’ve been hard at work delivering new features and capabilities to our customers, so today I wanted to share a quick summary of the latest. Here are four product updates that enhance the way teams manage incidents and deliver reliable products to their customers.

The value of blameless culture - from IC to C-Suite

At CircleCI, CI has a second meaning: Continuous Improvement. We continuously seek out feedback not only to improve our code but to improve our processes and get better at our jobs along the way. This Continuous Improvement starts with one important company value: a blameless culture. Our blameless culture extends into every part of how we operate.

Site Reliability Engineering (SRE) Survey Now Open for 2022 - Calling All Reliability Practitioners and Leaders

Now in its fifth year, The SRE Survey, sponsored by Catchpoint in partnership with Blameless, aims to uncover new trends and challenges for teams focused on advancing the reliability of digital products.

Squadcast Product Demo | Incident Management | On-call | SRE | Status Page | SLO Tracker | Runbooks

This video explains why Squadcast is a feature-rich solution for SRE, DevOps, and Engineering teams in general. With the ability to help teams quickly mobilize response teams during critical incidents, easily manage on-call schedules, and track SLOs for better SRE, Squadcast is a multi-purpose platform with numerous capabilities. This short video covers everything the product is capable of.

Setting up Route 53 Health Checks

We live in an age where the internet and digital data drive modern-day markets, which results in huge amounts of data being generated and consumed. It has therefore become very important for online platforms to manage this traffic and serve their customers more efficiently. In this blog post we will explore the Amazon Route 53 service and see how it addresses domain name system routing and health check problems.