April 2023

Sponsored Post

Scaling Site Reliability Engineering Teams the Right Way

Most SRE teams eventually reach a point where they appear unable to meet all the demands placed upon them. This is when a team may need to scale. However, it's important to understand that increasing team capacity is not the same as increasing the number of people on the team. Let's unpack what scaling a team is all about, what the indicators are, what steps you can take, and how to know when you're done.

SRE vs. DevOps vs. Platform Engineering: What's The Difference?

SRE, DevOps, and Platform Engineering are important concepts in today's world of software development. Dedicated teams manage each of these areas, each with its own primary focus, responsibilities, tools, and metrics for gauging performance. This article explains SRE, DevOps, and Platform Engineering, including their similarities and differences, and, most importantly, how these teams help streamline modern software development, delivery, and maintenance processes.

2023 SRE Report

Now in its fifth year, The SRE Report has become the trusted source of trends and insights for reliability-as-a-feature practices. Produced this year in partnership with Blameless, the report contains special contributions from Adrian Cockcroft and Steve McGhee and highlights findings from a global community of reliability practitioners, including SREs, managers, architects, and executives. As ever, we found some familiar trends and some thought-provoking anti-patterns.

Install Prometheus on Kubernetes: Tutorial & Examples

As one of the most popular open-source Kubernetes monitoring solutions, Prometheus leverages a multidimensional data model of time-stamped metric data and labels. The platform uses a pull-based architecture to collect metrics from various targets. It stores the metrics in a time-series database and provides the powerful PromQL query language for efficient analysis and data visualization.
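
To make the label-based data model concrete, here is a minimal sketch of how a single time-stamped sample could be rendered in the Prometheus text exposition format, which is what a target serves when Prometheus pulls from it. The helper name format_sample, the metric http_requests_total, and its labels are hypothetical illustrations, not taken from the tutorial.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def format_sample(name, labels, value, timestamp_ms=None):
    """Render one sample as a line of Prometheus text exposition format.

    name, labels, and value are illustrative inputs (assumptions);
    timestamp_ms is an optional timestamp in milliseconds since the epoch.
    """
    line = name
    if labels:
        # Sort labels so the output is deterministic.
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        line += "{" + pairs + "}"
    line += f" {value}"
    if timestamp_ms is not None:
        line += f" {timestamp_ms}"
    return line


if __name__ == "__main__":
    print(format_sample(
        "http_requests_total",
        {"method": "GET", "status": "200"},
        1027,
        1680000000000,
    ))
```

Each distinct combination of label values forms its own time series, which is what lets a PromQL query such as `sum by (status) (rate(http_requests_total[5m]))` slice the same metric along any label.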

Incident Response Guide

Site reliability engineering (SRE) is a critical discipline that focuses on ensuring the continuous availability and performance of modern systems and applications. One of the most vital aspects of SRE is incident response, a structured process for identifying, assessing, and resolving system incidents that can lead to downtime, revenue loss, and brand reputation damage.

Is your incident management solution creating more problems than it solves?

When it comes to incident response, the ability to adapt and customize your approach is key. Every organization has unique needs and workflows, and a one-size-fits-all solution simply won't cut it. That's why Blameless is proud to offer a flexible platform that allows teams to tailor their incident response process to fit their exact requirements.

Squadcast + HaloPSA Integration: Enabling Streamlined Incident Response & Alerting

HaloPSA is a modern and intuitive all-in-one professional services automation (PSA) solution designed for service providers. HaloPSA’s cloud platform helps you manage your entire business, modernize customer experience, and automate your service. If you use HaloPSA for your PSA needs, you can integrate it with Squadcast, an end-to-end Incident Response and Reliability Workflow platform, to route detailed alerts from HaloPSA to the right users in Squadcast.

Platform Engineering 101: Origins, Goals, DevOps vs SRE & Best Practices

Platform engineering is the practice of automating infrastructure operations and enabling self-service infrastructure capabilities within collaborative Dev, Ops and QA teams. It involves designing and building platforms, technologies and workflows that enable self-service capabilities to automatically manage, provision and operate complex modern software architecture environments.

Reduce time to detect with AppDynamics Cloud Log Analytics

How machine learning in AppDynamics Cloud accelerates log analysis and reduces mean time to detect. Site reliability engineers (SREs) need to investigate unknown problems reported in production. The common approach is to search and filter log files to find the root cause, and we all know how painful it is to sift through log contents. It’s like finding a needle in a haystack. A machine learning approach is essential to help SREs quickly identify the root cause.