March 2022

SRE Toil | What It Is & Top Tips To Reduce It

Mar 31, 2022 By Myra Nizami In Blameless

Affected by SRE toil? We define what SRE toil is, discuss how it can adversely affect your productivity, and tell you the best techniques to reduce it.

New StackPod Episode: Implementing an SRE Practice with Yousef Sedky of Axiom/Hyke

Mar 31, 2022 By Annerieke Kortier In StackState

For our latest StackPod episode, we invited Hyke’s DevOps team lead and AWS Cloud architect, Yousef Sedky. Axiom Telecom is one of the largest telephone retailers in the United Arab Emirates and Saudi Arabia, and Hyke, its sister company, is a distribution platform for mobile products.

SRE vs. Platform Engineering: The Key Differences, Explained

Mar 29, 2022 By JP Cheung In Rootly

Site Reliability Engineering (SRE) teams and Platform Engineering teams share similar goals -- like maximizing automation and reducing toil -- and similar methodologies. But they have different priorities, and use somewhat different tools to achieve them. What are SREs, what are platform engineers and how is each role similar and different? This article explains.

DevOps Monitoring Tools

Mar 29, 2022 By Emily Arnott In Blameless

Wondering about DevOps monitoring tools? We explain what DevOps monitoring is, the tools you need, how they work, and their pros and cons. What are DevOps monitoring tools? DevOps monitoring tools are used to track application performance, potential system vulnerabilities, infrastructure health, and other performance metrics.

HugOps During Downtime: Building Empathetic Teams

Mar 29, 2022 By Aimee Pearcy In Reliably

While DevOps focuses on software, HugOps focuses on the people behind the software. HugOps is a way to show empathy and appreciation for the real people who are involved in building, shipping, and running software. It’s a way to acknowledge and celebrate those – the Site Reliability Engineers (SREs), SysAdmins, Engineers, and Support Staff – who are working tirelessly behind the scenes to keep the services that we rely on running smoothly.

How important is Observability for SRE?

Mar 27, 2022 By Ricardo Castro In Squadcast

Observability is what defines a strong SRE team. In this blog, we have covered the importance of observability, and how SREs can leverage it to enhance their business. Observability is the practice of assessing a system's internal state by observing its external outputs. Through instrumentation, systems can provide telemetry such as metrics, traces, and logs that help organizations better understand, debug, maintain and evolve their platforms.

Rundeck + Squadcast Integration: Simplifying Alert Routing

Mar 25, 2022 By Vishal Padghan In Squadcast

Rundeck is an automation tool that helps to make existing automation, scripts, and commands more secure, auditable, and easier to run. It is a software job scheduler and runbook automation system that automates routine processes across development and production environments. It brings together task scheduling, multi-node command execution, and workflow orchestration. It also logs everything that happens in the system. Squadcast is an end-to-end incident response tool.

SREcon 2022 Americas Wrap Up

Mar 24, 2022 By Emily Arnott In Blameless

Hi everyone! We had a fantastic time at SREcon 2022 Americas last week, and I thought I’d share our stories and experiences. As the SRE community grows and evolves, these chances for collaboration become more and more important… and fun! Although I only attended virtually, I could still feel an exciting atmosphere as great minds came together.

SolarWinds Orion + Squadcast: Alert Routing Made Easy

Mar 24, 2022 By Vishal Padghan In Squadcast

SolarWinds Orion is a scalable infrastructure monitoring and management platform. It is designed to simplify IT administration for on-premises, hybrid, and software as a service (SaaS) environments, in a single pane of glass. SolarWinds Orion ensures you do not have to struggle with numerous incompatible point monitoring products, as it consolidates the full suite of monitoring capabilities into one platform with cross-stack integrated functionality. Squadcast is an end-to-end incident response tool.

Shift Left Reliability Meetup March - Reliability patterns for serverless applications

Mar 24, 2022 By Reliably In Reliably

Serverless technologies offer a great foundation for building resilient applications that can withstand much turbulence in the production environment. For instance, AWS Lambda automatically deploys your code to three availability zones and replaces faulty virtual machines on the fly. Despite this, there are many other types of failures that can still affect our application. Perhaps there is an outage with a third-party service we depend on, or maybe a sudden surge in traffic has pushed us over the throughput limit and caused some user requests to be throttled.

View Video

Beyond the 4 SRE Golden Signals

Mar 24, 2022 In Blameless

SRE's Golden Signals are four key metrics used to monitor the health of your service and underlying systems. We will explain what they are, and how they can help you improve service performance.

Get EBook

Implementing Service Reliability In The World Of Remote Teams

Mar 23, 2022 By Steve Wade In Reliably

In this new era that we are moving into, what does successful reliability look like for modern teams, and what are the requirements that will enable us to bring better reliability to our applications and systems? With new ways of working, we explore how organizations should implement better service reliability and the different challenges teams are facing.

Five Phases Of Effective Reliability Within Organizations

Mar 23, 2022 By Steve Wade In Reliably

Reliability is important to everybody in a business. There’s a common misconception that it’s just important to engineers. We must change this mindset and think of reliability as a team sport that everyone needs to be part of. As an organization, there are five key phases to implementing effective reliability across teams.

Three Pillars Of Service Reliability

Mar 23, 2022 By Steve Wade In Reliably

In order to achieve high levels of reliability for services and products, businesses should consider the three fundamental pillars of reliability: monitoring, release engineering and simplicity.

What Is Site Reliability Engineering (SRE)? The SRE Role Explained

Mar 22, 2022 By Joey D'Antoni In SolarWinds

Historically, there was a clear delineation between what system administrators (SysAdmins) do and what application developers are responsible for in IT organizations. In recent years—especially in organizations focused on software development—these worlds have come together as IT operations and development teams adopt DevOps practices. The concept of site reliability engineering (SRE) was first introduced by a much-discussed book titled Site Reliability Engineering from Google.

SRE Revisited: SLO in the age of Microservices

Mar 18, 2022 By Dotan Horovits In logz.io

Site Reliability Engineering (SRE) practice was established by Google nearly 20 years ago, and was popularized with Google’s monumental SRE Book. Everyone’s been attempting to follow that iconic path ever since.

Honeycomb + Squadcast Integration: Routing Incident Alerts Made Easy

Mar 18, 2022 By Vishal Padghan In Squadcast

Honeycomb is an application monitoring tool that helps DevOps and SRE teams to operate more efficiently by offering rich observability solutions and intuitive team collaboration. It helps understand complex relationships within your distributed systems and troubleshoot issues accordingly. Squadcast is an end-to-end incident response tool. Built with an SRE mindset, it streamlines all the incident response activities.

SRE Metrics: Four Golden Signals of Monitoring

Mar 18, 2022 By Stephen Watts In Splunk

SRE (site reliability engineering) is a discipline used by software engineering and IT teams to proactively build and maintain more reliable services. SRE is a functional way to apply software development solutions to IT operations problems. From IT monitoring to software delivery to incident response, site reliability engineers are focused on building and monitoring anything in production that improves service resiliency without harming development speed.

DevOps vs SRE - Reducing Technical Debt and Increasing Efficiency and Resiliency

Mar 18, 2022 By Ravi Lachhman In Shipa

One more blog topic stemming from the weekly office hours that we hold with the field team here at Shipa. In our last office hours, I was asked, “What is the difference between DevOps Engineers and SREs?” Both professions are emerging disciplines and cultures that continue to evolve and play an important role in technology organizations. I’ve been fortunate to have written and spoken about this before, though this post takes a fresh look at what the two domains try to accomplish.

Salesforce Cloud + Squadcast Integration: Routing Detailed Incident Alerts

Mar 17, 2022 By Vishal Padghan In Squadcast

Salesforce Cloud is one of the leading cloud-based customer relationship management (CRM) solutions. It provides a shared view of your customers and their relationship with the business. With Salesforce Cloud, users can automate service processes and streamline workflows. Squadcast is an end-to-end incident response tool. Built with an SRE mindset, it streamlines all the incident response activities. Squadcast aligns your teams towards a common organizational goal of better reliability.

Observability for SRE & DevOps Engineer

Mar 16, 2022 By Amartya Gupta In Motadata

Software development moves quickly to meet client requirements, and it needs to happen safely and with appropriate precautions. DevOps engineers can help with this; however, it is not possible without Observability.

Severity Levels (What They Are & Why They Matter)

Mar 15, 2022 By Myra Nizami In Blameless

Wondering about severity levels? We explain what incident severity levels are, how to classify them, and how they will affect your incident management process. What are severity levels? Incident severity levels are the measure of the impact an incident will have on a system. In general, a lower number severity level, such as SEV-1, denotes a higher impact on the system.

What Does AIOps Mean for SREs? It's Complicated.

Mar 11, 2022 By JJ Tang In Rootly

If you’re an SRE, you might view AIOps with great excitement. By automating complex workflows and troubleshooting processes, AIOps could make your life as an SRE much easier. Alternatively, SREs may choose to view AIOps with disdain. They might think of AIOps as just a fancy buzzword that doesn’t live up to its promises, and that can become a distraction from the SRE tools that really matter. Which perspective is right?

How to Implement Global View and High Availability for Prometheus

Mar 11, 2022 By Ricardo Castro In Squadcast

Ensuring that systems run reliably is a critical function of a site reliability engineer. A big part of that is collecting metrics, creating alerts, and graphing data. It’s of the utmost importance to gather system metrics from several locations and services, and correlate them to understand system functionality as well as to support troubleshooting.

Five Nines Availability | Is It Achievable?

Mar 10, 2022 By Noor-ul-Anam Ruqayya In Blameless

Wondering about five nines availability? We explain what five nines availability is, why it’s important, how to measure it, and whether it’s an achievable goal.

Jira Follow Up Actions by Incident Type

Mar 9, 2022 By Emily Arnott In Blameless

We’re excited to announce our Jira integration is leveling up to make tracking Blameless incidents in Jira faster, smoother, and more powerful. Teams can now specify the way incidents get categorized into projects, and we’ve also enhanced the overall user experience. Let’s take a closer look.

AppScope 1.0: Changing the Game for SREs and Devs

Mar 8, 2022 By The AppScope Team In Cribl

SREs and Devs are used to solving problems even when an awkward or inefficient way is the only way. In AppScope 1.0, SREs and Devs have a new alternative to standard methods that the AppScope team thinks will make that problem-solving a lot more fun. We in the AppScope team constantly hear firsthand about life in the SRE trenches. For this blog, we “interview” a fictional SRE/Dev whose thoughts and comments are a mash-up of things we’ve heard from real people we know.

Runbook Automation | What It Is & How To Do It

Mar 8, 2022 By Myra Nizami In Blameless

Looking into runbook automation? We explain how runbook automation works, with examples and tips on how to use it to streamline your incident response process.

ServiceNow + Squadcast Integration: Automate IT Ticketing and Project Tracking

Mar 4, 2022 By Nir Sharma In Squadcast

ServiceNow is a workflow automation platform used by organizations for their IT ticketing and project management needs. In contrast, Squadcast is an end-to-end incident management and SRE platform that is used by organizations for their reliability requirements.

SREs, Apply These 6 Rules to Improve the Performance of Your Software

Mar 4, 2022 By Theo Schlossnagle In Circonus

In today’s world, the performance of your IT systems has a direct impact on your brand reputation and overall business revenue. A “good enough” approach to software performance is no longer good enough. This has led to the growing importance of SREs and a shift to more sophisticated, advanced observability that requires moving beyond basic on/off monitoring to advanced monitoring techniques.

What SREs Can Learn from Capt. Sully: When to Follow Playbooks

Mar 4, 2022 By Andre King In Rootly

When are you smarter than your playbooks, and when are your playbooks smarter than you? That’s a question that engineers rarely step back to consider. The rational, disciplined parts of our minds tell us that the playbooks we are supposed to follow were carefully designed and tested, and that we should stick to them at all costs.

Incident Response Lifecycle | A Complete Explanation

Mar 3, 2022 By Emily Arnott In Blameless

Wondering about the incident response lifecycle? We explain what it is, and how each phase helps lead to effective incident resolution. What is the incident response lifecycle? The incident response lifecycle is an organization’s framework for responding to an incident that disrupts service. The incident response lifecycle contains the following phases.

Golden Signals - Monitoring from first principles

Mar 2, 2022 By Safeer CM In Squadcast

Building a successful monitoring process for your application is essential for high availability. In the first of this three-part blog series, Safeer discusses the four key SRE Golden Signals for metrics-driven measurement, and the role they play in the overall context of monitoring. Monitoring is the cornerstone of operating any software system or application effectively. The more visibility you have into the software and hardware systems, the better you are at serving your customers. It tells you whether you are on the right track and, if not, by how much you are missing the mark.

Kubernetes Health Check Using Probes

Mar 2, 2022 By Squadcast Community In Squadcast

Kubernetes is an open source container orchestration platform that significantly simplifies an application's creation and management. Distributed systems like Kubernetes can be hard to manage, as they involve many moving parts and all of them must work for the system to function. Even if a small part breaks, it needs to be detected, routed and fixed. These actions also need to be automated. Kubernetes allows us to do that with the help of readiness and liveness probes.

Site Reliability Chats (Mar 2, 2022)

Mar 2, 2022 By Gremlin In Gremlin

Welcome to the first episode of Site Reliability Chats with your hosts Jason Yee @gitbisect and Julie Gunderson @julie_gund.

View Video

Postmortems Now Called Retrospectives in Blameless

Mar 2, 2022 By Blameless In Blameless

Something big happened at Blameless this month — our “Postmortem” feature was updated to its new name, “Retrospective”. To the naysayer, I suppose you’re thinking, This seems trivial. Different teams call it different names anyway, so why bother making the change? First let me say, thank you for reading our blog and I hope you finish this one through to the end. Now, allow me to explain our reasoning and why we’re excited about this update.

Alert Fatigue in SRE: What It Is & How To Avoid It

Mar 1, 2022 By Emily Arnott In Blameless

Wondering about alert fatigue? We describe what it is, how it affects software development teams, and how to avoid it. What is alert fatigue? Alert fatigue is the phenomenon of employees becoming desensitized to alert messages because of the overwhelming volume and the number of false positives they receive. The risk with alert fatigue is that important information will be overlooked or ignored.
