The latest news and information on Site Reliability Engineering and related technologies.

Sponsored Post

The Importance of Observability for Site Reliability Engineers (SREs)

Site reliability engineers (SREs) play a crucial role in ensuring the reliability of systems. From creating software to improving system reliability in production, responding to incidents, and fixing issues, SREs are responsible for guaranteeing the health of applications. Observability supports SREs in this work: an observable system allows them to identify and fix issues promptly, leaving them better equipped to fast-track development cycles.

Tips to make your Retrospectives Meaningful

If done right, retrospectives can help you inspect past actions, adapt to future requirements, and guide teams towards continuous improvement. However, organizations find it difficult to adopt the right mindset to execute retrospectives effectively. This blog will help you understand what retrospectives are and provide valuable tips to make your retrospectives meaningful.

Introducing Webforms - Involve end users directly into your Incident Management process

Over the years, we’ve received requests from our customers for a feature that enables their customers and end users to create or report incidents directly on Squadcast. To our valued customers: we heard you! We are excited to introduce Webforms to do exactly that. In the past, we’ve addressed the challenges pertaining to on-call processes and the best practices that teams can implement.

Managing Squadcast resources with our expanded Terraform provider

Hey folks! We’re excited to announce that we’ve vastly expanded the capabilities of our Terraform provider. Previously, our Terraform provider was limited to creating and managing services as a resource. It now covers the entire spectrum of resources available on Squadcast, from creating and managing users and escalation policies to managing SLOs. What does that mean for you?

Using Observability with Kubernetes to Automate Site Reliability Engineering

In this video, Anthony Evans, solution architect, explains how the StackState topology-powered observability platform can help SREs automate site reliability, putting their organizations on the path to becoming a zero-downtime enterprise. See how StackState helps to unify and correlate data across your stack, visualize your entire IT environment, instantly pinpoint root cause, reduce alert storms, and, with AIOps capabilities, even prevent problems proactively. It's all here!

What is a Security Operation Center and how do SOC teams work?

With the growing complexity of IT environments, it is essential to have robust security processes that can safeguard them from cyber threats. In this blog, we will explore how security operation centers (SOCs) help you monitor, identify, and prevent cyber threats.

How to add a Golden Signal to a service in Gremlin RM

In this video, we show you how to add a Golden Signal to a service. Gremlin uses your Golden Signals to ensure your services are still healthy and responsive during reliability tests. You can configure Golden Signals to use an existing monitor in your observability tools, such as Datadog, New Relic, or Prometheus. We recommend adding all four Golden Signals to each of your services to ensure comprehensive coverage.

What are the four Golden Signals?

When it comes to building reliable and scalable software, few organizations have as much authority and expertise as Google. Their Site Reliability Engineering Handbook, first published in 2016, details the practices Google used to maintain reliability as it scaled. But when you have over a million servers running thousands of services across more than twenty data centers, how do you monitor them in a consistent, logical, and relevant way?