September 2022

Discover Chaos Engineering with Reliably

Sep 28, 2022 By Reliably In Reliably

Experience everything great about Chaos Experiments with the added benefits of our reliability platform. See Reliably in action and discover how to proactively verify with chaos engineering so you can anticipate impacts on your users and safely prioritize appropriate actions.

View Video

Reliably

DevOps
SRE

On-Call Schedules - Best Practices in 2022 (With Examples)

Sep 28, 2022 By Myra Nizami In Blameless

As users expect incidents and outages to be addressed as quickly as possible, at any time of day, on-call rotations have become necessary for SRE and DevOps teams. How do you create on-call rotation schedules that are fair and reduce burnout?

Read Post

Blameless

DevOps vs. SRE: What's the Difference?

Sep 28, 2022 By Cortex In Cortex

Despite there being significant differences in the roles, DevOps and Site Reliability Engineering are often lumped together because many people assume they do similar work. Although both attempt to reduce the issues arising from software development processes, their goals, skill sets, and approaches are actually quite different. DevOps engineers focus on the development pipeline, and their goal is to enable better development processes and workflows.

Read Post

Cortex

Service Catalog: Simplifying Service Management and Ownership

Sep 26, 2022 By Vishal Padghan In Squadcast

With the adoption of cloud and microservices, modern IT infrastructures operate with a mesh of services that cater to multiple user requirements. It can get very difficult to simultaneously keep track of numerous services. A Service Catalog helps organize service-related information in a single pane, achieve end-to-end service ownership and get real-time performance insights.

Read Post

Squadcast

Exploring PagerDuty Alternatives for Incident Response

Sep 23, 2022 By Nir Sharma In Squadcast

Incident response refers to effectively responding to infrastructure issues and resolving them in the shortest time frame possible. Due to several loss-inducing high-profile outages over the last few years, organizations have sought to create rigorous processes with specialized tools to resolve incidents quickly and learn from their failures. As one of the first platforms to enter the incident response space, PagerDuty is a dominant player, but over the years, competing platforms have begun carving out their own niche in the incident response space.

Read Post

Squadcast

The Importance of Observability for Site Reliability Engineers (SREs)

Sep 22, 2022 By 2 Steps Team In 2 Steps

Site reliability engineers (SREs) play a crucial role in ensuring the reliability of systems. From creating software to improving system reliability in production, responding to incidents, and fixing issues, SREs are responsible for guaranteeing the health of applications. Observability helps support SREs, because an observable system allows them to identify and fix issues promptly, leaving them better equipped to fast-track development cycles.

Read Post

2 Steps

Tips to make your Retrospectives Meaningful

Sep 16, 2022 By Vishal Padghan In Squadcast

If done right, retrospectives can help you inspect past actions, help adapt to future requirements and guide teams towards continuous improvement. However, organizations find it difficult to adopt the right mindset to execute retrospectives effectively. This blog will help you understand what retrospectives are and provide valuable tips to make your retrospectives meaningful. This blog will cover,

Read Post

Squadcast

What DevOps, SRE and CloudOps teams can learn from CEOs

Sep 15, 2022 By AppDynamics Team In AppDynamics

What can IT teams learn from today’s most successful CEOs? And, perhaps more interestingly, how can IT pros think like a CEO to level up teams across DevOps, SRE and CloudOps? Find out here.

Read Post

AppDynamics

Introducing Webforms - Involve end users directly into your Incident Management process

Sep 14, 2022 By Vardhan NS In Squadcast

Over the years, we’ve received requests from our customers for a feature that enables their customers and end users to create and report incidents directly on Squadcast. To our valued customers: we heard you! We are excited to introduce Webforms to do exactly that. In the past, we’ve addressed the challenges pertaining to on-call processes and best practices that teams can implement.

Read Post

Squadcast

What's difficult about problem detection? - Three Key Takeaways

Sep 14, 2022 By Emily Arnott In Blameless

Welcome to episode 4 of our webinar series, From Theory to Practice. Blameless’s Matt Davis and Kurt Andersen were joined by Joanna Mazgaj, Director of Production Support at Tala, and Laura Nolan, Principal Software Engineer at Stanza Systems. They tackled a tricky and often overlooked aspect of incident management: problem detection.

Read Post

Blameless

Blameless Announces New Integrations with ServiceNow and Microsoft Teams to Centralize and Speed Incident Management

Sep 13, 2022 By Blameless In Blameless

Enterprise Customers Augment Existing Tech-Stack and Significantly Improve Reliability.

Read Post

Blameless

Managing Squadcast resources with our expanded Terraform provider

Sep 13, 2022 By Nakul Shetty In Squadcast

Hey folks! We’re excited to announce that we’ve vastly expanded the capabilities of our Terraform provider. Previously, our Terraform provider was limited to creating and managing services as a resource. It now covers the entire spectrum of resources available on Squadcast, from creating and managing users and escalation policies to managing SLOs. What does that mean for you?

Read Post

Squadcast

Blameless Expands Microsoft Partnership to Deliver Faster, More Intuitive Incident Response Collaboration

Sep 12, 2022 By Phoebe Wang In Blameless

At Blameless, the world’s leading software engineering teams rely on us during incident management. A key part of our offering is the ability to seamlessly integrate with a customer’s unique tech stack. As such, we value partnerships with companies like Microsoft that enhance our user experience and meet the needs of our customers. We understand how essential it is to integrate with communication tools like Microsoft Teams, because it’s the first place a user goes to start an incident.

Read Post

Blameless

SRE vs DevOps: Can they coexist or do they compete?

Sep 9, 2022 By Gremlin

Systems fail, sometimes publicly and at great cost. Airlines have experienced system-wide ticketing outages, causing hundreds of flight cancellations and significant inconvenience to customers. Retailers have experienced website crashes on the busiest shopping days of the year, costing millions in lost revenue and customer goodwill. It is vital to understand both DevOps and SRE and the roles they play in preventing such outages.

Get EBook

Gremlin

Using Observability with Kubernetes to Automate Site Reliability Engineering

Sep 8, 2022 By StackState In StackState

In this video, Anthony Evans, solution architect, explains how the StackState topology-powered observability platform can help SREs automate site reliability, putting their organizations on the path to becoming a zero-downtime enterprise. See how StackState helps to unify and correlate data across your stack, visualize your entire IT environment, instantly pinpoint root cause, reduce alert storms, and, with AIOps capabilities, even prevent problems proactively. It's all here!

View Video

StackState

What Is Incident Management? Everything You Need To Know

Sep 7, 2022 By Myra Nizami In Blameless

Incidents happen, so how do you handle them? We explain incident management, how to prioritize incidents, and the process involved in resolving them.

Read Post

Blameless

What is a Security Operation Center and how do SOC teams work?

Sep 6, 2022 By Vishal Padghan In Squadcast

With the growing complexity of IT environments, it is essential to have robust security processes that can safeguard them from cyber threats. In this blog, we will explore how security operation centers (SOCs) help you monitor, identify, and prevent cyber threats to safeguard your IT environments. This blog covers the following pointers.

Read Post

Squadcast

What are the four Golden Signals?

Sep 2, 2022 By Andre Newman In Gremlin

When it comes to building reliable and scalable software, few organizations have as much authority and expertise as Google. Their Site Reliability Engineering Handbook, first published in 2016, details their practices to maintain reliability as Google scaled. But when you have over a million servers running thousands of services across more than twenty data centers, how do you monitor them in a consistent, logical, and relevant way?

Read Post

Gremlin

How to add a Golden Signal to a service in Gremlin RM

Sep 2, 2022 By Gremlin In Gremlin

In this video, we show you how to add a Golden Signal to a service. Gremlin uses your Golden Signals to ensure your services are still healthy and responsive during reliability tests. You can configure Golden Signals to use an existing monitor in your observability tools, such as Datadog, New Relic, or Prometheus. We recommend adding all four Golden Signals to each of your services to ensure comprehensive coverage.

View Video

Gremlin

How to add a Service to Gremlin Reliability Management (RM)

Sep 2, 2022 By Gremlin In Gremlin

This short demo video shows you how to add a Kubernetes service to Gremlin Reliability Management (RM). We'll walk you through selecting the parts of your infrastructure that make up your service, identifying processes for dependency detection, and adding your Golden Signals.

View Video

Gremlin

Introduction to Gremlin Reliability Management (RM)

Sep 2, 2022 By Gremlin In Gremlin

Gremlin Reliability Management helps teams standardize and automate reliability, one service at a time. In this video, we walk through the platform by showing you how to add your services to Gremlin, integrate your Golden Signals, run reliability tests, and generate reliability scores.

View Video

Gremlin
