
Built-in Application Resiliency  Allan Shone  Failover Conf 2020

When starting a new application build, keeping an eye on resiliency from the outset prevents headaches down the line. There are many ways to tackle this, especially across different language environments and system ecosystems, but many techniques are shared across them all. In this talk, viewers will dive into these techniques, learn how to develop software that is more fault-tolerant and able to withstand the impact of failures, and come away with a high-level list to use as a reference later.

Pitfalls in Measuring SLOs  Danyel Fisher & Liz Fong-Jones  Failover Conf 2020

We built support for SLOs (Service Level Objectives) against our event store so we could monitor our own complex distributed system. In the process of doing so, we learned that there were a number of important aspects that we didn’t expect from carefully reading the SRE workbook. This talk is the story of the missing pieces, unexpected pitfalls, and how we solved those problems. We’d like to share what we learned and how we iterated on our SLO adventure.

Human-in-the-Loop DevOps  Taylor Barnett  Failover Conf 2020

Within DevOps, automation has become a North Star. We want to automate the toil away, but the goal of "no toil" is unattainable. Many runbooks can only be partially automated because they still require human intervention and insights. Human-in-the-Loop DevOps is the idea that we can benefit from automating toil while still embracing the human interaction in specific tasks.

The Future of DevOps is Resilience Engineering  Amy Tobey  Failover Conf 2020

For more than a decade, many of us have been working to bring DevOps to organizations around the world. We’ve made amazing progress, but there’s so much more to do. Now that continuous integration and deployment are widespread and developers are taking more ownership of production, what’s next? Amy will talk about what Resilience Engineering is, how it relates to DevOps, and how she thinks it gives us the science and research we need to take our organizations to the next level of robustness while remaining agile and growing our ability to care for the people around us.

Identifying EC2 Right Sizing Opportunities for Cost Optimization | Datadog Tips & Tricks

In this video, you’ll learn how to identify right sizing opportunities for your EC2 instances using Datadog metric dashboards. Optimizing your cloud footprint for cost efficiency can be a huge task, especially for large and scaling environments. Using time series data and toplists, Datadog dashboards allow you to see chronically underutilized EC2 instances in your AWS environment. Template variables allow you to sort EC2 instances by team and instance type, so you can quickly identify the scope of cost saving opportunities across your organization.

Introduction to Service Request Automation with Cherwell

A brief introduction to Service Request Automation using the Kelverion Runbook Suite. The Kelverion Runbook Suite provides a cloud automation platform with a range of automation tools, including a rich graphical design experience, smart integrations, ready-built solutions, and the option of an easy-to-configure self-service automation portal.

Tip of the Day - Effective DEM

A truly comprehensive Digital Experience Monitoring strategy requires monitoring from the perspective of the end user with an outside-in approach. Solutions that claim to offer DEM often use nodes placed in cloud providers to conduct synthetic monitoring tests, which doesn't emulate the flow of traffic from actual users. Learn how Catchpoint's DEM platform provides true observability data that ensures availability, reachability, performance, and reliability through proactive synthetic and real user monitoring.

Distributed tracing with OpenTelemetry - Stack Doctor

Want to measure the latency of user requests and know how long each microservice takes to return a response? In this episode of Stack Doctor, we’ll walk you through how to use OpenTelemetry for tracing, and how it shows the way your requests traverse your services and how each service contributes to overall latency.

Patching Operating Systems While Working from Home

A security flaw in your organization can cause huge financial losses as well as credibility and litigation issues. Enabling patch management in your organization can help you fix existing vulnerabilities and reduce the chance of new ones in the future. OpsRamp's built-in patch management capabilities can provide 24x7 availability of your infrastructure. OpsRamp supports patch management for two types of environments: Windows and Linux.