
The latest News and Information on DevOps, CI/CD, Automation and related technologies.

Master Class - PCI Compliance and Vulnerability Management for Kubernetes - 2020-05-05

This is the Rancher Master Class with NeuVector that was held on May 5, 2020. In it, NeuVector discusses the challenges of PCI-DSS compliance when working with Kubernetes and presents strategies for securing containers and content, using both OSS tools and its paid solutions.

Kubernetes and the Enterprise Knowledge Graph

In today’s enterprises, we spend much of our time dealing with information, whether it’s data, knowledge or analytics. Just like the assembly line workers of the last century, today’s knowledge workers deal with similar logistics: taking raw materials as an input and producing a finished product as an output. Only in this case, the raw material is all the unorganized and sometimes random information at our disposal, and the finished product is structured information.

Monitor ProxySQL with Datadog

ProxySQL is a MySQL/MariaDB protocol–compliant load balancer and reverse proxy with native support for a range of popular backends including ClickHouse, Amazon Aurora, and Amazon RDS. ProxySQL efficiently distributes queries to your database servers and caches results, improving resource management and boosting database performance. You can also configure ProxySQL for high availability to reduce downtime.

Slowdown is the New Outage - Marco Coulter - Failover Conf 2020

While outage-driven news headlines can cause stock prices to plummet in the short term, performance-driven reputation loss is a slow burn that leads to longer-term customer loss. This session compares slowdowns with outages and explains why they call for insight more than observability. By understanding these differences, you'll be ready to drive agile applications, gain funding for lowering technical debt, and focus on customer retention.

The Halo of Resilience Engineering - J. Paul Reed - Failover Conf 2020

Recent world-impacting events have caused us all to have to rethink the way we go about our daily work; in this talk, we'll look at how some of the pillars of Resilience Engineering might help you and your team deal with the changes we're all being forced to confront.

Improving a Distributed System Post-Incident - Julius Zerwick - Failover Conf 2020

In this session, we will dive into a case study of how a team can recover & improve a distributed system after a major incident. Distributed systems are more prone to failure than other systems due to their incredible complexity and scale, and incidents are a fact of life with these systems.

Built-in Application Resiliency - Allan Shone - Failover Conf 2020

When starting a new application build, keeping an eye on resiliency from the outset prevents headaches down the line. There are many ways to tackle this, especially across different language environments and system ecosystems, but many techniques are shared across them all. In this talk, viewers will dive into those shared techniques, learn how to develop software that is more fault-tolerant and able to withstand the impact of failures, and come away with a high-level list to use as a reference later.

Pitfalls in Measuring SLOs - Danyel Fisher & Liz Fong-Jones - Failover Conf 2020

We built support for SLOs (Service Level Objectives) against our event store so we could monitor our own complex distributed system. In the process of doing so, we learned that there were a number of important aspects that we didn’t expect from carefully reading the SRE workbook. This talk is the story of the missing pieces, unexpected pitfalls, and how we solved those problems. We’d like to share what we learned and how we iterated on our SLO adventure.