Today we are excited to launch one of our flagship ML-assisted troubleshooting features in Netdata – the Anomaly Advisor. The Anomaly Advisor builds on earlier work that introduced unsupervised anomaly detection capabilities into the Netdata Agent from v1.32.0 onwards.
At this point we are well past the third installment of the trilogy, and at the end of the second installment of trilogies. You might be wondering if the second set of trilogies was strictly necessary (we’re looking at you, Star Wars) or a great idea (well done, Lord of the Rings, nice complement to the books). Needless to say, detecting anomalies in data remains as important to our customers as it was back at the start of 2018 when the first installment of this series was released.
Observability data comes in three types: metrics, traces, and logs. In this article, we will look at how anomaly detection techniques can be applied to time-series metrics for observability use cases. There are many different anomaly detection algorithms, but they all share a common goal: to find data points that are significantly different from the rest of the data. This is useful for identifying outliers, monitoring for unusual behavior, and detecting errors in data collection.
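To make that shared goal concrete, here is a minimal sketch of one of the simplest such algorithms: flagging points in a metric series whose z-score (distance from the mean, in standard deviations) exceeds a threshold. The function name, the series, and the threshold of 2.5 are illustrative assumptions, not any particular product's implementation.

```python
# A minimal z-score outlier check on a time-series metric.
# Threshold and data are illustrative assumptions only.
from statistics import mean, stdev

def find_outliers(values, threshold=2.5):
    """Return indices of points whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # a constant series has no outliers
    return [i for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

# A mostly stable series with one obvious spike at index 7:
series = [10, 11, 9, 10, 12, 10, 11, 95, 10, 9]
print(find_outliers(series))  # → [7]
```

Real anomaly detection systems go well beyond this (seasonality, multivariate models, learned baselines), but the core idea is the same: quantify how far each point sits from "normal" and flag the ones that sit too far.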
Our Analytics & ML lead Andrew Maguire recently had a chance to share our new Anomaly Advisor feature with the wider CNCF community. In his demonstration he did some light chaos engineering (using Gremlin and stress-ng) to generate some real anomalies on his infrastructure and watch how it all played out in the Anomaly Advisor in Netdata Cloud. There were also some great questions and discussion from the audience around ML in general and in the observability space itself.
A couple of months ago, a Splunk admin told us about a bad experience with data downtime. Every morning, the first thing she would do is check that her company’s data pipelines hadn’t broken overnight. She would log into her Splunk dashboard and run an SPL query to get last night’s ingest volume for their main Splunk index, to make sure nothing looked out of the ordinary.
IT Operations has a wide spectrum of roles and responsibilities. The positions range from level 1 (L1) operators to Site Reliability Engineers (SREs) and everything in between. L1 operators, for example, are often almost exclusively reactive: they feed off the constant stream of incidents reported by clients and events raised by monitoring and alerting systems. This is in contrast to SREs, who work at the other end of the spectrum.
Here’s a myth that needs to be debunked – the cloud will take care of my performance problems! Our experience shows that cloud architecture usually introduces new layers of complexity that did not exist in the on-premises world. You need a modern AI-powered full stack monitoring solution to find the needle in the multi-layered haystack that is the cloud. Sometimes, it’s the cloud vendor who has to fix the issue.
Moving beyond traditional monitoring to embrace full stack observability offers a seemingly endless range of benefits. Beyond unifying logs, metrics, and traces in a single platform, the opportunity to enlist advanced analytics and engage a more predictive approach represents another huge step forward.
To explain how cost anomaly detection works in AWS, let’s first look at an analogy. Imagine you strategically propagate, cultivate, harvest, and replenish trees for a thriving forest products company. In keeping with your eco-friendly policy, you only harvest the trees with straight and tall trunks. You leave irregularly shaped trees alone. Your company doesn’t harvest trees you won't use.