Today we are excited to launch one of our flagship ML-assisted troubleshooting features in Netdata – the Anomaly Advisor. The Anomaly Advisor builds on earlier work to introduce unsupervised anomaly detection capabilities into the Netdata Agent from v1.32.0 onwards.
At this point we are well past the third installment of the trilogy, and at the end of the second installment of trilogies. You might be wondering if the second set of trilogies was strictly necessary (we’re looking at you, Star Wars) or a great idea (well done, Lord of the Rings, nice complement to the books). Needless to say, detecting anomalies in data remains as important to our customers as it was back at the start of 2018 when the first installment of this series was released.
Our Analytics & ML lead Andrew Maguire recently had a chance to share our new Anomaly Advisor feature with the wider CNCF community. In his demonstration he did some light chaos engineering (using Gremlin and stress-ng) to generate some real anomalies on his infrastructure and watch how it all played out in the Anomaly Advisor in Netdata Cloud. There were also some great questions and discussion from the audience around ML in general and in the observability space itself.
A couple of months ago, a Splunk admin told us about a bad experience with data downtime. Every morning, the first thing she would do is check that her company’s data pipelines hadn’t broken overnight. She would log into her Splunk dashboard and run an SPL query to get last night’s ingest volume for their main Splunk index, to make sure nothing looked out of the ordinary.
IT Operations has a wide spectrum of roles and responsibilities. The positions range from level 1 (L1) operators to Site Reliability Engineers (SREs) and everything in between. L1 operators, for example, are (often) almost exclusively reactive. They feed off the constant stream of incidents reported by clients and events that are reported by monitoring and alerting systems. This is in contrast to SREs, who work at the other end of the spectrum.
Here’s a myth that needs to be debunked – the cloud will take care of my performance problems! Our experience shows that cloud architecture usually introduces new layers of complexity that did not exist in the on-premises world. You need a modern AI-powered full-stack monitoring solution to find the needle in the multi-layered haystack that is the cloud. Sometimes, it’s the cloud vendor who has to fix the issue.
Moving beyond traditional monitoring to embrace full-stack observability offers a seemingly endless range of benefits. Beyond unifying logs, metrics, and traces in a single platform, the opportunity to enlist advanced analytics and adopt a more predictive approach represents another huge step forward.