
In-house vs. MetricFire

You’re ingesting 20,000 data points a second, across 400,000 metrics, from thousands of AWS instances – and your monitoring can’t handle the load. You need a scalable, highly available monitoring and dashboarding solution (and you need it yesterday). Should you build it yourself with an in-house Graphite or Prometheus monitoring system? Or should you skip the headache and choose a hosted service like MetricFire?

The Netdata Community Powered by NodeBB

We recently adopted NodeBB as our software of choice for building the Netdata Community. We had many good reasons for wanting to give our community a proper home online, but I wanted to cover some of the technical reasons for choosing NodeBB as our platform, and the many parallels between the NodeBB and Netdata projects, which were certainly a driving force behind this decision.

Logging Best Practices: From Simple to Space Age

It is tempting to consider logging as a simple, solved problem. We write a log, check our file and, boom, we’ve cracked it. Yet those of us who have sat up at three in the morning, trawling through log files over an unreliable SSH connection, know that this is simply not enough. As your system scales, so too must the sophistication of your tooling. Your logging best practices must be scalable and ready to support your efforts.

Django and the N+1 Queries Problem

The N+1 Queries Problem is a perennial database performance issue. It affects many ORMs and custom SQL code, and Django’s ORM is not immune either. In this post, we’ll examine what the N+1 Queries Problem looks like in Django, some tools for fixing it, and most importantly some tools for detecting it. Naturally, Scout is one of those tools, with its built-in N+1 Insights tool.
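
For readers unfamiliar with the pattern, here is a minimal sketch of what an N+1 looks like. It is not taken from the post and does not use Django: it uses Python's built-in sqlite3 module, and the author/book tables and function names are made up for illustration. The first function runs one query for the list and then one more per row; the second gets the same result with a single JOIN, which is roughly what Django's select_related() does for foreign keys.

```python
import sqlite3


def build_db():
    """Create a throwaway in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE book (
            id INTEGER PRIMARY KEY,
            title TEXT,
            author_id INTEGER REFERENCES author(id)
        );
        INSERT INTO author (id, name) VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
        INSERT INTO book (title, author_id) VALUES ('A1', 1), ('B1', 2), ('C1', 3);
        """
    )
    return conn


def titles_n_plus_one(conn):
    """One query for the books, then one extra query per book (N+1)."""
    queries = 0
    books = conn.execute("SELECT title, author_id FROM book ORDER BY id").fetchall()
    queries += 1
    result = []
    for title, author_id in books:
        # This lookup runs once per row: the "N" in N+1.
        name = conn.execute(
            "SELECT name FROM author WHERE id = ?", (author_id,)
        ).fetchone()[0]
        queries += 1
        result.append((title, name))
    return result, queries


def titles_joined(conn):
    """Same result from a single query that joins the two tables."""
    rows = conn.execute(
        "SELECT book.title, author.name FROM book "
        "JOIN author ON author.id = book.author_id ORDER BY book.id"
    ).fetchall()
    return rows, 1


if __name__ == "__main__":
    conn = build_db()
    print(titles_n_plus_one(conn))
    print(titles_joined(conn))
```

With three books the first version issues four queries and the second issues one; the gap grows linearly with the number of rows.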

Get enhanced Azure cost visualization with SquaredUp 4.7

One of the big problems we hear about with Azure is managing costs and understanding where the money is being spent. In fact, when we launched SquaredUp for Azure back in 2019, the ability to visualize costs quickly became one of the most popular features. It helped our customers (and ourselves, too) get a grip on Azure costs – by making it easy to identify under-utilized resources and take the appropriate action to reduce costs.

Static Thresholds vs. Dynamic Thresholds

IT monitoring is a complex field with several approaches to managing monitoring and alerts. Most current monitoring solutions provide Static Threshold-Based alerting, where IT Operations staff are notified when resource utilization breaches a defined threshold. The problem with Static Thresholds is that they are manually adjusted, and tuning them to meet the specific environment and needs of an organization is a major challenge for IT Operations teams.
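
To make the contrast concrete, here is a small sketch under stated assumptions. The static check is simply "value above a fixed number". The dynamic check shown here, a rolling mean plus k standard deviations over the previous few samples, is one common way to derive a threshold from recent data; it is an illustrative choice, not necessarily the method the post describes, and the function names, window size, and k are hypothetical.

```python
from statistics import mean, stdev


def static_alerts(samples, threshold):
    """Return indexes of samples above a fixed, manually chosen threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


def dynamic_alerts(samples, window=5, k=3.0):
    """Return indexes of samples above mean + k * stdev of the previous `window` samples.

    The threshold is recomputed for every sample from recent history,
    so it follows the metric's normal level instead of being set by hand.
    """
    alerts = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        limit = mean(history) + k * stdev(history)
        if samples[i] > limit:
            alerts.append(i)
    return alerts


if __name__ == "__main__":
    cpu = [50, 52, 51, 49, 50, 51, 80, 50]  # made-up utilization samples
    print(static_alerts(cpu, threshold=90))  # fixed limit set too high for this host
    print(dynamic_alerts(cpu))               # limit derived from recent samples
```

In this made-up series, a static threshold of 90 misses the jump to 80 entirely, while lowering it to 45 would alert on every sample. The rolling check flags only the jump.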

Loki 1.6.0 released: Metric query performance up to 10x faster, push logs from any client to Promtail, query language and LogCLI enhancements, and more!

Things have been busy with the Loki project! Once again, we waited too long between releases, and there are so many new things I won’t be able to list them all. But that won’t stop me from trying, so let’s get to it. For a change of pace, instead of listing interesting PRs, I’m going to talk through Loki’s components and mention the changes in more of a paragraph style. Let’s see how this goes.

New free tool alert! Try the HTTP Response Header Check

We did it again. We just published a new free tool, the HTTP Response Header Check. This handy little gadget quickly grabs your HTTP response headers for your review. It sounds simple because it is. But as every good DevOps pro knows, it is always a good idea to check your headers from time to time.

Backups Suck (But They Don't Have to)

Focus on what matters with instant visibility into the condition of your backup application and detailed analytics to quickly pinpoint where any issues lie. IBM’s backup monster, Spectrum Protect (TSM, as we called it back in the day), sucks. Not because the software sucks – it’s actually the best there is – but because backups suck in general. It’s the quintessential necessary evil of IT.

ChaosSearch Announces New Integration With Opsgenie

ChaosSearch is excited to announce its new integration with Opsgenie — Atlassian’s alerting and incident management platform. Using this integration, your teams can leverage the industry’s most powerful and comprehensive data monitoring and analytics capabilities channeled into a unified workflow through Opsgenie’s easy-to-use interface.