
The latest News and Information on DevOps, CI/CD, Automation and related technologies.

Librato on Heroku is Going Away and Hosted Graphite Is the Better Next Step

Librato (a SolarWinds product) is being sunset in summer 2025, and that directly affects Heroku teams who’ve relied on the Librato add-on for “good enough” visibility into dynos, routers, and Postgres. If you’re in that group, you’ll need a replacement monitoring add-on that keeps you covered on Heroku and lets you grow beyond it without re-architecting how you ship metrics.

The strategic art of build vs. buy in software delivery ft. Tara Hernandez of MongoDB

Rob Zuber sits down with Tara Hernandez, VP of Developer Productivity at MongoDB and former Netscape engineer who helped create early continuous integration systems, to explore strategic frameworks for build vs. buy decisions in modern software delivery.

Jaeger Monitoring: Essential Metrics and Alerting for Production Tracing Systems

Your Jaeger setup is running. Traces are coming in, and the UI is helping you spot slow services or debug broken flows. But just like any part of your observability stack, Jaeger needs some basic monitoring to stay reliable. If the collector starts queueing spans or the agent runs out of buffer, it can lead to dropped traces, sometimes without any obvious sign in the UI. This blog focuses on the operational side of Jaeger.
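As a rough illustration of the kind of check described above, here is a minimal sketch that reads Prometheus-style metrics text and flags a backed-up queue or dropped spans. The metric names `jaeger_collector_queue_length` and `jaeger_collector_spans_dropped_total`, the threshold, and the sample data are assumptions made for the example, not taken from the post; verify the names against the metrics your own Jaeger version exposes.

```python
from typing import Dict, List

# Assumed metric names; confirm them against your collector's metrics endpoint.
QUEUE_METRIC = "jaeger_collector_queue_length"
DROPPED_METRIC = "jaeger_collector_spans_dropped_total"


def parse_metrics(text: str) -> Dict[str, float]:
    """Sum sample values per metric name from Prometheus text format.

    Assumes each sample line ends with its value (no trailing timestamp).
    """
    totals: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the label set, if any
        try:
            totals[name] = totals.get(name, 0.0) + float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return totals


def check_collector(totals: Dict[str, float], max_queue: float = 1000.0) -> List[str]:
    """Return human-readable warnings; max_queue is an illustrative threshold."""
    warnings: List[str] = []
    if totals.get(QUEUE_METRIC, 0.0) > max_queue:
        warnings.append("collector queue is backing up")
    if totals.get(DROPPED_METRIC, 0.0) > 0:
        warnings.append("collector has dropped spans")
    return warnings


# Made-up sample scrape, for demonstration only.
SAMPLE = """\
# TYPE jaeger_collector_queue_length gauge
jaeger_collector_queue_length{host="a"} 1500
jaeger_collector_spans_dropped_total{host="a"} 3
"""

if __name__ == "__main__":
    for warning in check_collector(parse_metrics(SAMPLE)):
        print(warning)
```

In practice these conditions would more likely live in alerting rules than in a script; the sketch only shows which signals are worth watching.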

Building your AI infra, our tips

Modular architecture: Decouple compute from storage so each can scale independently. This makes it easier to adapt to growing or shifting workloads over time.
Future-ready hardware: Select GPUs and CPUs not just for current workloads but with an eye on scalability, including support for newer accelerator types.
Scalable design: Ensure the system allows seamless addition of compute nodes or storage without a full redesign.

Running AI without blowing up your storage

Storage is often underestimated: In infrastructure discussions, compute and networking get most of the attention, while storage is treated as secondary. For AI workloads, that can be a costly oversight.
Data throughput for specialized hardware: AI infrastructure powered by GPUs can process massive volumes of data at unprecedented speeds. This puts immense pressure on the storage system to keep up.
Scale-out performance: An on-prem, scale-out, software-defined storage setup allows you to meet high performance demands, grow capacity as needed, and stay in control of infrastructure costs.
Sponsored Post

When AI Becomes the Judge: Understanding "LLM-as-a-Judge"

Imagine building a chatbot or code generator that not only writes answers but also grades them. In the past, ensuring AI quality meant recruiting human reviewers or using simple metrics (BLEU, ROUGE) that miss nuance. Today, we can use generative AI itself to evaluate its own work. LLM-as-a-Judge means using one Large Language Model (LLM), such as GPT-4.1 or Claude 4 Sonnet/Opus, to assess the outputs of another. Instead of relying on a human grader, we prompt an LLM with questions like "Is this answer correct?" or "Is it on-topic?" and have it return a score or label. This approach is automated, fast, and surprisingly effective.
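The pattern can be sketched in a few lines: build a grading prompt, send it to a judge model, and parse a score out of the reply. This is a minimal sketch under stated assumptions. The prompt wording, the 1-5 scale, the `SCORE:` reply format, and the `fake_llm` stand-in are all hypothetical; a real setup would swap `fake_llm` for a call to whichever model client you use.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; the wording and 1-5 scale are illustrative.
JUDGE_TEMPLATE = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Is this answer correct and on-topic? "
    "Reply with 'SCORE: <1-5>' followed by a one-line reason."
)


def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the grading template with the item to be judged."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)


def parse_score(reply: str) -> Optional[int]:
    """Extract the 1-5 score, or None if the judge ignored the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge(question: str, answer: str, call_llm: Callable[[str], str]) -> Optional[int]:
    """Ask the judge model to grade one answer and return its score."""
    return parse_score(call_llm(build_judge_prompt(question, answer)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return "SCORE: 4\nMostly correct."


if __name__ == "__main__":
    print(judge("What is 2 + 2?", "4", fake_llm))
```

Returning `None` on an unparseable reply keeps malformed judge output from being silently counted as a score.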

How To Start A FinOps Career: Roles, Skills, Jobs, And Growth Paths

Want to know how to get a job in FinOps? You’re not alone. FinOps careers are rapidly emerging as essential roles in tech, helping companies manage cloud costs without slowing down innovation. These roles sit at the intersection of finance, engineering, and cloud operations. FinOps roles and responsibilities are expanding fast. In this guide, you’ll learn what FinOps professionals do, how to frame your skills for the job, what certifications help, and how to grow your FinOps career over time.

10 Best Live Call Routing Software for Incident Management

I curated a list of the 10 best live call routing software tools for incident management. To compare them, I created a checklist of essential features, then read each tool's documentation to see how it stacks up against that checklist. Finally, I summarized the results in three tables. If you are new to live call routing, I’ve included a section that covers the basics for you. Let’s get started! Key highlights.

Security Compliance Management Scanning in Puppet Enterprise

In this session, Jason and Nelson walk through Security Compliance Management (SCM) scanning in Puppet Enterprise, emphasizing its seamless integration and ability to check compliance against CIS benchmarks. The walkthrough highlights the compliance dashboard, which offers immediate insights into security status and supports scanning for up to a hundred thousand nodes.

Provision & Deploy Applications in Minutes with Resolve

Automate end-to-end application provisioning with Resolve Actions. In this demo, we walk through how Resolve automates the entire process of provisioning virtual machines (VMs) and deploying applications—starting from a simple request form, all the way to fully configured and monitored servers. You'll see how Resolve.