Unpacking the Elements of Site Uptime (by way of Jeopardy!)

Picture this: you’ve achieved your second lifelong dream of being a contestant on Jeopardy! Now it’s time for the fateful “final answer.” The good news? You’ve got a comfortable lead over your fellow contestants, and a correct response means eternal bragging rights. The bad news? Miss this one, and everyone — your family, coworkers, dentist, mechanic — will remind you of it forever. The lights dim. The audience holds its breath.

A quick recap of IDPCON 2025

Two weeks ago, we hosted IDPCON 2025, and the response has been overwhelming. Over 250 engineering leaders from 20+ countries joined us for 12 sessions featuring speakers from Canva, Skyscanner, Blackstone, and more. Attendees participated in discussions at 20+ roundtables, sharing strategies and challenges around engineering excellence and internal developer portals.

Declarative Configuration in OTel (Grafana OpenTelemetry Community Call #1)

We’re kicking off a brand-new Grafana OpenTelemetry Community Call! Join us as we look at getting observability into your apps and infrastructure with Grafana, powered by OpenTelemetry. This session covers Declarative Config, the new way to make OpenTelemetry onboarding simple and powerful. Instead of juggling environment variables or boilerplate in your startup code, declarative config gives you a clean, language-agnostic approach that works across SDKs and unlocks future possibilities like remote configuration. Marylia Gutierrez (OTel JavaScript approver & core contributor) will walk us through it.

How Atlassian built a smarter observability system with Grafana and OpenTelemetry

Discover how Atlassian built OpsDeck, an observability platform powered by Grafana, to automate incident detection, improve response time, and reduce troubleshooting from one hour to under a minute. Hear how the Observability Insights team scaled OpenTelemetry, broke silos, and built smarter workflows for both engineers and support.

Demystifying WMI Permissions

Network administrators are always seeking a deeper understanding of their Windows-based environments. Windows Management Instrumentation (WMI) enables their network monitoring tools to access system information, manage configurations, and automate tasks. It plays a vital role in network monitoring by providing a standardized interface for querying and controlling system components. A complex set of permissions governs WMI access.

Clarity in the Dojo: The power of the Summary Agent

In the dojo, not every role is about throwing punches. Some roles are about awareness, the unmistakable voice that tells the fighter when to move, where the strike is coming from, and why the opponent matters. That’s the role of the Summary Agent in Sumo Logic Dojo AI. Unlike a traditional agent, it doesn’t launch queries or carry out actions on its own. Its purpose is to narrate, not act. In doing so, it becomes the foundation for every other decision in the dojo.

How to manage ilert call flows via Terraform

Call flows let you design voice workflows with nodes like “Audio message,” “Support hours,” “Voicemail,” “Route call,” and much more. The ilert Terraform provider now includes an ilert_call_flow resource so you can version and promote these flows across environments. This blog post offers an overview of managing call flows in Terraform, detailing the benefits and key scenarios.

Why your Kubernetes clusters and GPUs should live under one roof

The world remains abuzz with AI hype, but the reality is that most modern applications aren’t purely AI workloads. The average company will have web services, APIs, databases, and background jobs running alongside its machine learning inference or training components. An architecture question everyone faces: should your Kubernetes cluster and GPU compute live in the same data center, or can you split them across providers?

What Is the Incident Response Lifecycle?

The Incident Response Lifecycle is a step-by-step process that helps engineering teams detect, respond to, and recover from unexpected system disruptions or outages. It includes a series of six practical stages: Detection, Analysis, Impact Mitigation, Incident Resolution, Service Restoration, and Post-Incident Analysis. By following this lifecycle, teams can minimize downtime, reduce business impact, and continuously strengthen system reliability.
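
To make the ordering of the six stages concrete, here is a minimal Python sketch that models them as an ordered enumeration. The names `Stage`, `next_stage`, and `walk_lifecycle` are hypothetical and chosen only for illustration; the strictly linear progression is an assumption drawn from the "step-by-step" description, and real incidents may loop back to earlier stages.

```python
from enum import IntEnum
from typing import List, Optional


class Stage(IntEnum):
    """The six lifecycle stages, numbered in the order listed above."""
    DETECTION = 1
    ANALYSIS = 2
    IMPACT_MITIGATION = 3
    INCIDENT_RESOLUTION = 4
    SERVICE_RESTORATION = 5
    POST_INCIDENT_ANALYSIS = 6


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage after `current`, or None once the lifecycle is complete."""
    if current is Stage.POST_INCIDENT_ANALYSIS:
        return None
    return Stage(current + 1)


def walk_lifecycle() -> List[str]:
    """Walk every stage from detection onward and return the stage names in order."""
    names = []
    stage: Optional[Stage] = Stage.DETECTION
    while stage is not None:
        names.append(stage.name)
        stage = next_stage(stage)
    return names
```

Calling `walk_lifecycle()` returns the stage names from `DETECTION` through `POST_INCIDENT_ANALYSIS`. An incident tracker could store the current `Stage` on each incident and use `next_stage` to reject out-of-order transitions.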