Operations | Monitoring | ITSM | DevOps | Cloud

How Okta keeps 99.99 percent uptime with Datadog

How do you maintain 99.99 percent uptime across thousands of Kubernetes hosts and multiple cloud providers? Okta engineers explain why observability is critical to keeping authentication and authorization services running at scale. Watch how Okta uses Datadog to bring metrics, logs, and traces into a single view, speed up root cause analysis, and reduce time to mitigation while controlling costs.

Universal Mesh in action: how PayPal solved multi-cloud complexity with HAProxy

The hardest part of modern infrastructure isn’t choosing your deployment environments — it’s bridging communication between them. Large enterprises are constantly facing the challenge of keeping everything connected, secure, and fast when their infrastructures are spread across different clouds and on-premises systems.

How LinkedIn modernized its massive traffic stack with HAProxy

Connecting nearly a billion professionals is no small feat. It requires an infrastructure that puts the user experience above everything else. At LinkedIn, this principle created a massive engineering challenge: delivering a fast, consistent experience across various use cases, from the social feed to real-time messaging and enterprise tools.

Spotify's performance & control across large monitoring environments with VictoriaMetrics

When your active time series is in the billions and the total number of data points you need to monitor runs into the tens of trillions, you need a high-performance observability solution with operational simplicity. Streaming behemoth Spotify is one such case. Their observability team chose VictoriaMetrics as the fastest monitoring and observability solution on the market.

How Inkeep Monitors Their AI Agent Framework with SigNoz

AI agents are fundamentally different beasts to monitor compared to traditional applications. A single user request can trigger a cascade of 10+ internal operations: sub-agent transfers, tool executions, LLM calls, API requests, each with unpredictable latency and failure modes. When something goes wrong (and with LLMs, things go wrong in creative ways), you need to see the entire execution flow to debug effectively.

How Roblox uses HAProxy Enterprise to power gaming for 100 million daily users

One of the most anticipated presentations at HAProxyConf 2025 came from gaming and user-generated content (UGC) innovators Roblox. Software Engineer Chris Jones and Senior Site Reliability Engineer Ben Meidel gave an enthusiastic and enjoyable presentation, detailing their journey from legacy hardware to a sophisticated, automated, and secure application delivery platform, with seamless, API-powered dynamic configuration and upgrades, supported by the HAProxy Enterprise Dynamic Update Module.

How OpManager powered IT reliability for DWIHN

In healthcare, every moment counts, and for Detroit Wayne Integrated Health Network (DWIHN), every heartbeat depends on a network that doesn’t skip one. Serving over 75,000 patients across Detroit and Wayne County, DWIHN’s IT network powers essential behavioral health services, from autism care to crisis intervention. When its systems started showing signs of strain, DWIHN turned to ManageEngine OpManager to bring reliability, clarity, and calm back to its IT operations.