How we built an AI SRE agent that investigates like a team of engineers

We built Bits AI SRE to help engineers investigate and solve production incidents, one of the most difficult aspects of operating distributed systems today. As environments grow more dynamic and complex, resolving issues becomes more challenging. Failures now span more services, involve noisier signals, and encompass larger volumes of telemetry data, making it hard for on-call engineers to find root causes quickly. Today, Bits AI SRE is already helping teams decrease time to resolution by up to 95%.

Automate flaky test fixes with the Bits AI Dev Agent and Test Optimization

Flaky tests are a significant source of inefficiency that impacts many engineering teams. Along with failing your build, they interrupt your entire development flow, generate excessive CI/CD noise, and, critically, compromise developer trust in the test suite itself. Datadog Test Optimization enables you to manage test suites at scale by pinpointing the flakiest tests, analyzing their history across hundreds of runs, and automatically surfacing the root cause.

Applying Feature Flag Context To Your OpenTelemetry Spans | Harness Blog

Integrating feature flag context into OpenTelemetry traces enhances observability by recording flag states as span attributes, making it easier to analyze how specific flags influence application behavior. When you toggle a feature flag, you're changing the behavior of your application; sometimes, in subtle ways that are hard to detect through logs or metrics alone. By adding feature flag attributes directly to spans, you can make these changes observable at the trace level.
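
As a minimal sketch of the idea, the snippet below records a flag evaluation as span attributes. `RecordingSpan` is a hypothetical stand-in used so the example runs without a tracing SDK; a real OpenTelemetry span exposes a comparable `set_attribute(key, value)` method. The attribute names, the flag key, and the provider name are illustrative assumptions loosely modeled on OpenTelemetry's feature-flag semantic conventions, which have changed between versions, so check the current specification before relying on them.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

AttributeValue = Union[str, bool, int, float]


@dataclass
class RecordingSpan:
    """Hypothetical stand-in for a tracing span; it only stores attributes."""
    name: str
    attributes: Dict[str, AttributeValue] = field(default_factory=dict)

    def set_attribute(self, key: str, value: AttributeValue) -> None:
        self.attributes[key] = value


def annotate_span_with_flag(span, flag_key: str, variant: str,
                            provider: str = "example-provider") -> None:
    """Attach the evaluated flag to the span so traces can be filtered by it.

    Attribute names are assumptions for illustration, not a confirmed schema.
    """
    span.set_attribute("feature_flag.key", flag_key)
    span.set_attribute("feature_flag.variant", variant)
    span.set_attribute("feature_flag.provider_name", provider)


# Example: the "new-checkout-flow" flag (hypothetical) evaluated to "on".
span = RecordingSpan("checkout")
annotate_span_with_flag(span, "new-checkout-flow", "on")
```

With attributes like these on every span, a trace query can compare latency or error rates between requests that saw the flag on and those that saw it off.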

Easy Guide for Connecting Redis to a Grafana Data Source

Redis is a widely used in-memory data store, commonly deployed as a cache, session store, message broker, or fast key-value database. Because Redis often sits on the critical path of an application, having visibility into its behavior (memory usage, client connections, command throughput, cache efficiency) is essential for troubleshooting and performance tuning.
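
To make "cache efficiency" concrete, here is a small sketch that derives a cache hit ratio from the text returned by the Redis `INFO` command, using its `keyspace_hits` and `keyspace_misses` counters. The sample text and the helper names are assumptions for illustration; a dashboard would normally obtain these values through a data source plugin rather than parsing them by hand.

```python
from typing import Dict


def parse_info(info_text: str) -> Dict[str, str]:
    """Parse INFO-style 'key:value' lines, skipping blanks and '# Section' headers."""
    fields: Dict[str, str] = {}
    for raw_line in info_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def cache_hit_ratio(fields: Dict[str, str]) -> float:
    """Return hits / (hits + misses), or 0.0 when no lookups have happened yet."""
    hits = int(fields.get("keyspace_hits", "0"))
    misses = int(fields.get("keyspace_misses", "0"))
    total = hits + misses
    return hits / total if total else 0.0


# Illustrative sample, not output captured from a real server.
sample_info = """# Stats
keyspace_hits:900
keyspace_misses:100
"""

ratio = cache_hit_ratio(parse_info(sample_info))
```

A ratio that trends downward over time often indicates keys being evicted or expiring before they are reused, which is worth correlating with memory usage.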

Why Aging Networks Put Critical Infrastructure at Risk, and What It Means for Us

Everywhere around us, technology is evolving at lightning speed, yet the networks that underpin these capabilities often lag behind. This gap creates vulnerabilities that can affect everything from energy grids to emergency services. Forbes recently explored this urgent issue in an article featuring insights from our CEO Bruce McClelland, who shared an informed perspective on why modernization is essential, not optional. I encourage you to take a few minutes to read the full article.

How To Calculate Your OpenAI Cost Per API Call (And Why It Matters Now)

OpenAI doesn’t bill per feature, per customer, or per transaction. It bills per token, across multiple models, with usage patterns that can change by the hour. As a result, two API calls that support the same feature can have very different costs. Without a clear way to translate token-level pricing into something product, engineering, and finance teams can reason about, AI spend becomes difficult to forecast and harder to control.
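
The arithmetic behind a per-call cost can be sketched as follows: multiply the input and output token counts by their respective per-million-token rates and add the results. The prices and token counts below are hypothetical placeholders, not actual OpenAI rates, and the function name is an assumption for illustration; real rates vary by model and change over time.

```python
from decimal import Decimal

TOKENS_PER_MILLION = Decimal(1_000_000)


def cost_per_call(input_tokens: int, output_tokens: int,
                  input_price_per_million: Decimal,
                  output_price_per_million: Decimal) -> Decimal:
    """Return the dollar cost of one API call from its token counts.

    Prices are passed in by the caller; none are built in, because
    published rates differ per model and are revised over time.
    """
    input_cost = Decimal(input_tokens) * input_price_per_million / TOKENS_PER_MILLION
    output_cost = Decimal(output_tokens) * output_price_per_million / TOKENS_PER_MILLION
    return input_cost + output_cost


# Hypothetical rates: $2.00 per million input tokens, $8.00 per million output tokens.
small_call = cost_per_call(1_000, 500, Decimal("2.00"), Decimal("8.00"))
large_call = cost_per_call(20_000, 4_000, Decimal("2.00"), Decimal("8.00"))
```

Under these assumed rates, two calls serving the same feature cost $0.006 and $0.072 respectively, which illustrates why per-call cost has to be measured rather than assumed. `Decimal` is used instead of floats so that small per-call amounts add up without rounding drift.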

Six FinOps Certifications And Courses To Set You Up For Success in 2026

FinOps is evolving fast, and 2026 is shaping up to be a big year for specialization. While these certifications are ranked from beginner to advanced to help you build skills in the right order, one course stands out as the hottest recommendation right now: FinOps for AI. AI spend is accelerating, ownership is getting murky, and teams are scrambling to keep up. That urgency is exactly why FinOps for AI is generating so much interest heading into 2026.

Should you still pay for SSL certificates?

There’s a particular flavor of skepticism that shows up whenever someone suggests using Let’s Encrypt. The security team crosses their arms. “Free certificates? For production? We’re a serious organization. We use Sectigo.” I get it. You’ve been buying certificates from the same vendors for twenty years. They send you invoices, you pay them, certificates appear. It feels responsible, and free feels like a trap. But is it?

How to Monitor Network Performance for Multi-Site Businesses

When you’re a business managing network performance across 15 branch offices in different cities, you’re going to see some blind spots. Your headquarters may experience consistent connectivity, while remote locations experience unpredictable slowdowns that can affect your daily operations.

Agentic AI Essentials: Examining the Hype Around Agentic AI

In the first article of our Agentic AI Essentials series, we’ll establish what makes agentic AI distinct. We’ll look at the process of tool calling and examine how agentic systems convert intelligence into action. We’ll also explore the human fears, pressures, and ambitions that fuel the hype around agentic systems. By sorting the signal from the noise, IT decision-makers can take the first step toward making sound decisions around agentic AI adoption.