Sponsored Post

End-to-End Testing for Microservices: A 2025 Guide

End-to-end testing has always been a double-edged sword - even more so in the world of microservices. On one hand, E2E tests are critical for validating that all services work together seamlessly in real user flows. On the other hand, many experts warn that heavy reliance on end-to-end testing in a microservices architecture can create a "distributed monolith," slowing down deployments and undermining the very agility microservices promise. There's truth to that: if done poorly, E2E tests can become brittle, flaky, and a bottleneck that reduces your deployment frequency.

Usability Testing Explained: How to Evaluate User Experience and Improve Product Design with Harness

Usability testing is a powerful method for understanding how real users interact with your product, and uncovering the friction, confusion, and hidden roadblocks that can impact the user experience. In this video, we break down what usability testing is, why it matters, and how observing real user behaviour can help teams create more intuitive, user-friendly software.

Resolve's Zero Ticket Minute - Ep. 2 #itautomation #aiautomation #servicemanagement

Last month, Azure + AWS outages spiked global incidents by 250%. Help desks lit up fast. Zero Ticket IT keeps teams steady with proactive updates and instant deflection of those “is it down?” floods. Don’t miss your 60-second IT news hit.

VirtualMetric DataStream + Elasticsearch: A Smarter Way to Send Logs to Elastic

Elasticsearch has long been the backbone of security analytics for organizations that need fast search, flexible dashboards, and scalable visibility across massive datasets. It powers everything from threat hunting to compliance reporting and real-time investigation. But anyone who has operated Elasticsearch at scale also knows a quiet truth: Elasticsearch is only as strong as the data you feed it. And getting clean, consistent, usable telemetry into Elastic is often the hardest part.

Why am I getting R14/R15 errors in NodeJS? | MetricFire

How to Detect, Alert, and Resolve Memory Issues Before They Cause Downtime. When applications scale on Heroku, memory-related issues are among the most common (and most frustrating) sources of instability. Two of the most notorious culprits are the R14 (Memory Quota Exceeded) and R15 (Memory Quota Vastly Exceeded) errors.

Winning Variations Explained: How to Identify True A/B Test Success With Statistical Confidence

A winning variation isn’t just the version that “looks better”; it’s the version that truly and measurably outperforms the control. In this video, we break down what a winning variation is, how to determine it, and why statistical significance is essential for making confident, data-driven product decisions.

How Roblox uses HAProxy Enterprise to power gaming for 100 million daily users

One of the most anticipated presentations at HAProxyConf 2025 came from gaming and user-generated content (UGC) innovators Roblox. Software Engineer Chris Jones and Senior Site Reliability Engineer Ben Meidel gave an enthusiastic and enjoyable presentation, detailing their journey from legacy hardware to a sophisticated, automated, and secure application delivery platform, with seamless, API-powered dynamic configuration and upgrades, supported by the HAProxy Enterprise Dynamic Update Module.

New Feature Friday: Cortex & AWS

Most teams treat AWS like a black box. Cortex turns the lights on. We now automatically ingest all your AWS resources—from Lambda to RDS—and map them to the services and teams that actually own them. Daily. Automatically. No spreadsheets. No guesswork. Scorecards help you enforce real standards (think: runtime upgrades, tagging hygiene, EOL migrations). Workflows help your engineers self-serve AWS resources without needing to be AWS experts.

Welcome to the Next Frontier: AI on Kubernetes

Last week’s KubeCon Atlanta made one thing abundantly clear: Kubernetes is quickly becoming the de facto platform for AI workloads. The event lineup was chock full of talks, workshops, and even co-located events dedicated to AI, machine learning and running data on Kubernetes natively, with approximately 50 (!) sessions in total focused on AI, ML, LLM, and GenAI topics. What was until now mostly PoCs and aspirational is now truly delivering in production.

Incident Postmortem: How to Learn From Failures and Build Reliable Systems

When the issue settles, and systems are back, one question always remains: What actually happened, and how do we stop it from happening again? That’s where incident postmortems come in. Not just as documentation, but as a structured way to learn, improve reliability, and replace guessing with clarity. A good postmortem isn’t about blame, heroics, or perfect narratives. It’s about truth, learning, and building systems that get stronger with every failure.

7 Common Incident Response Challenges and How to Overcome Them

Incident response teams deal with several challenges: alert noise, unclear ownership, lack of automation, and more. It’s important to review and resolve these challenges regularly because they can turn minor issues into major outages. In this blog, we’ll discuss some of the common incident response challenges, how they affect your team, and how you can resolve them. Let’s dive in!

Incident Response Team: Roles, Responsibilities, and Structure Explained

Incidents don’t wait. They hit production, disrupt users, and pull teams into long recovery cycles. A well-structured incident response team helps you move fast, limit damage, and restore services without chaos. In this blog, we’ll explain what an incident response team is, its key functions, team composition, and different types of teams. Let’s get started!

New in Redgate Flyway Enterprise - Drift detection and rollbacks just got easier

In our latest Redgate Flyway Enterprise release, you can store a snapshot directly in the target database, making drift detection and rollback strategies easier and more reliable whether you’re using state-based or migrations-based deployments.

Instrument Jenkins With OpenTelemetry

You can instrument Jenkins with OpenTelemetry using the official plugin and an OpenTelemetry Collector, then send the data to a backend like Last9 to understand where pipeline latency and failures actually originate. Jenkins provides job status and console logs, but it doesn't show how time is distributed across stages, agents, plugins, and external systems. OpenTelemetry fills that gap by emitting traces, metrics, and logs in a standard format that any OTLP-compatible backend can process.

How to Connect Salesforce to Tableau and Use Near-Real-Time Data Across Teams

The effectiveness of Tableau Salesforce integration depends on one decisive factor: the connector. While Tableau’s native connector is straightforward and offers quick access, it lacks support for complex joins, uses scheduled extracts for refreshes, and doesn’t extend to other BI or ETL platforms. To overcome these constraints, many organizations implement ODBC Drivers, which deliver SQL depth and governance designed for analytics at scale.

Caching in C#: A Comprehensive Technical Guide

In .NET systems, performance is often won or lost on the read path. Every extra database or API call adds latency and cost. Caching fixes that by keeping frequently used data, like product lists or lookups, close to your code, turning slow trips into instant reads. This is not theoretical; it works in the real world. Stack Overflow runs a two-tier cache (in-process + Redis), where a Redis hop takes only 0.2–0.5 ms and local memory reads are effectively instant.
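
To make the two-tier idea concrete, here is a minimal sketch in Python rather than C#, so it is a language-neutral illustration and not code from the guide. The names `DictStore`, `TwoTierCache`, and `load` are hypothetical, and a plain dictionary stands in for Redis.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class DictStore:
    """Stand-in for a shared remote cache such as Redis (assumed get/set shape)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def get(self, key: str) -> Optional[Any]:
        return self._data.get(key)

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value


class TwoTierCache:
    """Check local memory first, then the shared store, then the slow source."""

    def __init__(self, remote: DictStore, local_ttl_seconds: float = 30.0) -> None:
        self._remote = remote
        self._ttl = local_ttl_seconds
        self._local: Dict[str, Tuple[float, Any]] = {}  # key -> (expiry, value)

    def get(self, key: str, load: Callable[[], Any]) -> Any:
        now = time.monotonic()
        entry = self._local.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # tier 1 hit: no network hop
        value = self._remote.get(key)
        if value is None:  # note: a stored None is treated as a miss
            value = load()  # both tiers missed: go to the database or API
            self._remote.set(key, value)
        self._local[key] = (now + self._ttl, value)
        return value
```

The short local TTL is a design choice: it bounds how stale one process can get relative to the shared tier without adding invalidation messages.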

From Concept to Reality: The Journey Behind Harness Database DevOps

Harness Database DevOps was born from a simple question - how can database delivery be as seamless and safe as application delivery? Through deep collaboration with design partners, open-source learnings, and relentless iteration, the team built a platform that unites developers, DBAs, and DevOps under a single, automated workflow. At its core, it’s a story of empathy-driven engineering - transforming database change management into a faster, more reliable, and collaborative experience.

Defend Against Shai-Hulud 2.0 Supply Chain Attack with Harness SCS

Shai-Hulud 2.0 shows how quickly a compromised maintainer account can result in thousands of infected NPM packages and repositories within hours. Harness SCS provides end-to-end SBOM visibility, policy enforcement to block compromised NPM packages, and complete traceability to detect malicious components early and prevent them from entering your pipelines.

Announcing HAProxy 3.3

HAProxy 3.3 is here, and this release brings downloadable packages compiled by HAProxy Technologies, numerous TLS enhancements including expanded ACME support, better observability with persistent stats over reloads, and many improvements to performance and flexibility such as support for QUIC on the backend. These powerful capabilities help HAProxy remain the G2 category leader in API Management, Container Networking, DDoS Protection, Web Application Firewall (WAF), and Load Balancing.

How generative AI solves healthcare's 1% carbon footprint

The healthcare industry accounts for 1% of the global carbon footprint, and a single PET CT scan can generate 60kg of CO₂! Regent Lee, Professor at the University of Oxford and moonshot engineer, reveals how Civo-powered Generative AI is transforming radiology. His team's solution eliminates pharmaceutical contrast injections, digitally displacing the pollution. This technology makes radiology safer, more efficient, and significantly greener for the environment. Sustainability in healthcare is non-negotiable.

(AusBiz) | How to Stay Secure in an AI-Driven Software World | The Last Call

In an era of AI-powered development, how do teams move fast and stay secure? JFrog SVP APAC, Sunny Rao, joined AusBiz’s The Last Call to break it down — from securing the software supply chain to why end-to-end visibility is now essential for every tech organization. Discover why this matters for the future of software and AI-driven innovation.

Deeper Coverage with Less Complexity - New in DataStream

This month’s DataStream update brings meaningful improvements across pipeline management, MSSP workflows, and endpoint visibility. We’ve focused on giving security teams more control over how data moves through their environment, expanding coverage for both Windows and Linux, and strengthening governance for multi-tenant deployments. Let’s walk through what’s new.

10 platform engineering tools your devs will thank you for

Modern engineering teams are shipping more services, managing more complex infrastructure, and moving faster than ever. But this velocity often comes at a cost to the developer experience. Engineers are frequently bogged down by infrastructure complexity, inconsistent tooling, and a lack of clear standards, which leads to cognitive overload and slower cycle times.

Metrics That Matter In FinOps: Co-Create Value With Engineering And Finance Collaborations

FinOps thrives on clarity, and clarity is built on metrics. Metrics give engineering and finance a shared language to understand costs, evaluate trade-offs, and guide innovation. The most impactful metrics go beyond “how much are we spending?” and help us answer: When we measure these things, we stretch beyond tracking progress to fueling it.

Smooth Operator: The Role Of Autonomous FinOps In Cloud Cost Management

(Almost) everyone is using generative AI, and just as many aren’t seeing any benefits. Research firm Gartner calls it the “gen AI paradox” — nearly 80% of companies say they’ve invested in generative solutions, and the same number report no benefits to their bottom line. What’s more, 90% of projects are stuck in pilot mode; ready to take off, but just can’t get up to speed.

Harness AWS: From Code to Cloud, Smarter and Faster

Harness makes software delivery in AWS faster, safer, and more delightful. Harness, the AI Platform for Everything After Code, offers CI/CD, infrastructure-as-code management, and cloud cost management capabilities tailored to the AWS environment. Harness has come a long way since its 2019 debut on the AWS Marketplace. Back then, over half of Harness customers were already running on AWS, and Harness focused on delivering Continuous Delivery as a Service for AWS applications.

AI and DevOps in 2025: How Autonomous Engineering Will Transform Software Operations and Reliability

DevOps started as a way to break down barriers between development and operations, but by 2025 the movement has shifted into something far more ambitious. Instead of simply speeding up releases or tightening workflows, companies are now adopting autonomous engineering systems: tools powered by AI that don't just support DevOps practices but actually carry them out.

KubeCon Retrospective: Platform Engineering Needs to Do More Testing

Every year, KubeCon offers a candid look at where the cloud-native community stands — the tools gaining traction, the pain points teams share, and the big gaps still holding organizations back. After a week of deep conversations, session hopping, and talking to dozens of platform teams, one theme became impossible to ignore: Platform engineering still isn’t doing enough testing. And even more surprising: many teams don’t think testing is their responsibility.

How LEO satellite technology is transforming enterprise connectivity

It’s not new to say that reliable, high-speed connectivity is critical for businesses. Yet even as global networks expand, many enterprises still face connectivity challenges, whether operating in remote industrial zones, across oceans or in developing regions with limited terrestrial infrastructure.

Drowning in Tickets? Your IT Service Desk Solution Might be Why

You hear that? It’s the unmistakable, terrifying flood of tickets rolling in! Password resets, VPN issues, access requests, and performance alerts. The numbers climb faster than the team can respond. You’ve added automation, new tools, even a chatbot or two, but the tide surges on. Here’s a plot twist: sometimes, your IT service desk solution isn’t solving the problem. Sometimes, it’s the thing keeping the problem alive.

Microsoft Sentinel Cost Optimization with Staged Routes and Commit Processors

As security data volumes grow, so do the costs of processing and storing them. Microsoft Sentinel and other SIEM platforms charge based on data ingestion, which makes every decision about normalization rules critical and every duplicate log a direct expense. Enterprise-scale security data pipelines face a persistent problem: data duplication across normalization tiers. As logs move through multiple transformation stages, it’s often impossible to know in advance which version will succeed.

The engineering leader's guide to AI tools for developers in 2026

The holiday shopping season is a familiar ritual for many. We spend hours researching the best deals, comparing features, and reading reviews to make sure we’re investing in the right things. As we all come to grips with the fact that 2026 is right around the corner, engineering leaders are doing the same thing, but largely in response to the explosion of AI developer tools.

From data management to an intelligent data fabric architecture

Large enterprises today manage more machine data than ever before. From legacy and modern applications, ERP and supply chain systems, to cloud infrastructure, cybersecurity, and customer-facing applications, much of this valuable data remains trapped in silos, limiting its potential to drive faster decisions, strengthen resilience, and meet the demand for optimum service availability.

IA for AI: Rethinking How We Store, Surface, And Share Data In A Conversational World

Information architecture used to be about structure. We organized menus and pages into trees, built hierarchies, and created pathways for people to follow. For years, that worked. Navigation was the interface. But that world is changing. People aren’t clicking their way through information anymore. They’re asking for it. They’re refining questions, expecting context, and assuming that systems will not only understand what they mean, but act on it.

AI: Your (Not So) Secret Agent In Cloud Cost Control

Read a few articles on artificial intelligence and financial operations, and you’re bound to run across a sentence like this: AI enables FinOps teams to reduce TCO and boost ROI. Or one like this: The future of FinOps uses agentic AI-powered systems to detect and remediate cost issues automatically. Keep reading and you’ll find piece after piece that say a lot about AI and FinOps … without really saying anything.

Perspectives on turbulence part 1: Introducing new research from Pulsant

Since the publication of the inaugural AI Sector Study in 2022, the UK’s AI ecosystem has grown to include more than 5,800 companies – an 85% increase over the past two years. AI revenue is now £23.9 billion, and the sector employs more than 86,000 people. To put that in context, it’s bigger than the UK gambling sector – on both counts. Digital infrastructure is the foundation of this new economy.

Software Release Life Cycle Explained: From Planning to Production

Software doesn’t go live overnight. It moves through a structured, repeatable process known as the software release life cycle — from initial planning to deployment and ongoing maintenance. In this video, we break down each stage of the release cycle: planning, development, testing, staging, deployment, and monitoring. You’ll also see how modern tools like Harness help teams automate CI/CD, feature flags, testing, and progressive delivery to ship better software faster and with less risk.

What Is a T-Test? A/B Testing with Statistical Confidence

When running A/B tests, how do you know if the results are actually meaningful — or just random chance? That’s where the t-test comes in. In this video, we explain how t-tests help you evaluate whether changes in conversion rates, engagement, or performance metrics are statistically significant. You’ll learn how they work, when to use them, and why they’re essential for making confident product, marketing, and engineering decisions.
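
As a small illustration of the calculation, here is a sketch in Python. It assumes Welch's unequal-variance form of the test, since the blurb does not say which variant the video covers. The function name `welch_t` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    va = variance(a) / len(a)  # squared standard error of the mean for a
    vb = variance(b) / len(b)  # squared standard error of the mean for b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-user metric values for control and variant.
control = [1, 2, 3, 4, 5]
variant = [2, 3, 4, 5, 6]
t_stat, dof = welch_t(control, variant)  # t_stat == -1.0, dof == 8.0
```

The statistic is then compared against a t-distribution with `dof` degrees of freedom to get a p-value. With these toy numbers, |t| = 1.0 is well below the usual two-sided 5% critical value of about 2.31 for 8 degrees of freedom, so the difference would not count as significant.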

What Is Multivariate Testing? A Guide to Experimenting With Multiple Elements

A/B testing is great — but what if you want to optimize more than one thing at a time? That’s where multivariate testing (MVT) comes in. In this video, we break down how MVT works, why it gives you deeper insights than A/B testing, and how you can use it to test multiple page or app elements at once. You’ll learn how to run multivariate tests, when to use them over A/B tests, and how Harness makes this seamless inside your CI/CD pipeline.

FinOps Strategy for Hybrid IT: Interview with Tim Conley

FinOps continues to grow in importance as organizations balance cloud services with on-prem systems, legacy applications, and evolving business demands. Many teams want to manage their costs more effectively but are unsure how to apply a FinOps strategy for hybrid IT outside the cloud.

Mocking PostgreSQL the Easy Way: Simplifying Testing with Speedscale Proxymock

Every developer who’s worked with PostgreSQL knows the pain: testing against a real database slows everything down. You need the database running locally, loaded with the right data, and configured to match production as closely as possible. Every time you run a new test or build, you’re forced to repeat that setup: migrate schemas, seed test data, and clean everything up again. It’s time-consuming, brittle, and hard to scale across a team.

Grateful for Good Connections: Finding Calm in a Demanding Financial World

As the year winds down, my inbox is overflowing with Black Friday offers and festive greetings. It’s that time when Thanksgiving and the run-up to December holidays remind us to pause and appreciate what truly matters. Yet, in my recent conversations with IT leaders in financial services, I’ve noticed something: the time and calm needed to do this feel elusive.

Lessons from KubeCon: What "Best-of-Breed" AI SRE Really Requires

This year’s KubeCon underscored a real shift: AI SRE has gone mainstream. Of course, it’s not a surprise. Teams from high-growth startups to Fortune 500s are running more complex, cloud-native systems, shipping more AI-generated code, and facing rising expectations. Downtime is absolutely not an option and the work for on-call SREs has become unsustainable. The question isn’t whether AI SRE helps. It’s which one you can trust in production.

7 Observability Solutions for Full-Fidelity Telemetry

You don’t have to choose between capturing every signal and keeping costs predictable. Modern observability stacks blend full-fidelity storage (time series or columnar systems like ClickHouse and Apache Druid), tail-based sampling for heavy traffic, and tiered storage (hot/warm/cold with S3-backed archives). This gives you full-fidelity incident forensics with the day-to-day cost profile of a sampled setup.
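
Tail-based sampling, mentioned above, means deciding whether to keep a trace only after it has finished. A minimal Python sketch of one such decision rule follows. The `Trace` fields, the thresholds, and the `keep_trace` name are hypothetical and not taken from any particular product.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(trace: Trace, slow_ms: float = 1000.0,
               baseline_rate: float = 0.05) -> bool:
    """Decide after the trace has finished, so errors and slow requests are always kept."""
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    # Hash the id so the baseline sample is deterministic per trace.
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2 ** 32  # value in [0, 1)
    return bucket < baseline_rate
```

Hashing the trace id instead of drawing a random number means every collector instance reaches the same verdict for the same trace.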

The $8.8 trillion advantage: how open source software reduces IT costs

Open source software is known for its ability to lower IT costs. But in 2025, affordability is only part of the story. A new Linux Foundation report, The strategic evolution of open source, reveals that open source has evolved from a tactical cost-saving measure to a mission-critical infrastructure supporting enterprise-grade investments, and delivering stronger business outcomes as a result.

Top Causes of Data Center Outages and How You Can Reduce Risk

Outages are less common than they once were, but when they happen, the impact is severe. According to the Uptime Institute Global Data Center Survey 2025, half of data center operators reported at least one impactful outage in the past three years, and one in ten of those caused a serious or severe disruption. The financial risk is just as significant. 20% of operators said their most recent outage cost more than $1 million when accounting for downtime, recovery, and reputational damage.

Inside The Builders Era: Why Developer Craft Matters More Than Ever

The software world has spent the last two years obsessed with one question: “Will AI replace developers?” Wrong question. The right question is: “How do developers stay in control while AI becomes part of the toolchain?” Welcome to The Builders Era, where the craft of software development and AI’s computational power meet on developer terms. Not as a replacement narrative. Not as a threat to our profession.

How Much Did OpenAI's 30,000 CPU Core Optimization Save Them?

I admit I was a little skeptical going into KubeCon 2025. The last time I went, in 2022, it felt tactical. I heard lots of conversations around small solutions to small problems. Practical knowledge-sharing is of course beneficial, but I’m most inspired by the big picture — ideally, a picture bigger than you can see anywhere outside of your mind. I’m heartened to say that KubeCon 2025 was exactly that.

Understanding Kafka with Speedscale #speedscale #kafka #visualization #engineering #production

In this video, we're breaking down the complex world of Apache Kafka and showing you how to gain deep visibility into your event streaming architecture using Speedscale. Kafka is the backbone of modern, cloud-native systems, but understanding what's happening in production (which topics are receiving traffic, where messages are going, and how services are interacting) can be a real challenge. We'll cover how Speedscale makes Kafka visualization and debugging simple by.

Free cloud credits: Why your architecture gets lazy and bloated

This is the uncomfortable truth about cloud credits: Short-term savings mask crippling long-term costs. Taken from our recent webinar, Civo CCO Simon Hansford and Canopy Founder James Marks expose the primary concerns of the credit model. Credits act as a dangerous incentive for architectural laziness. When cost isn't a factor, you stop designing for efficiency, leading to bloated, inefficient infrastructure and the inevitable bill shock.

Cloud Efficiency Masterclass: 6 Data-Driven Ways To Reduce Costs And Scale

Discover the basics of cloud efficiency as well as six advanced data-driven strategies you can use to make your cloud environment more efficient. With incredibly complex cloud architecture — that may even include Kubernetes and multi-tenant infrastructure — organizations are finding it hard to measure and monitor the performance and cost of their cloud environments.

4 Golden Signals of System Reliability: A Practical Guide for Your Team

Modern systems produce endless streams of metrics. CPU usage, request volume, cache hit rates, node counts, queue depth, the list keeps growing. With this much data, it’s easy for teams to get lost in dashboards without knowing what actually matters. That’s why DevOps and SRE teams rely on the 4 Golden Signals of System Reliability. They provide the simplest and clearest way to understand user experience and system health.
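
The four golden signals, as commonly defined in SRE practice, are latency, traffic, errors, and saturation. A rough Python sketch of computing them for one time window is shown here. The `Request` record, the `golden_signals` function, and the choice of p95 and of in-flight requests as the saturation measure are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float
    status: int


def golden_signals(requests: List[Request], window_s: float,
                   in_flight: int, capacity: int) -> Dict[str, float]:
    """Summarise one time window as latency, traffic, errors, and saturation."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    rank = -(-95 * len(durations) // 100)  # ceil(0.95 * n), nearest-rank p95
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": in_flight / capacity,
    }
```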

Incident Management vs Change Management: Key Differences Explained

Incident management and change management highlight a core difference teams face every day. One is a reaction to failure. The other is a planned improvement. That’s the heart of incident management vs. change management. Both keep systems reliable, and both help teams move faster without breaking things. Let’s explore how they differ and how they work together.

Top 7 Observability Platforms That Auto-Discover Services

You can use an observability platform that automatically discovers your services and provides ready-to-use dashboards with minimal setup. If you're running a system where microservices come and go, containers shift around, or serverless functions scale up quickly, this kind of experience saves you a lot of time. You gain visibility as soon as something goes live, without requiring any additional steps on your part. In this blog, we talk about the top seven platforms that offer these capabilities.

New Feature Friday: Understand & Improve Your DORA Performance with Cortex

This week on New Feature Friday, we’re highlighting two new releases that make it easier than ever to understand and improve your DORA performance. DORA Academy Course: a guided learning experience that shows you how to use DORA Metrics and Cortex together to drive better engineering outcomes without the data chaos. DORA Operational Readiness Scorecard: an out-of-the-box template that benchmarks each service against DORA standards, giving teams an instant snapshot of where they stand and where to focus.

Enhanced Environment Compliance with Environment Policies

We’re excited to announce an important enhancement to Kosli that will improve how environment compliance is managed across your organization. Starting with our next release, all compliance evaluation for Kosli environments will be consolidated through our powerful Environment Policies feature.

Searching Certificate Transparency Logs (Part 3)

ClickHouse is an incredible database. Here at Certkit, we’ve long worked in the world of “No SQL” databases like Elasticsearch precisely for their ability to query large amounts of data. But for every database, there’s an amount of data that’s “Too big”. Too big to query quickly or too big to store affordably. ClickHouse manages to thread the needle by efficiently storing truly ridiculous amounts of data while still providing impressive query performance.

The 7 Most Common Incident Mistakes (and How to Prevent Them)

The hidden blockers slowing down your incident response and how to remove them before they become reliability risks. Incidents rarely go wrong because of one big failure. Most of the time, it’s a handful of small, familiar mistakes that slow teams down, muddy communication, or create confusion in the heat of the moment. Fortunately, these mistakes are predictable and fixable.

Packaging Operations Runbooks with Puppet Edge Workflows

Puppet Edge Workflows, available with Puppet Enterprise Advanced, provide the orchestration tools to define multistep workflows to run against your infrastructure. This allows Puppet experts to create workflows that Ops teams can run without deep knowledge of the Puppet language or the underlying infrastructure.

Searching Certificate Transparency Logs (Part 2)

In the last post we discussed why we’re building our own Certificate Transparency (CT) search tool. There’s good background on the CT ecosystem in that post, so check it out if you haven’t. This post assumes a certain understanding of terminology covered previously. Now that we know where the CT logs live, and the different kinds of logs, we need to start reading them.

Vendor lock-in: not even once

Vendor lock-in remains one of the most significant concerns when choosing a cloud platform. When your data becomes trapped in proprietary formats or services, migration costs skyrocket and your flexibility disappears. This challenge affects organizations of all sizes, from startups planning for growth to enterprises managing complex compliance requirements.

5 Kubernetes Cost Management Insights From CloudZero's Latest Webinar

Kubernetes has reshaped how teams build and scale infrastructure, but it’s also made cost visibility a lot harder. For platform engineers, SREs, and FinOps leads, breaking down shared cluster costs, understanding per-team usage, and driving efficient resource allocation is still a major challenge. That’s why one CloudZero webinar with Umesh Rao, Director, Tech Enablement and John Hashem, Senior Sales Engineer, stood out.

What is Jira Service Management (JSM)? Key Features & Benefits Explained

Atlassian is shutting down OpsGenie. New sales stopped on June 4, 2025. Complete shutdown happens on April 5, 2027. Atlassian wants you to migrate to Jira Service Management (JSM). But like many OpsGenie users, you probably have questions. What is JSM? How does it handle alerting, escalation policies, and on-call schedules? What automation options does it have? Is it the right fit? And more. This blog breaks down everything you need to know.

How to Reduce Log Data Costs Without Losing Important Signals

You can cut your log costs by removing repetitive, low-value logs early and keeping only the parts that genuinely help you understand issues. Modern systems generate logs far faster than you expect. Even when your workload stays stable, infrastructure components, retries, and background workers continue producing a steady stream of repeated entries.
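
One way to picture "removing repetitive logs early" is to collapse lines that differ only in their numbers. The Python sketch below does that for a batch of lines. `collapse_repeats` is a hypothetical helper, and masking every number is a deliberately crude assumption that would also hide status codes in a real pipeline.

```python
import re
from typing import Dict, Iterable, List

_NUMBER = re.compile(r"\d+")


def collapse_repeats(lines: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each message shape and append a repeat count."""
    seen: Dict[str, list] = {}  # shape -> [first line, count]; dicts keep insertion order
    for line in lines:
        shape = _NUMBER.sub("<n>", line)  # mask ids, durations, retry counters
        if shape in seen:
            seen[shape][1] += 1
        else:
            seen[shape] = [line, 1]
    return [first if count == 1 else f"{first} (repeated {count}x)"
            for first, count in seen.values()]
```

Keeping the first full line plus a count preserves one concrete example for debugging while dropping the rest of the volume.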

The most important question to ask in the build vs. buy debate

Every growing engineering organization eventually faces the seemingly impossible decision between building a custom solution or buying one off the shelf. It’s a debate that often (and incorrectly) ends by choosing whichever option is less expensive. However, it’s become clear that solving the build vs. buy puzzle boils down to understanding what you want to be good at and whether your internal build is actually unique.

Reliability lessons from the 2025 Cloudflare outage

On November 18, 2025, X, ChatGPT, Shopify, and many other major sites went offline simultaneously. Even Downdetector, Ookla’s popular outage tracking website, briefly went offline. What caused this issue? Why were so many major websites affected by it? And what steps can you take to reduce the impact on your own applications?

What's New in MariaDB 12: Full Release Guide

It is no longer news that MariaDB 12 has officially landed, and it’s making waves across the database world. For over a decade, MariaDB has been the go-to open-source database for developers and businesses seeking a stable and innovative MySQL-compatible platform. This new release further enhances its value. MariaDB 12 delivers major performance upgrades, new features, and significant redesigns to enhance speed, scalability, and developers’ experience.

Better integration tests in Cursor using proxymock

Cursor is fantastic at cranking out code changes. I recently used it to splice a brand-new downstream API call into one of our Go microservices, and the diff looked great. The unit tests finished before I lifted my coffee mug, yet I still had zero certainty the change would survive contact with real traffic. That gap is all about integration tests, so I paired Cursor with proxymock and the outerspace-go demo service to prove the behavior end to end.

Making Your Business Resilient Against Cloudflare Like Outages

Cloudflare-like outages can cost your business a significant amount of money. This week’s Cloudflare global outage is a wake-up call for business resilience. You can stay resilient against such outages by regularly performing resilience testing and updating your application or infrastructure configurations.

Build a multi-agent AI system using CrewAI, Gemini, and CircleCI

Multi-agent AI systems are trending in the software development industry right now. These systems consist of a group of individual agents that collaborate to achieve a desired goal. They mimic real world teams and departments in how they are organized. In multi-agent AI systems, each agent is assigned a task that is required to achieve a final output.

Agents of IT podcast - Ep. 6 - What's real agentic AI and what's just hype?

Sean Heuer and Ari Stowe break down “agent washing,” governance, and what it really means for AI to take action instead of just chatting. In this clip from Agents of IT, they share practical ways to spot the difference between chatbots, scripted automations, and true agentic systems that can plan, reason, and execute autonomously. Watch the full episode to hear their perspective on.

ML inference in PHP by example: leverage ONNX and Transformers on Symfony

This blog is based on a presentation by Guillaume Moigneu at the Symfony 2024 conference. Machine learning and AI are no longer limited to Python and Node.js. PHP developers can now run AI models directly in their applications using modern tools and libraries. This guide shows you how to implement machine learning inference in PHP using ONNX and Transformers.

It's Never Different This Time: LLM Reliability Without the Hype with Julien Simon

In this episode, Julien Simon, longtime voice in the open-source ML world, reminds us that even in the era of GenAI, reliability fundamentals haven’t changed. Julien breaks down why calling “the same model” from different providers can produce wildly different results, how deployment choices introduce hidden variability, and why reliability teams need to think of LLM systems as distributed systems.

AI wrote the code, but can you trust it? #aicoding #integration #cursor #devops #speedscale

Using AI coding tools like Cursor is fast, but it leaves a massive question: Is the new code going to break production? We solve this by combining Cursor with Proxymock! I take a live traffic snapshot of my running app, feed it back to the AI, and instantly run realistic integration tests locally. It's the only way to get true confidence before you push. Watch the full video below!

The database professional of the future: headlines from Redgate's Keynote at PASS Data Community Summit 2025

Redgate took the main stage earlier today to open PASS Data Community Summit with our keynote, where we shared our vision for the future of the database development experience – one driven by speed, safety, and the intelligent use of AI. As data estates grow in scale and complexity, and as organizations push to deliver software faster than ever, the role of the database is undergoing significant change.

OTel Updates: Complex Attributes Now Supported Across All Signals

OpenTelemetry now supports maps, heterogeneous arrays, and byte arrays across all signals. Here’s where these new types shine — and where simple primitives still fit naturally. If you’ve been working with OpenTelemetry for a while, you’re likely familiar with the straightforward key-value approach to attributes. It’s simple, fast, and works well with how most telemetry backends store, index, and query data.

What is AWS Fargate for Amazon ECS?

As cloud applications moved from VMs to containers and then to microservices, the amount of background work needed to keep everything running grew just as quickly. You gain speed and flexibility, but you also end up managing clusters, scaling rules, and capacity choices that don’t really add to the product you’re building. AWS Fargate steps in right there. It lets you run your ECS tasks without looking after any servers at all.

When to Move From Public Internet to Private Connectivity

Struggling with latency, congestion, or compliance issues? Discover when it’s time to move from public internet to private connectivity. Network operations have never demanded more than they do now, leading many network managers to question whether the public internet is enough. While many organizations begin their network journey with VPNs over the public internet, they often bump into limitations quickly and begin exploring the natural next step – private connectivity.

Get more from your AI chief of staff with these prompts for engineering leaders

Engineering leaders face a constant barrage of questions that pull them away from strategic work. A team lead asks about scorecard compliance. A PM wants a status update on a migration. Someone needs incident trend data for a quarterly review. Each question is reasonable. Each requires context switching, digging through dashboards, or pinging someone on your team for a report. What if you could just ask?

AWS Cost Categories Explained (How To Allocate AWS Spend Accurately)

If you’ve ever tried to make sense of your AWS bill, you know how fast things get messy. Different accounts, hundreds of services, random tags, and suddenly, no one can say for sure who’s spending what or why the total looks so high. It’s not that teams don’t care about costs — it’s that AWS billing data isn’t always easy to interpret. Finance wants accountability. Engineering wants visibility. And somewhere between the two, ownership disappears.

Introducing The Enhanced CloudZero Academy: Learn, Grow, And Level Up Your FinOps Skills

If there’s one thing we’ve learned at CloudZero, it’s that success in FinOps isn’t just about having the right tools. It’s about knowing how to use them, and understanding the “why” behind every number, dimension, and dashboard.

Fast and Clear Presentations for Tech Teams: Saving Time While Explaining Complexity

We've all been there. Technical complexity is our daily bread, but explaining it? That's where things get messy. Technical presentations shouldn't feel like translating ancient hieroglyphs. Yet here we are, drowning audiences in diagrams that look like subway maps. The irony? We build systems for clarity and efficiency, then create presentations that achieve neither. Modern IT teams face a presentation paradox. We need to move fast, really fast, while ensuring everyone actually understands what we're building, breaking, or fixing.

Best Cheap Black Friday VPS Deals - November 2025: A Cost-Based Analysis

It is November 2025, Black Friday is here, and the VPS hosting world is getting ready for its biggest sale event of the year. VPS deals with huge discount percentages will appear on every website, but the real savings aren't always what they look like at first glance. This guide focuses on total cost and provides a few different options that you can consider.

KubeCon + CloudNativeCon 2025: Recap

Hi everyone, my name is Bailey Ahrens, and I’m a marketing intern at Speedscale! I just returned a few days ago from KubeCon + CloudNativeCon North America 2025. While it was only my second trade show, it felt like a huge step forward in my confidence, skills, and future direction. My first conference (API World) was all about stepping outside my comfort zone. KubeCon was where I started leaning into that confidence and finding my place in the tech community.

Tame multi-cluster chaos. A Platform Engineer's guide to distributed Kubewarden Policies with Fleet

For platform engineers managing multiple Kubernetes clusters, maintaining policy consistency is a constant struggle. Manually applying security rules across a growing fleet of clusters is inefficient and error-prone. This approach creates significant risks. As your environment scales, this operational burden becomes unsustainable. Each out-of-sync policy represents a potential security gap, increasing the cluster’s attack surface.

Strengthening Open Source Facter: Ensuring Compatibility and Essential Maintenance

Over the course of 2025, the Puppet Core team has been committed to developing secure, hardened Puppet code that our customers can rely on. As part of that shift, many Puppet platform components, including Facter, were brought under the Puppet Core model and were moved into private repositories.

Maintaining Software Excellence in the Age of AI Coding Assistance

In this preview of his AWS re:Invent session, Cortex CTO & Co-Founder Ganesh Datta breaks down how AI coding assistants are transforming software development, and what high-performing teams are doing to keep speed and reliability in balance. If you care about AI, engineering velocity, or building sustainable systems, this is a must-watch. Full Session: December 3 at 2:30 PM. Learn more about Cortex: go.cortex.io/reinvent.

When Bots Grow Brains: RPA and Agentic AI For the Win

For a long time, robotic process automation (RPA) was the fastest way to scale repetitive digital work. Bots copied, clicked, and executed rule-based tasks faster than any human. They reduced error rates and delivered early wins for efficiency. Sounds just fine, right? Prepare for a Matrix moment, because the truth is that IT teams built RPA only for predictability. It could follow instructions, but it couldn’t adapt when something unexpected happened.

AWS And Azure Outages Will Recur - Here's How You Ensure Resilience

The cloud has long promised limitless scalability and near-perfect uptime. But if you tried to access your Microsoft 365 dashboard or recline your smart bed last week, and got nothing but a spinning icon, you weren’t alone. In the span of 10 days, both Amazon Web Services (AWS) and Microsoft’s Azure Cloud suffered widespread outages that rippled across industries.

KubeCon Atlanta Signals Key Shift: From Cloud Cost To Value Engineering

After three days of demos, sessions, and hallway conversations at KubeCon Atlanta, one thing became clear to CloudZero CTO Erik Peterson: the cloud-native world is shifting from cost control to value engineering. Teams aren’t just fighting bills anymore. They’re fighting complexity, GPU scarcity, Kubernetes sprawl, and pressure from the business to justify every dollar of technical investment. And this year’s KubeCon attendees? They were ready for those conversations.

How Bitbucket powers compliance and code quality at scale

Bitbucket Cloud is more than a code hosting platform. We’re an enterprise partner, helping teams code together at scale with security, compliance, and flexibility at every step. As part of the Atlassian Cloud platform serving more than 300,000 organizations around the world, we’re continuing to build the next generation of Bitbucket Cloud as your trusted cloud vendor, whether you’re a global bank, healthcare provider, or a fast-scaling tech company.

Introducing Native Flyway Support for Harness Database DevOps

Harness Database DevOps has added Flyway support alongside Liquibase, offering teams a choice between structured changelogs and SQL-first migration scripts. This multi-engine approach ensures developers can use their preferred tool while benefiting from centralized governance, automated safety features, and a unified pipeline for all database changes. The goal is to make database delivery safer, more automated, and flexible across the enterprise.

Harness Database DevOps Adds Flyway Support

Harness Database DevOps has added Flyway support alongside Liquibase, offering teams a choice between structured changelogs and SQL-first migration scripts. This multi-engine approach ensures developers can use their preferred tool while benefiting from centralized governance, automated safety features, and a unified pipeline for all database changes. The goal is to make database delivery safer, more automated, and flexible across the enterprise.

Bitbucket Pipelines Advanced Configuration Options | Bitbucket Blitz | Atlassian

Unlock the full potential of your Bitbucket Pipelines! In this video, I introduce four advanced configuration options you can add to your bitbucket-pipelines.yml file to boost performance and cut costs. To get started, try adding a larger size parameter to your slowest steps, or set up retry conditions for flaky tests.

Boost Developer Experience with LLMs!

Your laptop is powerful enough to run your own LLM. Here's why that matters. While centralized AI tools help teams, they miss something critical: your personal knowledge. Meeting notes, tips, tricks, and context only you have. Kyle Fransham shows how running a local LLM changes the game. Index your own "master document of knowledge" and query it right in your dev environment. No cloud needed. The tools are accessible. The setup is simple. And the impact? Game-changing for how you work.

A New, Simpler Microsoft Teams Integration For Redgate Monitor

Microsoft is retiring Office 365 connectors, but there is now a new and easier way to send Redgate Monitor alert notifications to Teams, ready-formatted. Microsoft is retiring Microsoft 365 (Office 365) connectors on 31 March 2026. After that date, any Redgate Monitor alert notifications configured through the old connector method will stop appearing in Teams. From Redgate Monitor 14.1.0, there’s now a new and much simpler way to do it.

.NET Conf 2025 Highlights: Unlocking the Future With .NET 10 and AI Innovations

As the dotConnect team, we are proud to be a sponsor of .NET Conf 2025. This landmark event highlighted the key advancements of the .NET ecosystem, from major releases to AI-powered tools and inspiring community-driven projects.

Microsoft Fabric Data Warehouse: Features, Benefits, and Use Cases

The Fabric Data Warehouse was built to solve one of analytics’ biggest challenges: fragmentation. When data is spread across separate tools for ingestion, modeling, and reporting, teams lose time, accuracy, and visibility. As part of the Microsoft Fabric ecosystem, the Data Warehouse addresses this by unifying every stage of the analytics process into a single, connected environment.

Which Data Connectivity Product to Choose: ODBC, SSIS, Excel, or Python

Data connectivity solutions are the bedrock of a solid database management strategy. Here’s why. Databases rarely work in isolation. They are constantly interacting with various apps and cloud platforms. As such, ensuring that this interaction flows seamlessly is critical. This is where your business data connectivity solution comes in. But here is the problem. There is no one-size-fits-all connectivity solution.

Automating Chaos Engineering with Terraform

Automating chaos engineering with Terraform eliminates manual setup across environments by enabling you to version control your entire chaos infrastructure, from service discovery to security governance policies. The Harness Terraform provider supports end-to-end automation including Kubernetes infrastructure setup, custom image registries, Git-based ChaosHub management, and granular security controls that ensure safe experiment execution in production.

Top 9 Web Application Performance Monitoring Tools for 2025

You know that uneasy pause before opening your monitoring dashboard? The one where you're hoping nothing's broken—but a part of you knows something probably is. Performance issues often start quietly: a few slow endpoints, a checkout that takes longer than usual, a graph that looks a little off. Before long, those small signals turn into alerts and support tickets.

AI API Aggregation: Managing Costs And Complexity Across Multiple LLMs

Running multiple LLMs without aggregation can feel like managing five different clouds with no dashboard. Sure, you can make it work, but you won’t like the bill. And most SaaS teams didn’t start with a multi-LLM strategy. It just happened. You added one model for reasoning, another for summarization, or maybe a fine-tuned version for customer support. Fast-forward six months, and your AI stack looks like a tangle of APIs. And each charges tokens on its own terms.

3 Signals From KubeCon Atlanta On Where Kubernetes Is Heading Next

KubeCon Atlanta 2025 felt different this year — and CloudZero had a full team on the ground to capture it. Engineers, product leaders, sales reps, and CTO Erik Peterson spent three days embedded across the show floor. Their vantage points were complementary: the outbound conversations, the inbound questions, the demos, the technical deep-dives, and the quieter moments between sessions. Five perspectives stood out.

Reliability lessons from the 2025 Microsoft Azure Front Door outage

On October 29th, 2025, Azure Front Door suffered an outage that impacted Microsoft services on a global level, including Microsoft 365, Outlook, Xbox Live, Copilot, and more. It also affected Microsoft Azure, meaning companies like Costco, Starbucks, and Alaska Airlines ran into issues for both customer-facing and internal systems. The root of the issue was a misconfiguration in the data plane for Azure Front Door and the Azure Content Delivery Network.

Automating Network Devices with NETCONF and YANG in Puppet Edge

This video offers a practical guide to automating network devices using YANG data models and the NETCONF protocol while using Puppet Edge. Gain the knowledge to streamline your network operations and enhance consistency. Perforce Puppet gives IT operations teams back their time and offers peace of mind with infrastructure automation that enables security and compliance.

The AI Workload Punishes Bad Habits

The AI workload presents the ultimate challenge, highlighting the structural limitations of the traditional hyperscaler model. In this segment from a Civo Navigate London 2025 session, Kelsey Hightower explains exactly why AI adoption forces enterprises to confront flawed architecture and rising astronomical costs. When specialized hardware is scarce and rented GPUs sit idle at a premium, it’s clear that traditional cloud providers were not built for this era. Data that didn't move is forcing organizations to move compute back to where it lives.

Searching Certificate Transparency Logs (Part 1)

Every TLS certificate issued by a root Certificate Authority (CA) ends up in one or more publicly accessible logs. These logs, collectively, make up the Certificate Transparency (CT) ecosystem. Unfortunately, the logs are not very searchable. You can’t easily type in a domain and find all associated certificates. At CertKit we’re building CT monitoring capabilities to notify our customers when a new certificate is issued.

Leaner, greener business practices

Pulsant recently pledged to slash its carbon and other emissions as part of a thorough review of the entire business. Our goal is to halve all emissions by 2030 and achieve Net Zero by 2050 at the latest. This will require a continued and sustained effort. To be effective we will need to understand our connections to bring all our suppliers, vendors, clients, and of course our people, with us. Our ambition will be validated in accordance with the Science Based Targets Initiatives’ Net Zero Standard.

Pulsant Pledges to Reach Net Zero by 2050

“As the UK’s hybrid cloud specialist we are already helping clients reduce their environmental impact by ensuring the most efficient use of their technology infrastructure. I am really proud that this pledge to shift to Net Zero takes us, and our clients, to the next stage on this vital journey.” – Rob Coupland, CEO, Pulsant. Pulsant is promising to achieve Net Zero by 2050, or earlier if possible.

Shopware and Upsun expand strategic partnership to accelerate European eCommerce innovation and secure digital sovereignty

French and German leaders join forces in strategic partnership to bring flexibility and reliability to the European eCommerce market. After three years of successful collaboration, Shopware, the German-based European leader in open-source eCommerce, and Upsun, the French-based leading European Cloud Application Platform, are announcing a strategic partnership. Building on early success, with already 45 joint customers and growing, Shopware and Upsun are deepening their collaboration in 2026 and beyond.

What Are SQL Server Agent Jobs: Guide With Examples

Behind every reliable dashboard and morning report stands a system of accountability: SQL Server Agent jobs. They’re what keeps backups on time, analytics flowing, and data pipelines steady when everything else moves fast. However, to sustain that reliability, SQL Server Agent must operate with precision at scale.

Jira Service Management (JSM) Review for Incident Management (2025)

Atlassian is shutting down OpsGenie. New sales already stopped on June 4, 2025, and the platform will be completely offline by April 5, 2027. As an OpsGenie user, you now face a critical decision: Migrate to Jira Service Management (JSM), Atlassian’s recommended path, or choose a different solution. And if you’re not sure JSM is the right fit for your team’s incident management needs, this review will help you decide. I signed up for JSM and put it through real-world testing.

A CISO's preview of open source and cybersecurity trends in 2026 and beyond

Open source has come a long way. Recently I was watching a keynote address by our founder, Mark Shuttleworth, in which he discussed his vision for Ubuntu to provide quality support and security maintenance across the broad open source ecosystem, and it made me reflect on how far the open source software (OSS) community has come. Indeed, when looking at today’s interoperable open source landscape, the fragmented, disconnected landscape of the past seems like another planet.

The 3 AI Jobs That Didn't Exist 2 Years Ago!

People worry about AI taking jobs, but what about the new roles AI is creating? James Faure, CEO of Clairo AI, breaks down the three essential non-technical jobs that have emerged in the last two years: Prompt Engineers, Context Architects, and Evaluators. Learn the crucial skills needed to be highly employable in the future of AI.

Azure Synapse Explained: Analytics And Business Value

Traditionally, companies had to use separate tools for ETL, data storage, and analytics. Often, this resulted in slow, complex, and expensive data workflows. A good example is PwC’s Deals, Insights & Analytics (DIA) team, which faced similar challenges. According to Microsoft, bespoke solutions often took months to build and were difficult to merge, slowing projects and driving up costs. That changed when PwC adopted Azure Synapse Analytics. The result?

Reimagining software delivery with AI-powered workflows in Jira & Bitbucket

If you’re like most developers, you know that writing code isn’t the bottleneck anymore. AI has made it faster than ever, and chances are you’re already using it. Yet, delivering software is still complex because of everything else you have to manage: fixing vulnerabilities, reducing tech debt, cleaning up feature flags, ensuring test coverage, writing documentation, and the list goes on. That’s why we built Rovo Dev, a context-aware AI agent for developers.

Dynamic Stage | Execute a Pipeline within a Stage!

The new Dynamic Stage allows you to import and execute an entire pipeline's YAML definition inside a single stage of your current pipeline. It is essentially running a pipeline within a stage. The pipeline YAML can either be generated and transformed at runtime in a previous stage, or be directly provided to the source input of the Dynamic Stage in encoded form. Dynamic Stages work seamlessly across Harness CI and CD modules.

Bloom filters: the niche trick behind a 16× faster API

This post is a deep dive into how we improved the P95 latency of an API endpoint from 5s to 0.3s using a niche little computer science trick called a bloom filter. We’ll cover why the endpoint was slow, the options we considered to make it fast and how we decided between them, and how it all works under the hood.
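For readers unfamiliar with the data structure, here is a minimal, generic Bloom filter sketch. It is not the implementation described in the post: the class name, the bit-array size, the number of hashes, and the use of salted SHA-256 are all assumptions chosen for illustration. The key property is that a lookup can return a false positive but never a false negative, which is what lets a service skip expensive work for items that are definitely absent.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter()
for key in ("alert-1", "alert-2", "alert-3"):
    bf.add(key)
```

Every added key is reported as possibly present, and an empty filter reports every key as absent. Sizing `num_bits` and `num_hashes` against the expected number of items controls the false-positive rate.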

AI Table Stakes: The Enterprise Reality Check

This 5-minute critique pulls back the curtain on where AI is succeeding and where the biggest challenges remain. Experts expose the gap between market hype and reality: the failure to deploy fully autonomous production agents and the missing human-machine interface for non-developers. It’s a challenge to the entire industry.

How to Automate Change Management Evidence using Kosli and ServiceNow

Are your deployments getting stuck waiting for approvals? Your code is ready. Your tests are green. But your ServiceNow change ticket is still holding up the release. In most organizations, this isn’t a people problem or a process problem. It’s an evidence problem. Every release has to prove that it met the required checks — tests, scans, reviews, and approvals. But when that proof isn’t instantly available, everything slows down.

Storage and Story: Why Artifact Repositories Need Provenance

An artifact repository like JFrog Artifactory is a cornerstone of modern DevOps. It stores binaries, versions, and release bundles — your complete “what.” But when audits or incidents happen, the question quickly shifts from what to how: “How did this artifact get here — and can we trust it?” If all you have is a warehouse of files, you’re left scrambling to reconstruct the story. You check pipeline logs. You pull test results. You cross-reference approvals.

Need more juice? resources:set. Done.

Scaling your application shouldn’t feel like open-heart surgery. It should feel like flipping a switch. Watch your environment adapt in real time. Horizontal scaling. Vertical scaling. One command. Done. You do not want another war room. You want a clear way to add capacity when traffic increases, without editing and testing complex YAML files for hours or manually rolling out scripts across clusters.

Building the Future of Software Delivery Controls: Inside the FINOS SDLC Governance Working Group

In October, technologists from across the financial industry gathered in New York for OSFF 2025 where the general theme was clear: open collaboration has moved from promises to proof. Projects like Fluxnova and OpenGris showed how institutions can build shared, production-grade infrastructure. The Common Cloud Controls and AI Governance Framework demonstrated that regulatory assurance can be achieved collaboratively, not competitively.

Build Your Kubernetes Monitoring Foundation with kube-prometheus-stack

When you run Kubernetes at scale, one of the first challenges is understanding what the cluster is actually doing. Workloads shift around, pods restart for normal reasons, and traffic doesn't always follow the patterns you expect. Having clear signals makes day-to-day operations much easier. That's where kube-prometheus-stack helps. It brings Prometheus, Grafana, Alertmanager, and supporting components together as a single package.

Beyond Models: JFrog AI Catalog Evolves to Detect Shadow AI and Govern MCPs

When we first introduced the JFrog AI Catalog, it was our mission to provide the industry with a single system of record for governing the complex landscape of internal, open-source, and external commercial AI models. This foundational step was critical for enterprises to move from uncontrolled innovation to delivering AI with trust and confidence. However, the AI landscape is ever-evolving. The challenge for today’s enterprise is already evolving beyond simply managing a library of known models.

Securing Vibe Coding: JFrog Introduces AI-Generated Code Validation

A fundamental shift in software development is already here. Gartner predicts that by 2028, 75% of enterprise software engineers will use AI code assistants – a massive leap from less than 10% in early 2023. While this AI-driven speed creates a competitive advantage, it also opens a dangerous new front in the battle for software supply chain security.

Canonical Kubernetes officially included in Sylva 1.5

Sylva 1.5 becomes the first release to include Kubernetes 1.32, bringing the latest open source cloud-native capabilities to the European telecommunications industry. With the launch of Sylva 1.5, Canonical Kubernetes is now officially part of the project’s reference architecture. This follows its earlier availability as a technology preview in Sylva 1.4.

An Open SDLC Controls Framework for Financial Services

How can financial institutions align on software delivery governance without slowing down innovation? At FINOS OSFF New York 2025, Deutsche Bank and Morgan Stanley introduced the new SDLC Governance Working Group, an open collaboration under FINOS to create a Common Controls Catalogue for software delivery. Kosli's Mike Long helped form this group and participates in it, contributing expertise in continuous compliance automation and controls engineering to connect the engineering and policy communities.

Redgate Monitor

Redgate Monitor helps you manage your entire database estate from a single pane of glass. Monitor SQL Server, PostgreSQL, Oracle, MySQL, and MongoDB – on premises, in the cloud, or in hybrid environments. Get database observability to proactively mitigate potential risks with instant problem diagnosis and customizable alerting. No downtime, customer complaints, or wake-up calls at 3am.

Introducing Redgate Test Data Manager with AI: Smarter, Safer Test Data Management

Discover how Redgate Test Data Manager’s new AI features deliver fast, compliant, production-like test data - balancing realism, speed, and security. In regulated industries like finance, healthcare, and insurance, test data management (TDM) can be quite challenging when it comes to compliance.

How to use Samsung Knox Mobile Enrollment (KME) to enroll your devices with AirDroid Business

Welcome to this tutorial video where we walk you through the process of using Samsung Knox Mobile Enrollment (KME). AirDroid Business is a mobile device management (MDM) solution that helps you manage all your Android devices effectively and securely. Follow us.

Sponsored Post

Cascading Failures Aren't Inevitable: Lessons from the AWS DNS Outage

AWS outages grab headlines because they affect millions, but the root cause often comes down to something invisible: DNS failures and cascading service dependencies. The complexity of modern cloud systems, combined with the advanced technology powering platforms like AWS, makes these outages particularly challenging to diagnose and resolve. The recent AWS outage proves one thing: you can't prevent every DNS issue, but you can create resilient architectures and prevent a single failure from taking down your entire service if you test for it.

Devart ODBC Drivers Get Major Update With GUI for macOS/Linux, PostgreSQL 18 Support, and Enhanced Security

We are thrilled to announce a major update across our line of ODBC Drivers, introducing a graphical configuration interface for macOS and Linux, extended authentication methods, and compatibility with the latest database versions including PostgreSQL 18 and MariaDB 12.

Part 3: Building a Production-Grade Traffic Capture and Replay System

At a previous company, we had over 100 microservices. I’d make what seemed like a simple change to one service and deploy it, only to discover it broke something completely unrelated. A change to the user service would break checkout. An update to notifications would break reporting. We spent more time fixing unexpected bugs than shipping features. The problem was our test scenarios were too simple.

Top DevOps Challenges in 2025 and How APM Solves Them

In 2025, DevOps continues to grow and change quickly, helping teams deliver software faster and more securely. But as systems become more complex with microservices, cloud platforms, and AI-driven tools, new challenges arise. Teams now need to balance speed with security, manage too many tools, control rising cloud costs, and still maintain high-quality software. This is where Application Performance Monitoring (APM) becomes essential.

Our Engineering in the Age of AI: 2026 Benchmark Report finds AI is making engineering faster, but not necessarily better

Everyone's talking about how AI is transforming software development. Teams are shipping more code, deploying more frequently, and getting features to market faster than they could a year ago. The productivity gains are real. But we kept hearing a different story from engineering leaders. Yes, velocity is up. But incidents are climbing, resolution times are getting longer, and code review processes are struggling to keep up.

Redgate Software recognized as a Strong Performer in Gartner Peer Insights Voice of the Customer for Infrastructure Monitoring Software

We’re thrilled to share that Redgate Software has been recognized as a ‘Strong Performer’ in the 2025 Gartner Peer Insights Voice of the Customer for Infrastructure Monitoring Tools category with our Redgate Monitor solution. We believe this recognition is a reflection of the trust and feedback from the people who matter most: our customers.

OTel Updates: OpenTelemetry eBPF Instrumentation (OBI) Hits Alpha

Some parts of a system don’t lend themselves to quick instrumentation changes. You might have a production binary that hasn’t been rebuilt in years, or a stack made of several languages where each team manages telemetry differently. In those situations, getting consistent signals often means touching code you’d rather leave alone or coordinating updates across many services. OpenTelemetry eBPF Instrumentation (OBI) approaches this from the kernel side.

How to debug an Android app in Anbox Cloud

In this video, the Anbox team demonstrates how to debug an Android application with Android Studio in Anbox Cloud. What is Anbox Cloud? Anbox Cloud lets you run virtualized Android environments securely, at any scale, to any device, letting you focus on your use case. Run Android in system containers, not emulators, on AWS, OCI, Azure, GCP or your private cloud with ultra low streaming latency.

Using Google Cloud Billing Tools For Cost Control

If you’ve ever opened your Google Cloud bill and felt confused, you’re not alone. Costs in GCP can spike up fast. One project here, a few APIs there, until the total looks nothing like what you expected. That’s because Google Cloud pricing is built for flexibility. You pay for what you use, across dozens of services, each with its own rules, discounts, and data charges. It’s robust, but also easy to lose sight of what’s driving your spend. The good news?

Your Cloud Economics Pulse For November 2025

Welcome to CloudZero’s inaugural Cloud Economics Pulse! This is our monthly snapshot of how cloud spend is evolving across providers, services, and emerging AI workloads. Each month, we’ll surface key trends, highlight where the money’s moving in cloud spend, and offer practical insight to help FinOps and cloud cost teams stay on top of things. October — and preceding months — shows a turbulent stretch for cloud economics.

The AI Visibility Problem: When Speed Outruns Security

Harness surveyed 500 security practitioners and decision makers responsible for securing AI-native applications from the United States, UK, Germany, and France to share findings on global security practices. The State of AI-Native Application Security 2025 dives deep into AI visibility and the changing landscape of security vulnerabilities. If 2024 was the year AI started quietly showing up in our workflows, 2025 was the year it kicked the door down.

How to Manage Grafana Access Groups for Team Control

Managing team access in Grafana can be tricky—especially as your organization grows. That’s where Grafana access groups (also known as Limited Access Groups in Hosted Graphite) come in. They allow you to define groups of dashboards and restrict which team members can access them. If you’re using Hosted Graphite with Grafana dashboards, this feature helps you organize teams, maintain data privacy, and simplify access control—all while giving users just the permissions they need.

Harness Commitment Orchestrator: A Modernized FinOps Experience

Harness has modernized Commitment Orchestrator to give FinOps teams a clearer, faster, and more intelligent way to manage cloud commitments. The redesigned experience unifies visibility across RIs, Savings Plans, on-demand, and Spot usage, with expanded metrics and streamlined workflows for smarter decisions. Built on a foundation for AI-powered insights, it helps teams optimize spend with greater confidence and less manual effort.

The Hidden Bottlenecks in AI Infrastructure (and How to Fix Them)

Artificial intelligence has entered an era where infrastructure is the real moat. Teams spend millions on GPUs, yet models still stall, latency spikes unpredictably, and throughput flatlines at 20% of what spec sheets promise. These hidden bottlenecks lurk far beneath the surface - in power grids, network fabrics, memory bandwidth, orchestration layers, and even governance policies. In this guide, we uncover where AI infrastructure actually breaks, what the emerging data and research reveal, and how Clarifai's reasoning and orchestration stack helps eliminate these unseen friction points.

Boosting Profitability with Ribbon Intelligent Solutions

Ready to make your network smarter? Join Ribbon’s exclusive preview before Africa Tech Festival. Attend to learn about Ribbon's Intelligent Automated IP Optical Networks, Multilayer Automation & AIOps, and the Path towards Autonomous Networks. Don’t miss this chance to see what’s next for service providers.

Improve Observability in Your CI/CD Pipeline

The backbone of modern software development is automation, and at the heart of that lies the CI/CD pipeline. It’s what turns code into deployable software, delivering changes to users faster, safer, and more predictably. In simple terms, a CI/CD pipeline automates everything from the moment developers push code to when it reaches production. It integrates, tests, builds, and deploys software continuously, ensuring faster releases with fewer human errors.

Add Postgres with one YAML line. Deploy in under a minute.

Let’s break down why this matters, and how it can change the way you approach building and running applications. You want database power without getting bogged down in tooling and config. Most of your week should be building features, not hunting for connection strings or maintaining bespoke infra scripts. Developers tell us they just want to code and solve application problems, with minimal platform friction.

Certificate revocation is broken but we pretend it works

Last week, someone commented on my post about 47-day certificates: This perfectly captures our collective delusion that SSL certificate revocation works. You click a button, the certificate stops working. And why wouldn’t we believe that? Every CA has a big “Revoke Certificate” button right there in the dashboard. It must do something, right? Here’s the dirty truth: most revoked certificates keep working.

Open source vs commercial AI: choosing the right path for your business

This blog is based on a presentation by Guillaume, Field Chief Technology Officer at Upsun, and Robert from Ilwiin Technology during the AI Action Summit. The original French presentation has been translated and edited for clarity and accuracy. The AI field is advancing significantly, presenting organizations with the question: Should they choose open-source or commercial AI models? This choice impacts everything from costs and data privacy to long-term business strategy.

The Hidden Side of AI: Building a Smarter Enterprise AI Solution

Everyone is talking about AI models, copilots, and large language engines. They’re certainly impressive, even transformative, but they’re only part of the story. The real power of AI depends on what’s happening behind the scenes. In enterprise environments, that hidden side of AI (the infrastructure, automation, and orchestration that make everything run) determines whether an AI strategy succeeds or fails. That’s where a smarter enterprise AI solution begins.

Densify Announces Kubex AI to Simplify and Democratize Resource Optimization

Densify has announced Kubex AI, a major leap forward in how organizations optimize complex Kubernetes and AI environments. This new solution combines verticalized AI for resource optimization with a conversational interface, empowering anyone—regardless of technical background—to access expert-level analytics and automation through simple, natural-language interactions.

Value Engineering Vs. Cost Cutting: The Value Paradox

For every CFO who’s ever asked, “Can we reduce engineering costs by 15%?” there’s a CTO quietly thinking, “But at what cost?” That tug-of-war plays out inside most SaaS companies. Finance wants to tighten budgets to protect margins, while engineering and product teams push for the resources to build, test, and innovate faster. The truth is, the teams that spend most efficiently — not necessarily the least — drive the highest returns.

Announcing HAProxy Unified Gateway (beta)

The continuous shift toward containerization means businesses are migrating more complex, mission-critical workloads to Kubernetes. This trend necessitates traffic management solutions that support diverse protocols (such as TCP, UDP, HTTP, and gRPC) and sophisticated organizational architectures, while delivering exceptional performance and efficiency.

GitKraken Desktop 11.6 Release: Shallow Cloned Repo Support

GitKraken Desktop 11.6 is here, shaped by real developer feedback. This release introduces support for shallow clones, a long-requested feature that improves performance for large repos and CI workflows. Simply open a shallow-cloned repo from the New Tab and you’re in. We’ve also made GitKraken AI more controllable and more contextual.

How Hyperscalers Use Credits to Keep You Hooked!

The hyperscaler model is built on bait: generous cloud credits for years, especially if you're VC-backed. But there's a serious catch. In this clip, Canopy's James Marks talks about the expected consequence of taking those "free" credits. It's not just about attracting customers; it's about deepening reliance on proprietary platform-native tools. It’s the ultimate vendor lock-in strategy, making it costly and complicated to break away later.

How companies in India are using Civo to improve their cloud costs and data sovereignty

As the Indian cloud market continues to grow, businesses are increasingly looking for ways to manage their cloud costs effectively while ensuring data sovereignty. At Civo, we've seen firsthand how our cloud and AI platform can help companies achieve these goals. In this blog, we'll explore how three of our customers in India - KubeNine, BeezLabs, and OpsMx - have leveraged Civo to improve their cloud costs and data sovereignty.

When Cloud Providers Have an Outage, Your Feature Flags Shouldn't

Cloud outage? Your flags should keep running. Harness Feature Management ensures seamless feature delivery with instant SDK fallback, local decisioning, and a globally distributed streaming architecture—no redeploys required. Over the past few weeks, the software industry has experienced multiple cloud outages that have caused widespread disruptions across hundreds of applications and services. When systems went down, the difference between chaos and continuity came down to architecture.

Harness AI October 2025 Updates: Smarter Pipelines, Instant Troubleshooting, and Memories

The AI Velocity Paradox is real. While teams are writing code faster than ever, they're hitting a wall downstream. Deployments are failing. Security vulnerabilities are slipping through. Manual toil is eating up whatever time developers saved with AI-assisted coding. The speed boost from one part of the software delivery lifecycle is being strangled by legacy processes in another. Harness is solving this the only way that works: by bringing intelligent AI deeper into the delivery process itself.

What Hiring Managers Look for in IT and DevOps Candidates

Are you hiring for a new position in your IT department? The search for the right IT talent has become a daunting task. You need to understand that every hire influences your efficiency and team dynamics. This will have a knock-on effect on how your organization can innovate. Candidates who stand out always possess two distinct qualities. They have the perfect combination of technical mastery and clear communication skills. This creates an immediate impact on your projects and processes.

Treat Game Localization as Code - DevOps Guide 2025

Every DevOps engineer has lived the nightmare. Launch day. 3 AM. The Korean build still says "Press X to Pay Respects". The fix requires re-exporting 42 Excel sheets, re-signing the build, and praying Apple approves before the internet explodes. 68% of delayed game updates in 2024 came from localization chaos (Game Dev Ops Report 2025). That's not just late patches - that's real revenue bleeding out.

IBM TechXchange 2025 Takeaways: Key Insights for IT Leaders

This year at IBM TechXchange 2025, we had the privilege of not only attending but also sponsoring the event and hosting a booth in the expansive expo hall. From the moment we arrived, it was clear: IBM’s ecosystem is thriving once again. Between the buzz of innovation, the depth of technical sessions, and the sheer energy of the crowd, TechXchange 2025 stood out as one of the most impactful IBM events in recent memory.

Don't pay for metrics, pay for change: A modern guide to engineering metrics

Businesses today have more access to information about their products and engineering teams than ever before, and the push to be data-driven is also at an all-time high. Engineering metrics can provide actionable insights that help accelerate technology and business impact.

OpenTelemetry Metrics in Quarkus Explained

When you run services on Quarkus, you need a steady stream of signals to understand how the application behaves—CPU trends, request timings, memory patterns, and how each endpoint responds under load. Metrics give you that visibility. They help answer questions like: OpenTelemetry fits well here because it gives Quarkus a common way to generate and export metrics without locking you into a specific monitoring tool.

Why 2025 & Beyond is The Builders Era

The tech world loves buzzwords. We’ve lived through the Cloud Era, the Mobile Era, the AI Era (we’re still in that one, apparently). But 2025 marks something different. Something developers have been craving for years but couldn’t quite name. Welcome to The Builders Era. Not because of some shiny new framework or yet another platform promising to 10x your productivity. The Builders Era is happening because developers are done being spectators in their own craft.

Improve Kubernetes reliability faster with Gremlin and Dynatrace

It’s now easier than ever to start testing Kubernetes with Dynatrace and Gremlin. With a new strategic integration, Kubernetes services set up in Dynatrace are automatically discovered in Gremlin to make testing setup simple and fast. At a time when AI is driving massive expansions in infrastructure and dramatically increasing deployment speed, being able to set up and test new services quickly is more important than ever.

What Happens When You Mix AI With Docker?

Discover how Docker is empowering developers in the GenAI era with tools that simplify AI application development. Docker VP of Product Michael Donovan shares how containers are critical for building, testing, and scaling GenAI applications, plus real solutions for the biggest challenges developers face today.

Building dbRosetta Using AI: Part 3, Creating a Database

The AI said I had to do a database first, not code. Who am I to argue? So, with all the prompts outlining the goals of the project, I’ve gone forward with the project, and step one is creating a PostgreSQL database on Azure. This is part three of a multi-part set of articles. I’ll move this list to the bottom of future articles: Part 1: Introducing the Concept of dbRosetta; Part 2: Defining the Project & Prompt Templates.

CloudZero: Making Kubernetes Costs Transparent And Actionable

Kubernetes is now the backbone of modern software infrastructure, helping teams deploy, scale, and manage applications efficiently across clouds. But when it comes to understanding costs, Kubernetes remains opaque. Teams often can’t answer basic questions like: How do you solve the gap between engineering usage and financial visibility? CloudZero’s new Kubernetes capabilities are built to address this challenge.

Data Center Vacancy Rates at an All Time Low: What Can You Do?

Data center vacancy rates in North America have hit record lows, with reports from CBRE and JLL indicating figures between 1.6% and 2.3% as of mid-2025. This is driven by exceptionally high demand from hyperscale and AI users, which is outstripping supply and leading to significant competition for space and power. The tight market is expected to continue through at least 2027, with preleasing of new construction at high levels.

Megaport and Latitude.sh: Bringing Compute and Connectivity Together

Megaport has entered into an agreement to acquire Latitude.sh, creating an industry-leading Compute and Network-as-a-Service platform to power high-performance applications and AI workloads globally. At Megaport, we’ve always believed infrastructure should be simple to use, powerful at scale, and flexible enough to follow the workloads that matter most.

Cloud Credits: The Hidden Lock-In Strategy Hyperscalers Use

In this 5-minute clip from our recent webinar, Canopy's James Marks exposes the most dangerous side-effect of the cloud credit model: the migration loop. Instead of building their product, companies spend months hopping between vendors to chase new credits, falling into a cycle of constant, costly re-architecting. Simon Hansford provides clear advice for the best companies: build your architecture for portability on day one. Restrict proprietary features to maintain optionality and avoid the "entrenched phase."

WordPress Vanilla vs Composer vs Bedrock - which wins?

WordPress powers over 40% of the web, but not all WordPress installations are created equal. Whether you're a solo developer, managing an agency, or overseeing hundreds of sites, the way you install and manage WordPress can make or break your workflow. In a recent live stream discussion, we dived deep into three popular WordPress installation methods: Vanilla, Composer-based, and Bedrock. Each approach has its merits, but which one should you choose? Let's break down the showdown.

Jira Service Management (JSM) Review for On-Call Management (2025)

OpsGenie is shutting down. And Atlassian recommends migrating to Jira Service Management (JSM). But if you’re not sure JSM is the right fit for your team’s on-call management needs, this review will help you decide. I signed up for JSM and put it through real-world testing. I created on-call schedules, rotations, and overrides. Then, I reviewed JSM’s on-call management across 4 key criteria. For each criterion, I shared what I liked and what I didn’t.

How Rootly works with Slack | An end-to-end demo.

Rootly is the AI-native on-call and incident management platform that helps you resolve incidents faster, improve system resilience, and streamline on-call operations. It’s your always-on SRE copilot that automates root cause analysis and identifies patterns that drive continuous improvement—trusted by thousands of companies like LinkedIn, NVIDIA, Replit, Elastic, Canva, Clay, Tripadvisor, and Grammarly.

Latency, Loneliness, and Laundry: A Practical Field Guide to Remote Ops That Actually Feels Good

Remote ops is weird. You're juggling alerts, releases, tickets - and five meters away there's a pile of laundry silently negotiating your willpower. You want focus without turning into a hermit. You want flexibility without drifting into 11 p.m. "just one more thing" spirals. And you want your team to feel like a team, not just avatars in a status channel. This guide blends human factors with ops pragmatism. Short, testable ideas. Minimal ceremony. A little empathy for the person behind the keyboard.

Sponsored Post

Preparing for cloud failures: Monitoring strategies for distributed hybrid infrastructure

When AWS experienced its recent outage, the ripple effect was immediate. Critical workloads slowed, dashboards went blank, and many teams realized multi-cloud isn't automatically resilient. Cloud-level failures are inevitable due to the interdependent components and complex IT architecture. The recent AWS disruption reminded many teams that the cloud isn't a magic uptime guarantee. Even the most mature providers can - and do - experience large-scale service interruptions.

Devart ODBC Drivers vs Free ODBC and JDBC: Key Comparison

Most teams never question the JDBC or ODBC drivers they use. If it connects, it’s “good enough.” That assumption can cost more than $14,000 per minute during an outage, according to EMA’s 2024 IT downtime benchmark. Drivers are more than connectors. They dictate how efficiently data moves between databases, applications, and analytics tools. When overlooked, the entire stack slows down. Breakdowns at this level lead to failed reports, missed deadlines, and avoidable downtime.

Service Observability, Service Operations and Service Orchestration: Unifying Visibility and Action Across the Enterprise

For large enterprises, the health and resilience of Business Services define customer experience and business reputation. Yet as technology estates grow in complexity, fragmented toolsets and siloed teams make it difficult to maintain service availability and prevent incidents before they impact the business and ultimately, customers.

What Is BigQuery? A Guide To How It Works And Costs

Data has exploded — and so have the challenges that come with it. Every click, transaction, and sensor ping generates mountains of data that traditional databases can’t handle. That’s why more than 94% of organizations now rely on cloud platforms, according to CloudZero’s 2025 cloud report. The goal isn’t just to store data, but rather, to make sense of it fast. And this is exactly where tools such as Google BigQuery step in.

Streamline Incident Management with the New Netdata-ServiceNow Integration

When a critical alert fires at 2 AM, the last thing your on-call engineer should be doing is manual administrative work. Yet, for many teams, that’s exactly what happens. You see the alert in your monitoring tool, then you have to switch contexts, open a new browser tab, log into your ITSM platform, and manually create an incident—all while your systems are failing.

Reliability lessons from the 2025 AWS DynamoDB outage

On October 19th and 20th, 2025, the AWS region US-EAST-1 suffered a massive outage. What started with a 3-hour Amazon DynamoDB outage from a DNS issue led to an Amazon EC2 outage that lasted an additional 12 hours before normal service was restored. Over the course of the outage, there were over 17 million outage reports as companies like Snapchat, Roblox, Amazon, Reddit, Venmo, and more were impacted.

New Feature Friday: AI Readiness and AI Maturity

Everyone wants to move faster with AI. But are you ready for it? In this Feature Friday, Jeff from Cortex shares how working with AI tools like Claude helped him write better code — and why true AI maturity starts with solid engineering hygiene. You’ll learn: “With great power comes great responsibility… and better tests.”

From rollouts to results: Unlocking the value of Feature Management and Experimentation

Unlock Faster, Safer Releases with Feature Management and Experimentation Learn how top engineering and product teams use Harness Feature Management & Experimentation (FME) to accelerate innovation, reduce release risks, and continuously deliver value. In this on-demand webinar, Harness experts Alex Bock and Iram Khan share how to go beyond feature flags to achieve smarter, data-driven releases. Discover how to.

3 Ways to Embed Digital Strategy into DevOps and IT Operations

Let's be honest: in most companies, the people who handle "digital strategy" and the ones who keep the systems running barely speak the same language. The strategy folks are talking about growth, engagement, customer journeys. The ops teams? They're buried in uptime reports, patch schedules, and incident tickets. Somewhere in the middle, the actual connection between the two gets lost.

Speedscale Proxymock: Freely testing cloud native apps alongside AI code assistants

We’ll always remember 2025 as the year AI code assistants went big. Copilot, Cursor, Claude, Windsurf, whatever. Developers went from mistrusting these tools to being expected to turn over much of their coding labor to them, even if, according to an extensive Stack Overflow survey, only 3 percent of professional developers say they ‘highly trust’ AI coding tools.

How to Optimize Azure Costs and Improve Cloud Efficiency with FinOps

In this episode of the FinOps on Azure podcast, Dustin Mullenix from KPMG talks about his role leading a FinOps team that handles Azure spend for KPMG's audit business. He shares how internal FinOps teams work with consulting groups, the challenges of Azure's service tiers like managed disks and SQL options, and the trade-offs between cost and performance. Dustin also discusses how to handle cost changes and manage knowledge in a big company. This talk is useful for anyone dealing with cloud costs in a large setup.

You're Late to the OpenTofu Party. Here's Why That's a Problem.

OpenTofu has emerged as the true open successor to Terraform, restoring transparency and community ownership after Terraform’s shift to a restrictive BSL license. With features like OCI registries, encryption at rest, and a public RFC process, it’s already outpacing Terraform’s innovation.

How to Use MetricFire Logging: Visualize Logs & Metrics Together in Grafana

Want full visibility into your systems? In this step-by-step tutorial, we show you how to use Grafana Loki with Promtail on Hosted Graphite by MetricFire to stream logs alongside your metrics, all visualized in Grafana dashboards. No more toggling between tools — get the full observability stack in one place.

Deploying Dgraph Clusters to Cycle

One of the best parts of my job is helping Cycle users explore self-hosting options on the platform. This time, I had the pleasure of working with Dgraph (now a part of Hypermode). If you haven't heard of it, Dgraph is a distributed, horizontally scalable graph database that gives you a native graph storage/compute engine with distributed ACID transactions (via Raft and snapshot isolation) and first-class GraphQL.

Building smarter with AI: Why legacy infrastructure is the biggest bottleneck

Josh Mesout (Chief Innovation Officer at Civo) took the main stage at Civo Navigate London 2025 to deliver a critical message: The AI revolution isn't just coming, it's here, and the way companies are built is changing faster than ever before. His session cut through the hype, delivering hard data on what separates the companies that scale AI from the ones that sink money into failed prototypes. The takeaway is blunt: The biggest threat to your AI ambition isn't the model; it’s your infrastructure.

The High Cost of Vendor Lock-In in Cloud Computing and How to Avoid it

Cloud vendor lock-in threatens agility and raises costs. Discover the high price of proprietary services, egress fees, and technical entrenchment, plus the strategic roadmap to escape. Learn how embracing open standards, Kubernetes, and an exit strategy from day one ensures long-term flexibility and control.

Introducing Braintrust by Cortex: Real conversations about engineering excellence

I'm excited to share that we're launching Braintrust by Cortex, a podcast I've been wanting to create for a while now. After years of building Cortex and working with engineering leaders across industries, I’ve noticed an opportunity for conversations that dig deeper into the daily work of building great engineering teams.

How Prometheus Exporters Work With OpenTelemetry

Running distributed systems means you need clear visibility into how your services behave. Prometheus has been the standard for metrics for a long time, and OpenTelemetry is now giving teams a more consistent way to collect telemetry across their stack. In many setups, you'll have both: existing Prometheus instrumentation that's already in place, and new components instrumented with OpenTelemetry.

Autonomous Self-Healing Capabilities for Cloud-Native Infrastructure and Operations

Modern cloud-native infrastructure was adopted to increase agility and scale, but as it grows in scale and complexity, engineering teams are now drowning in operational noise. Industry research (The State of Observability for 2024) reveals that 88% of technology leaders report rising stack complexity, while 81% say manual troubleshooting actively detracts from innovation.

Hyperview DCIM 5.2 Software Release

This release focuses on giving you more control over your infrastructure connections and ensuring your monitoring tools run smoother than ever. From enhanced circuit management and expanded search capabilities to optimized data collectors and advanced Modbus support, this update delivers practical improvements that make your day-to-day operations more efficient.

Hyperview DCIM 5.1 Software Release

This release is all about helping you move faster, see more, and manage your infrastructure with greater ease. From real-time polling and smarter layout tools to expanded support for DC power and new visual enhancements in rack views, this update is packed with practical improvements. Plus, with French language support and key bug fixes, it’s more accessible and reliable than ever.

Navigating the path from startup speed to enterprise scale | Braintrust by Cortex

(00:20) The founding journey: from zero to one hundred customers
(01:13) Day one: Building the first version of the service catalog
(04:25) Why speed is a startup's only superpower
(09:54) The mindset shift to enterprise-grade reliability and scale
(13:06) How quality becomes a competitive advantage
(14:46) High-leverage early decisions: writing tests and supporting on-prem
(17:38) Balancing speed and quality in the age of AI
(21:21) How AI will shift, not replace, engineering roles
(26:53) Advice for engineering leaders working with founders

Simple Talk Podcast - Coffee Chat with John Sterrett

Steve chats with John Sterrett, CEO of ProcureSQL, about his true love for data from a young age, how SQL Saturday and community events inspired him to start his own company, ProcureSQL’s use of AI to provide more value, and the impacts of work on relationships - plus much more!

Deploying a SolidStart app to Vercel with CircleCI

Deploying web apps can feel overwhelming. Multiple moving parts, including frameworks, hosting, databases, and automation tools, make a smooth, automated workflow seem impossible. But having an automated workflow is worth the effort; you can focus on building features and improving your app instead of worrying about manual deployments or server management.

Faster Approvals: ServiceNow + Kosli, How to Automate Compliance Evidence for Change Management

Are your deployments getting stuck waiting for approvals? If your code is ready but your ServiceNow change ticket isn’t — the bottleneck might not be people or process. It’s missing evidence. In this video, Matt Bailey (Product Manager, Kosli) shows how ServiceNow and Kosli work together to automate compliance evidence collection — turning manual approval bottlenecks into seamless, audit-ready change management.

Storage and Story: JFrog Artifactory + Kosli, How to Prove Where Your Artifacts Came From

Your artifact repository tells you what software is stored — but can it tell you how it got there, and who approved it? In this video, Matt Bailey (Product Manager, Kosli) shows how JFrog Artifactory and Kosli work together to give you both storage and story. While Artifactory stores your binaries, Kosli automatically builds an immutable chain of custody — recording every commit, build, test, and deployment that led to your artifact in production.

The Ultimate Azure FinOps Guide: From Visibility to Optimization

In this episode of the FinOps on Azure podcast, host Michael chats with Anderson Oliveira, author of "FinOps Ultimate Guide for Azure." Anderson, a chemical engineer turned IT project manager, explains his journey into FinOps and breaks down his book, which focuses on organizing teams, measuring costs, tracking value, and building efficient practices on Azure. They discuss common challenges like stakeholder friction, the shift from CapEx to OpEx, and why focusing on value over just cutting costs is key for long-term success.

You Can't Fix What You Don't Measure: Observability in the Age of AI with Conor Bronsdon

Only 50% of companies monitor their ML systems. Building observability for AI is not simple: it goes beyond 200 OK pings. In this episode, Sylvain Kalache sits down with Conor Bronsdon (Galileo) to unpack why observability, monitoring, and human feedback are the missing links to making large language models (LLMs) reliable in production.

Access Real-Time Infrastructure Data With Puppet Infra Assistant #puppet #itautomation #aiops

Puppet Infra Assistant gives you easy access to real-time infrastructure data so you can make faster, data-driven decisions. Be future-ready and effortlessly access vital infrastructure data with Puppet Infra Assistant; no Puppet experience required.

Easy JIRA Automation Anyone Can Do in Minutes

Tired of manually updating Jira issues after every change? Want to avoid the dreaded "what's the status of XYZ ticket" question? In this video, we’ll show you how to automate Jira updates using Smart Commits and Jira Automation Rules — so your issues always stay up to date without you lifting a finger. We’ll cover what they are and how they work, then walk through how to incorporate them into your workflows.

Azure VM utils now included in Ubuntu: boosting cloud workloads

Ubuntu images on Microsoft Azure have recently started shipping with the open source package azure-vm-utils included by default. Azure VM utils is a package that provides essential utilities and udev rules to optimize the Linux experience on Azure. This change results in more reliable disks, smoother networking on accelerated setups, and fewer tweaks to get things running. Here’s what you need to know.

Replication Job Monitoring Support in Redgate Monitor

Whether it’s a stalled Log Reader Agent, a conflicting insert on the subscriber, or a failed cleanup job bloating the distribution database, Redgate Monitor now brings SQL Server replication issues to light early, before performance or reliability are affected. In many SQL Server environments, replication remains essential for offloading reporting and analytics workloads, or for maintaining local and synchronized data copies across regions.

Densify Releases New MCP Server to Bring AI-Driven Resource & GPU Optimization to Platform Teams

As excitement builds for KubeCon North America 2025 in Atlanta, Densify has released its latest innovation for Kubernetes and AI-driven infrastructure resource management: the Densify Model Context Protocol (MCP) Server. This new capability enables organizations to securely integrate Densify’s Kubex resource optimization intelligence directly into popular LLM-powered tools — including ChatGPT, Claude, Cursor, and Gemini CLI.

Building dbRosetta Using AI: Part 2, Defining the Project & Prompt Templates

This is the next installment of the series on building a database and an application called dbRosetta using AI/LLM. Part 1 introduces the concept. THE AI PICKED DATABASE FIRST! Look, I talk databases at this thing a lot, so it probably knows my own preference, but when I asked it, it chose to build a database separate from the code. Let’s get into it.

What Are AI Guardrails

When you're shipping LLM features, a lot of the work goes into keeping the model's behavior predictable. Questions about what goes into the model and what comes back are everyday concerns when you integrate LLMs into production systems. Guardrails AI provides a Python framework that helps you enforce those expectations. You define the schema or constraints you need, and the framework validates both the inputs going into the model and the outputs coming back.

AI Eliminates Pollution Risk: Oxford's Digital Contrast, Powered by Civo.

The future of medicine is here: Oxford's digital contrast AI is powered by Civo! Watch as Regent Lee, Professor at the University of Oxford and moonshot engineer, reveals a revolutionary solution to healthcare’s biggest hidden problem. Radiology currently accounts for 1% of global carbon emissions, with a single PET CT scan generating up to 60 kg of carbon, while forcing patients to endure long waits and chemical injections. Old habits cause slow systems.

The AI Knowledge Agent: Making Internal Developer Portals Smarter

AI is generating more code than ever, but delivery hasn’t kept pace. The Harness IDP Knowledge Agent helps teams close that gap by turning their internal developer portal into an intelligent platform for faster, safer software delivery. Joining a new engineering team can be exciting, but it can also be overwhelming. You spend the first few days figuring out what each service does, where documentation lives, and who owns what.

Running Chaos Engineering on GKE Autopilot Just Got Easier

Harness Chaos Engineering now runs natively on GKE Autopilot. A simple allowlist configuration enables you to test resilience on Google's managed Kubernetes without sacrificing security or requiring workarounds. Google's GKE Autopilot provides fully managed Kubernetes without the operational overhead of node management, security patches, or capacity planning. However, running chaos engineering experiments on Autopilot has been challenging due to its security restrictions. We've solved that problem.

SharePoint Storage Limit Warning

When your Microsoft 365 tenant reaches the SharePoint storage limit, the impact is immediate. File uploads start failing, Teams sites stop provisioning, indexing slows down, and storage overage charges begin applying automatically. For organisations storing large volumes of documents, drawings, media files, or project data, hitting the SharePoint capacity threshold can become a recurring and expensive problem—especially when underlying retention policies prevent deletion.

Harness recognized as a top partner in software delivery success

We’re thrilled to share some exciting news: Harness has once again been named to Inc.’s 2025 Power Partners list, recognizing companies that go above and beyond to help their customers succeed. This marks our second year in a row on the list — and while awards are always nice, this one feels special because it celebrates something deeper than technology. It’s about the relationships, trust, and collaboration that power real outcomes for our customers every day.

Do You Need DCIM Software If You Already Use a BMS?

A Building Management System (BMS) is commonly used in data centers to monitor and control the facility’s mechanical, electrical, and environmental systems. With a BMS in place, it’s reasonable to ask: do you still need Data Center Infrastructure Management (DCIM) software? The short answer is yes—DCIM and BMS serve occasionally overlapping but fundamentally different purposes.

Audit Ready Artifacts: JFrog Artifactory + Kosli, How to Prove Where Your Artifacts Came From

Your artifact repository tells you what software is stored — but can it tell you how it got there, and who approved it? In this video, Matt Bailey (Product Manager, Kosli) shows how JFrog Artifactory and Kosli work together to give you both storage and story. While Artifactory stores your binaries, Kosli automatically builds an immutable chain of custody — recording every commit, build, test, and deployment that led to your artifact in production.

What we learnt about digital sovereignty at Civo Navigate London 2025

The concept of digital sovereignty has become increasingly important in today's technology-driven world. As organizations rely more heavily on cloud services and artificial intelligence (AI), they face new challenges in maintaining control over their data and IT resources. At Civo Navigate London, we brought together industry leaders to discuss the topic of digital sovereignty and its implications for the cloud industry.

When Automation Finally Flows: Eliminating the Layers Between AI and IT

Enterprises like yours have sunk considerable time and money into trying to stitch together automation. This pattern is always the same: someone buys a variety of IT tools, and a small team of specialists spends months wiring them together. One tool tries to understand human language while another launches a workflow. Between them sits a tangle of mappings, connectors, triggers, and “custom glue” that only one engineer understands. That’s not automation, though.

Incident Management and Response

In this video, discover how Cortex transforms incident management by automating key processes, reducing response times, and providing real-time visibility into your engineering ecosystem. With seamless integrations and AI-powered insights, Cortex helps teams go from reactive to proactive, improving reliability and accelerating recovery.

How to Improve Your Microsoft ExpressRoute Resilience with Megaport Connectivity

Improve ExpressRoute reliability with these deployment models and strategies for stronger cloud resilience, powered by Megaport. Every year, businesses become even more reliant on their network for the success of their entire operations. For the 350,000+ companies using Microsoft Azure, building resilient, reliable network connectivity to this service is essential.

How the CLI Simplifies Git Workflows

Managing 50 repos with different Git workflows is like conducting an orchestra where every musician is playing a different song. We built GitKraken CLI to fix this: multi-repo actions, a single command for all things Git, and team-wide standardization that doesn't sacrifice speed. Watch the full video on how we're simplifying complex workflows without dumbing down Git.

Grafana Tempo: Setup, Configuration, and Best Practices

As systems grow, understanding how a request moves across multiple services becomes harder. Traces help bring this picture together by showing the exact path a request takes, along with the timings that matter. Grafana Tempo is built for this kind of workload. It stores traces efficiently, works well with OpenTelemetry, and keeps the operational overhead low.

Orbital Materials: WorldClass AI Models Built on CivoStack

Daniel Miodovnik, COO of Orbital Materials, explains how the CivoStack enables world‑class AI models that outperform the big‑tech giants. He outlines the power draw and cooling of megawatt‑scale GPU racks, the water and CO₂ intensity of today’s data centres, and why a sovereign, Civo‑based solution is the key to speed and predictable costs.

What Is AWS Step Functions? A Complete Guide

Imagine you are building an e-commerce app. Every time a customer places an order, a lot happens behind the scenes. For example, you need to charge their card, update inventory, create a shipping label, and send a confirmation email. You could try to write one giant program that does everything in the correct order, but that quickly becomes a tangled mess — especially if something fails halfway through (say, payment succeeds but inventory update fails).
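The partial-failure problem described above (payment succeeds, inventory update fails) can be sketched in plain Python. This is only an illustration of running ordered steps and undoing the completed ones on failure; it does not use AWS Step Functions or its state machine language, and every name in it (`run_workflow`, `charge_card`, `refund_card`, the order ID) is hypothetical.

```python
# Hypothetical sketch: run ordered steps, and on failure undo completed steps in reverse.
log = []

def charge_card(order):
    log.append("charged")

def refund_card(order):
    log.append("refunded")

def update_inventory(order):
    # Simulate the failure halfway through the order flow.
    raise RuntimeError("inventory service unavailable")

def restore_inventory(order):
    log.append("inventory restored")

def run_workflow(steps, order):
    """Each step is (name, action, compensation). Returns a status dict."""
    completed = []
    for name, action, compensate in steps:
        try:
            action(order)
        except Exception as exc:
            # Undo only the steps that actually finished, most recent first.
            for undo in reversed(completed):
                undo(order)
            return {"status": "failed", "failed_step": name, "error": str(exc)}
        completed.append(compensate)
    return {"status": "succeeded"}

steps = [
    ("charge_card", charge_card, refund_card),
    ("update_inventory", update_inventory, restore_inventory),
]
result = run_workflow(steps, {"order_id": "hypothetical-123"})
print(result["status"], result["failed_step"], log)
```

Even in this small form, the ordering, error handling, and rollback logic are interleaved with the business steps, which is the tangle the paragraph above warns about.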

Announcing CloudZero's Oracle Cloud Connector: Real Cost Intelligence For AI And High-Performance Workloads

For years, enterprises have turned to Oracle Cloud Infrastructure (OCI) for what it does best: powering mission-critical applications with unmatched performance, security, and predictable economics. OCI has historically staked its reputation on being the go-to platform for organizations running complex, data-intensive workloads, from core databases and ERP systems to large-scale compute clusters, while putting extra focus on security and predictable pricing.

Developer Onboarding

Welcome to the future of developer onboarding with Cortex. In this demo, you’ll see how Cortex helps new engineers ramp up faster by giving them instant access to everything they need in one place: context, ownership, best practices, and workflows. With Cortex, onboarding becomes a structured, data-driven, and empowering experience. Developers can explore your ecosystem confidently, follow golden paths, and start delivering value immediately, reducing ramp-up time from months to days.

MTBF, MTTR, MTTF, MTTA: Incident Metrics Explained

Incidents are inevitable. What matters is how you manage them: how you detect, respond to, and resolve them. A robust incident management process relies on data, not guesswork. Incident management metrics like MTBF, MTTR, MTTF, and MTTA provide measurable insight into reliability, response time, and recovery performance. When used together, they help identify weaknesses, reduce downtime, and build more resilient systems.
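As a rough illustration of how three of these metrics can be computed, here is a minimal Python sketch. It assumes common definitions that the article may refine: MTTA as the mean time from incident start to acknowledgment, MTTR as the mean time from start to resolution, and MTBF as total uptime in an observation window divided by the number of failures. MTTF is left out because it is usually applied to components that are replaced rather than repaired. The `Incident` record and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    started: datetime
    acknowledged: datetime
    resolved: datetime

def mean(deltas):
    """Average a sequence of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)

def mtta(incidents):
    # Mean time to acknowledge: start -> acknowledgment.
    return mean(i.acknowledged - i.started for i in incidents)

def mttr(incidents):
    # Mean time to resolve: start -> resolution (one of several MTTR definitions).
    return mean(i.resolved - i.started for i in incidents)

def mtbf(incidents, window_start, window_end):
    # Mean time between failures: uptime in the window / number of failures.
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    return uptime / len(incidents)

# Hypothetical sample data: two incidents in a 30-day window.
incidents = [
    Incident(datetime(2025, 1, 5, 10, 0), datetime(2025, 1, 5, 10, 5), datetime(2025, 1, 5, 11, 0)),
    Incident(datetime(2025, 1, 20, 14, 0), datetime(2025, 1, 20, 14, 15), datetime(2025, 1, 20, 16, 0)),
]
window_start, window_end = datetime(2025, 1, 1), datetime(2025, 1, 31)

print("MTTA:", mtta(incidents))
print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents, window_start, window_end))
```

With this sample data the window contains 720 hours and 3 hours of downtime, so MTBF works out to 717 / 2 = 358.5 hours.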

Harness patent for hybrid YAML editor enhances CI/CD workflows

Harness earned a patent for its unified pipeline editor, which makes it easy to configure pipelines whether they are for CI, CD, IaC, database migrations, service onboarding, or other DevSecOps activities. We're thrilled to share some exciting news: Harness has been granted U.S. Patent US20230393818B2 (originally published as US20230393818A1) for our configuration file editor with an intelligent code-based interface and a visual interface.

SRE vs DevOps vs Platform Engineering: What Are the Key Differences

Software delivery is more complex than ever. Teams need speed, reliability, and scalability to stay competitive. Site Reliability Engineering (SRE), DevOps, and Platform Engineering are three key disciplines that address these challenges. Though these terms are often used together, they are not the same and differ in important ways. In this blog, we’ll discuss each term individually, compare SRE vs. DevOps vs. Platform Engineering, and also show how they work together.

Observability vs. Monitoring: What's the Difference?

Modern systems are complex, distributed, and fast-changing, so keeping them reliable requires more than watching dashboards. This comparison of observability and monitoring explains how teams gain the deep insight needed to detect, diagnose, and resolve issues. Monitoring collects predefined metrics and alerts you to known problems, while observability provides rich, contextual telemetry to investigate unknown failures.

When Breaches Expose Your Secrets: Why Automation is the Key to Fast, Scalable Remediation

In early October, Red Hat disclosed a breach of a GitLab system used by its Consulting division. Threat actors claim to have exfiltrated hundreds of gigabytes of project data — and while investigations are still underway, reports suggest consulting engagement artifacts may have been impacted. For the organizations involved, the concern isn’t limited to reputational damage.

Stop Getting Charged for Test Emails! #speedscale #aws #ses

Tired of local development using AWS SES leading to spam, cloud costs, and unnecessary API calls? When testing your app, you shouldn't have to connect to a live cloud environment just to send a test email. Learn how to set up your own local ProxyMock server to intercept and record real SES calls, so you can replay them instantly and accurately without ever leaving your desktop.

MOCK AWS SES Locally! Stop Sending Test Emails & Cut Cloud Costs

In this quick guide, Speedscale's Matt LeRay shows you how to free your local development environment from direct AWS SES dependencies. When your application sends an email during local testing, it usually triggers a live AWS transaction, leading to slow tests, unnecessary cloud costs, and sometimes even spam filter issues.

Compliance Under the Microscope

I wanted to share a story of a recent engagement with a law firm to highlight the strategic importance of compliance in today’s legal sector. It started with a single email. A mid-sized law firm received a regulator’s request for evidence following a client complaint. The issue wasn’t malpractice; it was a missed filing deadline caused by a system slowdown. The firm had no audit trail to prove the delay was technical, not procedural.

How to Optimize GPU

The Problem: AI workloads are dynamic, unpredictable, and expensive. Data prep can choke your pipeline, training jobs hog GPUs without awareness, and inference, the most latency-sensitive phase, is notoriously hard to scale efficiently. Worse, traditional infrastructure tools treat GPUs as a static commodity, ignoring model intent, workload shape, and sharing capabilities.

How to build the ideal engineering team dashboard

Most developers spend too much time digging through tabs and switching between tools, rather than actually writing code. According to an IDC survey, only 16% of their week goes to coding, while the rest is lost to what researchers call “organizational inefficiencies” – all those little things that slow teams down.

New Devart Python Connectors Add Broader Compatibility and Stronger Security

We are thrilled to announce a major update to our Python Connectors line. The release adds support for Python 3.14, PostgreSQL 18, MySQL 9, and MariaDB 12, introduces modern authentication and security options, and delivers notable performance gains across several connectors.

How IT teams can finally break free from manual AD management

If there’s one thing every IT leader can agree on, it’s this: Manual Active Directory (AD) management never ends. There’s always one more access request, one more approval chain, and one more audit reminder flashing on your screen. By the time you’ve closed your last ticket of the day, there’s already another one waiting. For many teams, 2025 became the year of “we’ll automate next quarter.” But next quarter came and went without any automation.

OTel Updates: Declarative Config - A Steadier Way to Configure OpenTelemetry SDKs

Application configs change over time, often in small ways that are easy to miss. They may start simple — a few environment variables, one exporter, nothing unexpected. As your instrumentation grows, you add rules for filtering health check spans, adjust sampling based on attributes, or introduce environment-specific resource settings. Each change makes sense on its own. But months later, the picture can look different across dev, staging, and production.

Managing Alerts: Car Alarms and Smoke Alarms

Building and shipping an application is exciting: you watch your idea come alive and reach users. But once it’s out there, your real job begins: keeping it alive. An app in production isn’t just code running; it’s a living system. It needs monitoring to stay healthy and alerting to warn when something’s off. But there’s a catch: too few alerts, and you’ll miss real issues; too many, and you’ll drown in noise.

The Outage Anxiety Test: Can You Answer These 3 Questions In Under 10 Minutes?

On Oct. 20, the Internet woke up and seemingly chose violence. For more than 12 hours, Amazon Web Services (AWS) was down. From banking platforms to hospital communications to mobile ordering apps, digital services came to a screeching halt. The cause? Two programs tried to write a DNS entry simultaneously, failed, and left the entry blank. Thus began the incredibly costly failure cascade.

AI And Sustainability: Measuring The Impact Of The Generative AI Boom

Before 2022, Alex Hanna worked on Google’s Ethical AI team. Today, she’s the director of research at the Distributed AI Research Institute, a transition sparked by Google’s handling of a paper exposing AI’s growing environmental footprint. So, how bad is it, really? That depends on who you ask. Take Jesse Dodge, a senior research analyst at the Allen Institute for AI. Jesse told NPR that a single ChatGPT query can use as much electricity as keeping a light bulb on for 20 minutes.

Rovo AI: Create Work Items from Loom | Demo Den | Atlassian

Ever wish you could turn a quick Loom recording into Jira work items without all the manual typing? Now you can! In this Demo Den episode, Pierre walks through a new Rovo AI feature that automatically converts your Loom videos into actionable Jira work items. Whether you're recording bug reports, feature requests, or project updates, Rovo handles the data entry for you. Pierre covers turning Loom videos into work items with Rovo and how it works in your AI-enabled Jira instance.

Why AI Coding Assistants Fail (And How to Fix Them)

Why do developers stop using AI coding assistants? According to Carnegie Mellon research, the top reason is unhelpful suggestions. Tabnine's Principal Architect John Feeney explains how context transforms AI coding tools from generic to genuinely useful. Learn the 4 Cs framework for maximizing AI assistant value: Context (workspace indexing), Connection (repo integration), Coaching (rules-based guidance), and Customization (fine-tuning). Discover how Retrieval Augmented Generation (RAG) helps AI understand your codebase, not just open source patterns.

Building dbRosetta Using AI: Part 1 of Many

Like many of you, over the last couple of years, I’ve been using AI, or, well, let’s just name it appropriately, Large Language Models (LLM), as a part of my job. I’ve also used it in my hobby. With it, I’ve generated snippets of code, tested data conversions, even built a small database for a presentation. However, to date, I haven’t tried doing everything through the LLM. Now, I’m going to.

Intent-Driven Assertions are Redefining How We Test Software

Traditional UI testing struggles to keep up with rapid design and workflow changes, often focusing on brittle selectors rather than user outcomes. Harness AI Test Automation introduces intent-driven, natural language assertions that understand what teams want to verify, not just how tests are written.

Enterprise data centre security solutions: scaling securely for growth and resilience

Securing a data centre requires multiple layers of protection. Physical access controls, surveillance, and network safeguards reinforce one another to prevent disruption. As estates expand and workloads increase, those measures have to scale. If they don’t, gaps appear in both resilience and compliance. A data centre security solution must therefore protect infrastructure day to day while adapting to future requirements. Pulsant delivers this through an integrated framework.