The latest News and Information on DevOps, CI/CD, Automation and related technologies.

Inside Atlassian's Merge Queues: How we ship faster with fewer incidents

At Atlassian, we use Merge Queues to ship frequent changes with confidence and streamline pull request merges. Across some of our busiest codebases, Merge Queues have sharply reduced incident frequency and turned merging from a stressful bottleneck into a background task. Today, most of our largest repositories rely on Merge Queues: more than 70 large repos across products like Jira, Rovo, Trello, and others have safely landed 30,000 pull requests since adopting Merge Queues Beta last quarter.

End-to-End Trace Propagation Across SQS and Lambda with OpenTelemetry

SQS doesn't propagate trace context automatically. You instrument both sides, deploy, and get two disconnected traces. This post shows how to wire them into one waterfall, and covers the ESM format gotcha that silently breaks it every time. Prathamesh works as an evangelist at Last9, runs SRE stories, where SRE and DevOps folks share their stories, and maintains o11y.wiki, a glossary of all terms related to observability.
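
As a rough illustration of the idea, and not the linked post's actual code, the sketch below carries a W3C `traceparent` value in an SQS message attribute and reads it back on the consumer side. It uses only the standard library rather than the OpenTelemetry SDK, and the helper names (`new_traceparent`, `inject`, `extract`) are hypothetical. One detail worth noting: the send API takes attributes keyed `DataType`/`StringValue`, while the records Lambda receives use `dataType`/`stringValue`. Whether this is the gotcha the post refers to is an assumption.

```python
import re
import secrets

# W3C Trace Context header: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(trace_id=None, sampled=True):
    """Build a traceparent value for the current (producer) span."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)                # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def inject(traceparent):
    """Producer side: attribute dict in the shape the SQS send API expects,
    e.g. passed as MessageAttributes to boto3's send_message."""
    return {"traceparent": {"DataType": "String", "StringValue": traceparent}}


def extract(record):
    """Consumer side: read the context from one Lambda SQS event record.
    Lambda delivers attributes under lowerCamelCase keys."""
    attr = record.get("messageAttributes", {}).get("traceparent")
    if not attr:
        return None
    match = TRACEPARENT_RE.match(attr.get("stringValue") or "")
    if not match:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


if __name__ == "__main__":
    tp = new_traceparent()
    sent = inject(tp)
    # Simulate the record shape Lambda would hand to the consumer.
    record = {
        "messageAttributes": {
            "traceparent": {
                "stringValue": sent["traceparent"]["StringValue"],
                "dataType": "String",
            }
        }
    }
    print(extract(record))
```

In a real setup the extracted trace ID and parent span ID would be handed to the tracing SDK so the consumer span joins the producer's trace instead of starting a new one.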

How to run a proof of concept that de-risks your monitoring decision

Part 3, key insights from a fireside chat with Chris Yates. Read part 1 here, and part 2 here. Most database monitoring proofs of concept (POCs) answer the wrong questions. Here's how to structure a proof of concept that genuinely de-risks your vendor decision, along with the questions to ask during the process. A POC is often treated as the final hurdle in vendor evaluation, but too often it becomes theatre: a guided tour of the flashiest features, run by one person, under unrealistic conditions.

Building for Resilience: An Engineering Guide to the Mythos Era | Harness Blog

The release of Anthropic Mythos and Project Glasswing marks an exciting and pivotal new chapter in software development. As the industry advances, the speed and economics of vulnerability exploitation have fundamentally shifted. What once took weeks of manual reconnaissance can now be scaled rapidly through automated models. However, this is not just a security problem to solve. It is a massive engineering opportunity to build cleaner, more robust systems.

Infrastructure as Code Management: Terragrunt & Multi-IaC | Harness Blog

What happens when your Infrastructure as Code management strategy works perfectly in dev, scales reasonably well in staging, and then quietly fractures across seventeen production workspaces because nobody documented which Terragrunt wrapper goes with which AWS account? You spend Friday afternoon reverse-engineering DRY patterns that made sense six months ago, wondering why your team is managing three different IaC execution engines with four incompatible workflow philosophies.

Five questions your platform evaluation is missing

Years back I sat in on a platform evaluation with a customer who spent forty-five minutes of the meeting focusing on one thing: their custom PHP content management system. They had opinions about the CMS. Strong opinions. They had benchmarks, a migration plan, a proof of concept. They had a diagram. They had questions about the deployment pipeline for this CMS that were, for a single application, more thoroughly considered than most organizations' entire infrastructure strategies.

Why do you need incident alerting? (And why monitoring alone isn't enough)

Monitoring tools track what’s happening across your systems and send a Slack message or email when something looks off. But they don’t call anyone and they don’t escalate the incident. If that Slack message goes unseen at 3 AM on a Saturday, the incident just sits there until someone opens their dashboard. Incident alerting fills this gap. When an incident triggers, it contacts the right person directly through a phone call or their preferred channel.