Operations | Monitoring | ITSM | DevOps | Cloud

Latest Posts

Why Every Engineering Team Should Embrace AWS Graviton4

Two years ago, we shared our experiences with adopting AWS Graviton3 and our enthusiasm for the future of AWS Graviton and Arm. Once again, we're privileged to share our experiences as a launch customer of the Amazon EC2 R8g instances powered by AWS Graviton4, the newest generation of AWS Graviton processors. This blog elaborates our Graviton4 preview results including detailed performance data. We've since scaled up our Graviton4 tests with no visible impact to our customers.

Modern Observability in Action at the University of Oxford

The Bennett Institute for Applied Data Science at the University of Oxford is pioneering the better use of data, evidence, and digital tools in healthcare, policy, and beyond. The institute employs an open-source approach with its OpenSAFELY analytics platform, enabling high-impact research that yields actionable insights, drives innovation, and enhances lives globally.

The Hater's Guide to Dealing with Generative AI

Generative AI is having a bit of a moment—well, maybe more than just a bit. It’s an exciting time to be alive for a lot of people. But what if you see stories detailing a six month old AI firm with no revenue seeking a $2 billion valuation and feel something other than excitement in the pit of your stomach? Phillip Carter has an answer for you in his recent talk at Monitorama 2024. As he puts it, “you can keep being a hater, but you can also be super useful, too!”

Unlocking Smiles: HappyCo's Observability Success

With a diverse range of applications, HappyCo sought to advance their system investigations with a modern observability solution while embarking on an application refactor project. Since its start in 2011, HappyCo has experienced rapid growth through both organic expansion and strategic acquisitions. As a result, the company has a diverse range of applications for customers to smile about.

Navigating Software Engineering Complexity With Observability

In the not-too-distant past, building software was relatively straightforward. The simplicity of LAMP stacks, Rails, and other well-defined web frameworks provided a stable foundation. Issues were isolated, systems failed in predictable ways, and engineers had time to innovate on new features for the business. And it was good.

OpenTelemetry Best Practices #3: Data Prep and Cleansing

Having telemetry is all well and good—amazing, in fact. It’s easy to do: add some OpenTelemetry auto-instrumentation libraries to your stack and they’ll fill your disks with data pretty quickly. However, having good telemetry data—data that’s curated into being useful—is something that is both cost-effective and represents good value.

Investigating Mysterious Kafka Broker I/O When Using Confluent Tiered Storage

Earlier this year, we upgraded from Confluent Platform 7.0.10 to 7.6.0. While the upgrade went smoothly, there was one thing that was different from previous upgrades: due to changes in the metadata format for Confluent’s Tiered Storage feature, all of our tiered storage metadata files had to be converted to a newer format.

Independent, Involved, Informed, and Informative: The Characteristics of a CoPE

As our Field CTO Liz Fong-Jones says, production excellence is important for cloud-native software organizations because it ensures a safe, reliable, and sustainable system for an organization’s customers and employees. A CoPE helps organizations cultivate the practices and tools necessary to achieve that consistently. In part one of our CoPE series, we analogized the CoPE with safety departments.

Virtualizing Our Storage Engine

Our storage engine, affectionately known as Retriever, has served us faithfully since the earliest days of Honeycomb. It’s a tool that writes data to disk and reads it back in a way that’s optimized for the time series-based queries our UI and API makes. Its architecture has remained mostly stable through some major shifts in the surrounding system it supports, notably including our 2021 implementation of a new data model for environments and services.

Establishing and Enabling a Center of Production Excellence

Software is in a crisis. This is nothing new. Complex distributed systems are perpetually in a state far from equilibrium, operating in what Richard Cook has called a “degraded mode.” It’s through a combination of technical artifacts, organizational practices and policies, and pure gumption that they manage to maintain themselves through time. However, there are some organizations that seem to have an easier time of it than others.