Operations | Monitoring | ITSM | DevOps | Cloud

Why Your Telemetry(Observability) Pipelines Need to be Responsive

At Mezmo, we consider Understand, Optimize, and Respond, the three tenets that help control telemetry data and maximize the value derived from it. We have previously discussed data Understanding and Optimization in depth. This blog discusses the need for responsive pipelines and what it takes to design them.

How Network Observability Helps Lay the Foundation of Autonomous IT Operations

We often hear the term "observability" in the context of DevOps and how SREs use telemetry data. Collecting and analyzing this telemetry data is a vital first step to a successful autonomous IT operations strategy. Observability can help you find out about problems in your system you didn’t know you had—and before your users are impacted—by giving you new visibility that your monitoring systems don’t provide. But any observability initiative must also include network observability.

The CoPE and Other Teams, Part 1: Introduction & Auto-Instrumentation

The CoPE is made to affect, meaning change, how things work. The disruption it produces is a feature, not a bug. That disruption pushes things away from a locally optimal, comfortable state that generates diminishing returns. It sets things on a course of exploration to find new terrains which may benefit it more—and for longer.

Making The Case for Continuous Observability

Software complexity grows exponentially, developer efficiency grows far slower. And debugging often takes up 20-50% of development time. More complex, connected systems means increased data flow at the edge, and in the cloud. That leads to increased exposure to vulnerabilities, cyber threats, malfunctions, and bugs with risks that are hard to assess.

Transform and enrich your logs with Datadog Observability Pipelines

Today’s distributed IT infrastructure consists of many services, systems, and applications, each generating logs in different formats. These logs contain layers of important information used for data analytics, security monitoring, and application debugging. However, extracting valuable insights from raw logs is complex, requiring teams to first transform the logs into a well-known format for easier search and analysis.

Get granular LLM observability by instrumenting your LLM chains

The proliferation of managed LLM services like OpenAI, Amazon Bedrock, and Anthropic have introduced a wealth of possibilities for generative AI applications. Application engineers are increasingly creating chain-based architectures and using prompt engineering techniques to build LLM applications for their specific use cases.

Destroy on Friday: The Big Day A Chaos Engineering Experiment - Part 2

In my last blog post, I explained why we decided to destroy one third of our infrastructure in production just to see what would happen. This is part two, where I go over the big day. How did our chaos engineering experiment go? Find out below!

Streamlining Debugging with Lightrun Snapshots: A Superior Alternative to Trace Logging

According to a recent study, failing tests alone cost the enterprise software market an astonishing $61 billion annually. This figure mirrors the vast number of resources devoted to rectifying software failures, translating into about 620 million developer hours lost each year. On average, engineers spend 13 hours to resolve a single software failure, a statistic that paints a stark picture of the current state of debugging efficiency.

What Makes for a 'Good' Pair Programming Session?

Software changes so rapidly that developing on the cutting edge of it cannot fall to a single person. When it comes to asynchronously disseminating information about projects, code comments, PR conversations, Slack, RFCs, and other investigatory documents do a wonderful job, but no amount of async communication replaces the magic of two brains bouncing ideas off of each other.