Datadog

Best practices for monitoring ML models in production

Regardless of how much effort teams put into developing, training, and evaluating ML models before deployment, model performance inevitably degrades over time due to several factors. Unlike with conventional applications, even subtle trends in the production environment a model operates in can radically alter its behavior. This is especially true of more advanced models that use deep learning and other non-deterministic techniques.
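As a rough illustration of the kind of degradation described above, the sketch below compares a feature's training-time distribution against a recent window of production values using a two-sample Kolmogorov-Smirnov test. The feature values, window size, and significance threshold are illustrative assumptions, not anything prescribed by the post.

```python
# Minimal drift-check sketch: compare a feature's training distribution
# against a recent production window. The data, window size, and threshold
# below are illustrative assumptions only.
import numpy as np
from scipy.stats import ks_2samp


def feature_drifted(train_values: np.ndarray,
                    prod_values: np.ndarray,
                    alpha: float = 0.01) -> bool:
    """Return True if the production sample appears to come from a different
    distribution than the training sample (two-sample KS test)."""
    stat, p_value = ks_2samp(train_values, prod_values)
    return p_value < alpha


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    train = rng.normal(loc=0.0, scale=1.0, size=10_000)  # training-time distribution
    prod = rng.normal(loc=0.4, scale=1.0, size=2_000)    # subtly shifted production data
    print("drift detected:", feature_drifted(train, prod))
```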

Lessons learned from running a large gRPC mesh at Datadog

Datadog’s infrastructure comprises hundreds of distributed services, which are constantly discovering other services to network with, exchanging data, streaming events, triggering actions, coordinating distributed transactions involving multiple services, and more. Implementing a networking solution for such a large, complex application comes with its own set of challenges, including scalability, load balancing, fault tolerance, compatibility, and latency.
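To make a couple of those challenges concrete, here is a hedged sketch of the client side of the problem: a gRPC channel that resolves a service over DNS, spreads calls across the returned backends with the built-in round_robin policy, and uses keepalive pings to detect dead connections. The target name and tuning values are assumptions for illustration, not Datadog's actual configuration.

```python
# Hypothetical sketch of a gRPC client channel for a service mesh:
# DNS-based name resolution, client-side round-robin load balancing,
# and keepalive pings. Target and tuning values are illustrative only.
import grpc

SERVICE_TARGET = "dns:///checkout.internal.example:50051"  # hypothetical service name

CHANNEL_OPTIONS = [
    ("grpc.lb_policy_name", "round_robin"),   # spread RPCs across resolved backends
    ("grpc.keepalive_time_ms", 30_000),       # ping idle connections every 30s
    ("grpc.keepalive_timeout_ms", 10_000),    # drop the connection if no ack in 10s
    ("grpc.enable_retries", 1),               # allow transparent retries
]


def build_channel() -> grpc.Channel:
    """Create a channel; a production service would use TLS credentials instead."""
    return grpc.insecure_channel(SERVICE_TARGET, options=CHANNEL_OPTIONS)


if __name__ == "__main__":
    channel = build_channel()
    # Generated stubs (from the service's .proto) would be attached to this channel.
    print("channel created:", channel)
    channel.close()
```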

Access Datadog privately and monitor your Google Cloud Private Service Connect usage

Private Service Connect (PSC) is a Google Cloud networking product that enables you to access Google Cloud services, third-party partner services, and company-owned applications directly from your Virtual Private Cloud (VPC). PSC helps your network traffic remain secure by keeping it entirely within the Google Cloud network, allowing you to avoid public data transfer and save on egress costs. With PSC, producers can host services in their own VPCs and offer a private connection to their customers.

Control your log volumes with Datadog Observability Pipelines

Modern organizations face a challenge in handling the massive volumes of log data—often scaling to terabytes—that they generate across their environments every day. Teams rely on this data to help them identify, diagnose, and resolve issues more quickly, but how and where should they store logs to best suit this purpose? For many organizations, the immediate answer is to consolidate all logs remotely in higher-cost indexed storage to ready them for searching and analysis.

Aggregate, process, and route logs easily with Datadog Observability Pipelines

The volume of logs generated from modern environments can overwhelm teams, making it difficult to manage, process, and derive measurable value from them. As organizations seek to manage this influx of data with log management systems, SIEM providers, or storage solutions, they can inadvertently become locked into vendor ecosystems, face substantial network costs and processing fees, and run the risk of sensitive data leakage.

Dual ship logs with Datadog Observability Pipelines

Organizations often adjust their logging strategy to meet their changing observability needs for use cases such as security, auditing, log management, and long-term storage. This process involves trialing and eventually migrating to new solutions without disrupting existing workflows. However, configuring and maintaining multiple log pipelines can be complex. Enabling new solutions across your infrastructure and migrating everyone to a shared platform requires significant time and engineering effort.

A closer look at our navigation redesign

Helping our users gain end-to-end visibility into their systems is key to the Datadog platform. To achieve this, we offer over 20 products and more than 700 integrations. However, with an ever-expanding, increasingly diverse catalog, it’s more important than ever that users have clear paths for quickly finding what they need.

Recapping Datadog Summit London 2024

In the last week of March 2024, Datadog hosted its latest Datadog Summit in London to celebrate our community. As Jeremy Garcia, Datadog’s VP of Technical Community and Open Source, mentioned during his welcome remarks, London is the first city to host two Datadog Summits, the first having taken place in 2018. It was great to see how our community there has grown over the past six years.

And what about my user experience?

Monitoring backend signals has been standard practice for years, and tech companies have long been alerting their SREs and software engineers when API endpoints are failing. But when you’re alerted about a backend issue, it’s often your end users who are directly affected. Shouldn’t we observe and alert on these user experience issues early on? As frontend monitoring is a newer practice, companies often struggle to identify signals that can help them pinpoint user frustrations or performance problems.
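To make the idea of alerting on user-experience signals concrete, here is a rough sketch that creates a metric monitor through Datadog's public monitor API. The metric name, tag, thresholds, and notification message are hypothetical placeholders; the post above does not prescribe this exact setup.

```python
# Hedged sketch: create a Datadog monitor on a frontend performance signal
# via the public REST API. The metric name, threshold, and message are
# hypothetical placeholders, not values taken from the post above.
import os

import requests

DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
API_URL = f"https://api.{DD_SITE}/api/v1/monitor"

monitor = {
    "name": "High page load time on checkout",  # hypothetical monitor name
    "type": "query alert",
    # Hypothetical metric: alert when average page load time over the
    # last 15 minutes exceeds 4 seconds.
    "query": "avg(last_15m):avg:frontend.page.load_time{page:checkout} > 4",
    "message": "Users are seeing slow page loads on checkout. @slack-frontend-oncall",
    "options": {"thresholds": {"critical": 4}},
}

response = requests.post(
    API_URL,
    json=monitor,
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    timeout=10,
)
response.raise_for_status()
print("created monitor:", response.json().get("id"))
```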