Monitor the cost of your public sector applications with Datadog Cloud Cost Management

As federal, state, and local government agencies work to modernize their digital infrastructure and applications, managing costs effectively remains a constant challenge. Federal directives like Cloud Smart indicate the need for public sector IT organizations to track and optimize their cloud spend. However, as an organization’s IT environment grows in complexity, it becomes difficult to correlate cost data and extract useful insights.

Troubleshooting RAG-based LLM applications

LLMs like GPT-4, Claude, and Llama are behind popular tools like intelligent assistants, customer service chatbots, natural language query interfaces, and many more. These solutions are incredibly useful, but they are constrained by the information they were trained on. As a result, LLM applications are often limited to providing generic responses that lack proprietary or context-specific knowledge, reducing their usefulness in specialized settings.

This Month in Datadog - October 2024

On the October episode of This Month in Datadog, Jeremy Garcia (VP of Technical Community and Open Source) covers unified Error Tracking, Security Operational Metrics, and a new Datadog Serverless feature for retrying or redriving failed AWS Step Functions executions directly from Datadog. Later in the episode, Shri Subramanian (Group Product Manager) spotlights Datadog LLM Observability’s native integration with Google Gemini. Also featured are our blog posts Operator vs.

Create ServiceNow tickets from Datadog alerts

ServiceNow is a popular IT service management platform for recording, tracking, and managing a company’s enterprise-level IT processes in a single location. In addition to helping you manage your ServiceNow CMDB, Datadog also integrates with ServiceNow IT Operations Management (ITOM) and IT Service Management (ITSM), enabling you to automatically create and manage ServiceNow incidents and events from the Datadog platform.

How we use Scorecards to define and communicate best practices at scale

In modern, distributed applications, shared standards for performance and reliability are key to maintaining a healthy production environment and providing a dependable user experience. But establishing and maintaining these standards at scale can be a challenge: when you have hundreds or thousands of services overseen by a wide range of teams, there are no one-size-fits-all solutions. How do you determine effective best practices in such a complex environment?

Datadog on Building Reliable Distributed Applications Using Temporal

Temporal is an open source platform for building resilient and reliable distributed systems. Datadog started using Temporal in 2020 as the foundation for our internal software delivery platform. Since then, it has been widely adopted as a platform that any engineering team can use to build their systems. In this Datadog on episode, Ara Pulido chats with Loïc Minaudier, Senior Software Engineer on the Atlas team, responsible for providing a developer platform on top of Temporal, and Allen George, Engineering Manager on the Datadog Workflows team.

Introducing the Datadog Architecture Center

To prevent visibility gaps in your cloud environment, you need to efficiently deploy observability solutions that integrate easily with key technologies in your stack and scale reliably with new applications and migrated workloads. But observability deployments can be complex, often requiring deep and specific knowledge that may not be available within your teams.

Track and troubleshoot MongoDB performance with Datadog Database Monitoring

Many modern applications rely on MongoDB and MongoDB Atlas to manage growing data volumes and to provide flexible schema and data structures. As organizations adopt these and other NoSQL databases, effective monitoring and optimization become critical, especially in distributed environments.

Ensure high service availability with Datadog Service Management

Adopting a cloud-based, distributed architecture may help your organization scale quickly, but it can also add complexity. Correlating telemetry, security signals, and alerts across services often proves difficult, resulting in slower issue remediation. Additionally, when something goes wrong, figuring out who to contact (for example, the on-call responder or the service owner) may become needlessly time-consuming.