
Three reliability best practices when using AI agents for coding

One of the biggest causes of outages and incidents is good old-fashioned human error. Despite all of our best intentions, we can still make mistakes, like forgetting to change defaults, making small typos, or leaving conflicting timeouts in the code. It’s why 27.8% of unplanned outages are caused by someone making a change to the environment. Fortunately, reliability testing can help you catch these errors before they cause outages.
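As a minimal sketch of the kind of misconfiguration a reliability test can catch, consider conflicting timeouts. All the names and values below are hypothetical, not taken from any particular system:

```python
# Hypothetical config: an outer caller's overall deadline that is shorter than
# the timeout it grants a single inner dependency call. The inner call can
# exhaust the entire budget before the caller gives up, so retries never run.
OUTER_DEADLINE_S = 1.0   # how long the caller waits overall
INNER_TIMEOUT_S = 5.0    # how long one downstream request may take

def timeouts_are_consistent(outer_deadline: float, inner_timeout: float) -> bool:
    """A caller's overall deadline should exceed any single inner timeout."""
    return outer_deadline > inner_timeout

# A reliability test would flag this configuration before it ships:
assert not timeouts_are_consistent(OUTER_DEADLINE_S, INNER_TIMEOUT_S)
```

A check like this is trivial to automate, which is exactly why testing catches the "small typo" class of errors that humans routinely miss in review.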

How to Achieve Ethical Quality Assurance (QA) for Your Software Using Artificial Intelligence (AI)

As the use of artificial intelligence (AI) for software testing and quality assurance (QA) becomes increasingly prevalent, there are ethical considerations that must be addressed to ensure fairness, transparency, and accountability.

Enhancing Jenkins performance: Resource optimization for high-traffic workloads

Jenkins is the backbone of many CI/CD pipelines, automating builds, tests, and deployments at scale. However, when handling high-traffic workloads, such as during peak development hours, large-scale deployments, or parallel builds and pipelines, Jenkins can quickly become a resource hog, leading to slow builds, queue backlogs, and even system crashes. Optimizing resource usage is essential to ensure smooth, efficient, and scalable performance.

Monitor Microsoft Azure in Grafana Cloud: simplify and centralize your cloud provider observability

Organizations around the world use Microsoft Azure to power their businesses. The cloud computing platform includes hundreds of products and services organizations can use to build and manage applications, but monitoring those environments can often feel like navigating a maze of fragmented data, tools, and processes.

OpenTelemetry Metrics Explained: A Guide for Engineers

OpenTelemetry (often abbreviated as OTel) is the gold-standard observability framework, allowing users to collect, process, and export telemetry data from their systems. OpenTelemetry’s framework is organized into distinct signals, each covering a different aspect of observability. Among these signals, OpenTelemetry metrics are crucial in helping engineers understand their systems.
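To make the metrics signal concrete, here is a stdlib-only sketch of the recording semantics behind two common OTel instrument kinds, a counter and a histogram. This is a conceptual illustration, not the OpenTelemetry API itself:

```python
from collections import defaultdict

class Counter:
    """Monotonic sum, like an OTel counter: it only ever goes up."""
    def __init__(self):
        self.values = defaultdict(float)  # keyed by attribute set

    def add(self, amount, **attributes):
        assert amount >= 0, "counters are monotonic"
        self.values[frozenset(attributes.items())] += amount

class Histogram:
    """Bucketed distribution of values, like an OTel histogram."""
    def __init__(self, boundaries=(5, 10, 25, 50, 100)):
        self.boundaries = boundaries
        # one bucket per boundary, plus an overflow bucket
        self.bucket_counts = [0] * (len(boundaries) + 1)

    def record(self, value):
        for i, bound in enumerate(self.boundaries):
            if value <= bound:
                self.bucket_counts[i] += 1
                return
        self.bucket_counts[-1] += 1

# Counters suit totals (requests served); histograms suit distributions (latency).
requests = Counter()
requests.add(1, route="/home", status="200")
requests.add(1, route="/home", status="200")

latency_ms = Histogram()
for ms in (3, 12, 47, 230):
    latency_ms.record(ms)
```

In real OpenTelemetry code you would obtain these instruments from a `Meter` via the SDK rather than defining them yourself; the point here is only what each instrument records.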

How to avoid blowing the budget on Azure AI

So you had a great day playing with awesome new tech, solving big business challenges, and feeling like you really nailed it. Then you wake up the next day to an alert from Azure telling you you've blown your monthly budget, and it's only the first week of the month. We've all been there... right? Using any cloud service comes with a cost, but for most services the budget risk is low. Costs calculated daily aren't a problem when usage is predictable, but not everything works like that.
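One simple guardrail pattern, sketched here with made-up numbers, is to project month-end spend from the daily run rate and raise an alert when the projection exceeds the budget. Azure's own budget alerts do this kind of tracking for you; this is just the underlying arithmetic:

```python
def projected_month_spend(spent_so_far: float, day_of_month: int,
                          days_in_month: int = 30) -> float:
    """Naive linear projection: assume the current daily run rate continues."""
    daily_rate = spent_so_far / day_of_month
    return daily_rate * days_in_month

BUDGET = 500.0          # hypothetical monthly budget in USD
spent, day = 400.0, 5   # most of the budget gone in the first week

# 400 / 5 = 80 per day, projecting to 2400 for the month -- time to alert.
assert projected_month_spend(spent, day) > BUDGET
```

A linear projection is deliberately crude; bursty AI workloads are exactly the case where daily cost review, budget caps, and alerts matter most.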

Search and analyze unsampled logs in real time with Live Tail

With thousands of logs generated every minute from your infrastructure, applications, services, and devices, retaining all of this data for active search and analysis can be cost-prohibitive. Because log volumes continue to grow rapidly as operations scale, it’s common for organizations to implement log management strategies and limit the amount that they store in order to minimize costs.

Integration roundup: Monitoring your modern data platforms

Modern applications increasingly rely on specialized databases and platforms to power real-time analytics and support advanced AI/ML capabilities. These tools help teams accelerate development by consolidating workflows and processes, enabling faster and more efficient data operations. That’s why Datadog has launched three new data platform integrations with Supabase, DuckDB, and Milvus.

Networks are everyone's business - TCP Checks for app developers

Checkly is the industry’s best tool to monitor your production applications. With the power of Playwright, developers can test the systems they’ve developed, and roll out those tests as production monitors running from multiple geographies on the Checkly platform. Checkly already monitors thousands of API endpoints with complex validation, setup and cleanup scripts, and reliable alerting. So why are we expanding into TCP-based checks?
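At its core, a TCP check is just a connect attempt with a timeout. Here is a rough, generic sketch of that probe in Python (not Checkly's implementation):

```python
import socket
import time

def tcp_check(host: str, port: int, timeout: float = 5.0):
    """Attempt a TCP connection; return (success, latency_seconds or None)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:  # refused, unreachable, or timed out
        return False, None
```

A production monitor would run a probe like this on a schedule from multiple regions and alert on failures or latency thresholds; the sketch only shows the probe itself.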