The latest news and information on monitoring for websites, applications, APIs, infrastructure, and other technologies.

FinOps and Cloud Cost Optimization

As companies scale, it’s become increasingly important to keep cloud cost management and optimization top of mind. In this talk, Yuval Yogev from Sygnia walks you through how Sygnia cut its total cloud costs in half. Yogev also shares insights into how you can optimize your own organization’s cloud usage and spend.

Ask a Site Reliability Engineer (SRE)

Site reliability engineering (SRE) can be complicated, and at Datadog, we’ve spent a lot of time thinking about SRE and refining how we implement it. Join Datadog’s Brandon West and Rick Mangi as they provide a brief overview of SRE and its core concepts. This video also contains a Q&A session from the live taping of this panel.

Scaling Up, One Network Bottleneck at a Time

Processing data at scale involves moving packets through a network—but what happens when that network isn't cooperative? Anatole Beuzon, a Software Engineer at Datadog, discusses how he investigated and resolved network issues in Datadog’s larger data-processing apps and how you can apply these same methods to your own production workloads.

I've Made a Huge Mistake: Implementing Agile on Infrastructure Teams

Bad planning methods can damage team morale and prevent teams from improving the systems they maintain. In this talk, Sam Handler from Shopify explains how his attempts to fix poor infrastructure planning processes through Agile methods failed. Drawing from this experience, he offers several principles that can help infrastructure teams improve the way they work.

Empower the SREs - Conclusions from The SRE Report 2023

Let's be honest, nobody loves surveys. OK, well, I sure don't. But surveys satisfy a huge need for insights into complex human-computer, sociotechnical systems. It turns out that we've been measuring the computer part pretty well, but the humans are not as easy to keep track of. When Google SRE first defined toil as a metric we wanted to reduce, we spent far too long trying to quantify it numerically based on tooling and insights from computer systems.

Observability is Still Broken. Here are 6 Reasons Why.

In an era with no shortage of established best practices and tools, engineering teams are consistently finding it harder to prevent, detect, and resolve production issues. Why is this the case? Our most recent DevOps Pulse Survey highlighted alarming trends to this end.

What is AIOps (Artificial Intelligence for IT Operations)? AIOps Use Cases

The volume of data that IT systems generate nowadays is overwhelming, and without intelligent monitoring and analysis tools, it can result in missed opportunities, missed alerts, and expensive downtime. However, with the advent of Machine Learning and Big Data, a new category of IT operations tool has emerged, called AIOps. AIOps can be defined as the practical application of Artificial Intelligence to augment, support, and automate IT processes.

How to monitor Windows logs with the updated Windows integration for Grafana Cloud

As we all know, Windows is one of the most popular operating systems in the world. It has a dominant share of the desktop computer market, with more than 70% of machines running it. It makes sense, then, that the Windows integration is also one of the most popular integrations in Grafana Cloud.