Operations | Monitoring | ITSM | DevOps | Cloud

Google Workspace outage: July 18, 2025

Google Workspace went down again in July 2025—but if you had asked AI tools like Google’s own AI Overviews, ChatGPT, or Claude, you would have been told everything was fine. Every one of these tools incorrectly claimed that services were up and running while users across the globe were unable to connect, send messages, or even log in.

SentinelOne outage: July 10 incident went unacknowledged

On July 10, 2025, SentinelOne, a leading cybersecurity platform, experienced a widespread outage that disrupted access to its admin consoles across multiple regions. The incident impacted users in Europe, North America, and beyond, preventing security teams from accessing critical management features. Despite the scale of the disruption, SentinelOne issued no official public acknowledgment or status update.

Kentik Cause Analysis in 60 Seconds

In a world where network traffic can suddenly spike, manually sifting through flow data is often a daunting task. Kentik AI's new Cause Analysis simplifies troubleshooting by quickly identifying changes in traffic by application, IP, ASN, or service. With just a few clicks, Cause Analysis helps you compare time periods, understand traffic shifts, and detect changes in your network. Kentik: Take the hard work out of running your network.
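The kind of time-window comparison described above can be sketched in a few lines. This is a hypothetical illustration of the general technique, not Kentik's actual implementation or schema: tally traffic per key (here, an application/ASN pair) in a baseline window and a current window, then rank keys by how much their volume shifted.

```python
# Hypothetical sketch of a traffic "cause analysis": compare per-key traffic
# totals between a baseline and a current window and rank the keys whose
# volume shifted the most. Keys and values below are illustrative only.

def traffic_delta(baseline, current):
    """Return (key, change) pairs ranked by absolute change in bytes."""
    keys = set(baseline) | set(current)
    deltas = {k: current.get(k, 0) - baseline.get(k, 0) for k in keys}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Per-(application, ASN) byte counts for two time windows (made-up numbers)
baseline = {("https", "AS15169"): 900, ("dns", "AS13335"): 120}
current  = {("https", "AS15169"): 2400, ("dns", "AS13335"): 110}

ranked = traffic_delta(baseline, current)
# ranked[0] is ("https", "AS15169") with a shift of +1500 bytes
```

A real flow-analytics pipeline would aggregate millions of flow records per window, but the ranking step is conceptually this simple.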

Beyond AI hype: put reliability at the forefront

Reliability is a constant for every technology, whether it’s cloud, microservices, or AI. Full transcript:  Just a few years ago everybody was screaming about microservices, "That's the wave of the future," and now everybody's looking at AI. No matter what the change in technology hot topic is, your reliability should still be at the forefront of everything that you're doing.

Are you running AI the smart way?

- Data locality: AI models often rely on large datasets. Locating compute close to the data reduces transfer times and improves training performance.
- Latency sensitivity: Real-time AI applications, like recommendation systems or edge analytics, depend on low-latency environments. This can be more easily tuned in private or hybrid setups.
- Hardware specialization: Some AI workloads benefit from custom hardware like GPUs or TPUs. Private cloud allows more control over this, while public cloud offers broader access but less customization.
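The data-locality point is easy to quantify with back-of-the-envelope arithmetic. The dataset size and link speeds below are illustrative assumptions, but they show why moving a large training set across a WAN dwarfs the same transfer over a local fabric:

```python
# Rough estimate of how long it takes to move a training dataset to compute
# at different link speeds. Sizes and bandwidths are illustrative assumptions.

def transfer_hours(dataset_gb, bandwidth_gbps):
    """Hours to move dataset_gb gigabytes over a bandwidth_gbps (gigabit/s) link."""
    seconds = (dataset_gb * 8) / bandwidth_gbps  # gigabytes -> gigabits
    return seconds / 3600

# A 10 TB dataset over a 1 Gbps WAN link vs. a 100 Gbps local fabric
wan_hours = transfer_hours(10_000, 1)      # ~22.2 hours
local_hours = transfer_hours(10_000, 100)  # ~0.22 hours
```

Two orders of magnitude in bandwidth is the difference between an overnight transfer and a coffee break, which is why co-locating compute with data matters for training throughput.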

Is on-prem the top choice to run AI?

In this episode, we break down what we’ve learned from teams running AI at scale, and why on-premises infrastructure is making a strong comeback. We’re seeing a shift: performance, cost control, data sovereignty, and platform flexibility are driving conversations about on-prem strategies for AI. There are no one-size-fits-all answers, but if you’re building or scaling AI, this might help you think a few steps ahead.

Out-of-the-box Alerting for Frontend Observability in Grafana Cloud

Get alerted on frontend issues the moment they happen, with no setup headaches required. In this short demo, Elliot Kirk from Grafana Labs introduces out-of-the-box alerting for frontend observability. Whether you're tracking error counts or web vitals, this new feature makes it easy to stay ahead of performance issues. With just a few clicks, you can:
- Enable prebuilt alerts for your apps
- Visualize and edit alerts directly in the UI
- Customize thresholds and durations
- Set up notifications and stay in the loop
- Launch alerting with every new app setup

Semantic Caching: What We Measured, Why It Matters

Semantic caching promises to make AI systems faster and cheaper by reducing duplicate calls to large language models (LLMs). But what happens when it doesn’t work as expected? We built a test environment with a caching layer to evaluate how semantically similar queries behave. When the cache worked, response times were fast. When it didn’t, things got expensive. In fact, a single semantic cache miss increased latency by more than 2.5x.
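For readers unfamiliar with the mechanism: a semantic cache stores responses keyed by an embedding of the query, and serves a cached answer when a new query is similar enough to a stored one. The sketch below is a minimal illustration of that idea, not the test environment described above; it uses a toy bag-of-words embedding (a real system would use a sentence-embedding model) and an assumed similarity threshold of 0.8.

```python
import math
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words term frequencies.
    # A production cache would use a learned sentence-embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Serve a cached response when a query is similar enough to a stored one."""
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query):
        q = embed(query)
        best, best_sim = None, 0.0
        for emb, resp in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best, best_sim = resp, sim
        # Below the threshold is a cache miss: the expensive LLM call happens.
        return best if best_sim >= self.threshold else None

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.8)
cache.put("how do I reset my password", "Visit the account settings page.")
hit = cache.get("how do I reset my password please")   # similar enough: hit
miss = cache.get("what is the weather today")          # unrelated: miss
```

The threshold is the knob behind the cost numbers above: set it too high and near-duplicate queries miss (paying full LLM latency), too low and users get cached answers to questions they didn't ask.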