
AWS outage takes down more than 150 cloud services

On May 7th and 8th, 2026, Amazon Web Services (AWS) experienced an outage affecting Amazon Elastic Compute Cloud (EC2) in the dreaded US East 1 region. The original AWS region, located in Northern Virginia and known as us-east-1 or just “US East”, has been the subject of some of the internet’s most high-profile and destructive outages and remains Amazon’s least reliable region.

Operational Intelligence and the Hidden Structure in System Logs

Most IT teams do not suffer from a lack of data. They suffer from the amount of effort required to make sense of it. Every network device, application, cloud service, and infrastructure component generates a constant stream of machine output. Logs capture state changes, failures, retries, warnings, and thousands of other small signals about how systems behave. The problem is that raw logs are hard to use at operational speed.

Multi-tiered Observability: A Practical Way to Handle Diverse Workloads

Observability in large companies is rarely one-size-fits-all. The VictoriaMetrics topologies guide shows why different deployment patterns are needed as scale, isolation, and reliability requirements grow. Different workloads require different trade-offs: some need long retention for audits and trend analysis, while others need higher resolution for debugging. Business-critical systems also demand dependable alerting and high availability, often with several 9s of reliability.

Monitor Unreal Engine Game Performance with Application Metrics

Your Unreal game can ship with zero errors and still not feel great. Stutters during combat, a frame-rate cliff on the big boss, rubber-banding in multiplayer: none of it shows up as a crash and none of it shows up in Sentry, leaving you without any visibility into what your players are actually experiencing in the wild. Well, until now. Unreal Engine already gives you plenty of tools to measure game performance and collect runtime stats, but all that data stays on the dev’s machine.

The Journey to Production AI: Five Steps for SRE and Platform Teams

In a recent webinar, The Journey to Production AI, Andre Elizondo walked through what separates a working agent demo from an agent worth trusting on a 2 a.m. page. Live polls during the session put numbers behind a pattern most platform teams already feel. Most teams are early. The ones who are further along did not get there by shipping a flashier demo. They got there by treating production AI as a platform problem.

How Modern Ops Lost Their Bearings

Modern operations carry a quiet contradiction. Organizations have never had more data, more dashboards, or more instrumentation, yet teams increasingly struggle to gain a reliable sense of what the environment is actually doing. The problem is not the absence of information. It is the absence of bearings. This drift did not happen suddenly. It accumulated across years of transformation.

A Runnable Reference Architecture for Battery Energy Storage Systems on InfluxDB 3

A battery is a complex electrochemical system where safety and revenue are decided in milliseconds. Cell temperatures, voltages, and state of charge change in real time; dispatch decisions and thermal alarms must fire in real time. Anything in between (your data pipeline, your historian, your alerting layer) has to disappear into the background.

Diagnose and resolve database performance issues faster with Database Investigator

When your database performance degrades, diagnosing the root cause is rarely quick or straightforward. Your existing tools might surface metrics like CPU utilization, wait events, and query duration, but then leave you to correlate the data and identify what went wrong. Worse, what first appears to be the root cause can often just be a downstream effect of multiple interrelated issues.

Zero-Code OpenTelemetry for Vert.x

Drop a JAR on the JVM. Get distributed tracing, RxJava context propagation, log-trace correlation, and Vert.x internal metrics. No code changes. No Maven dependency. Java 8–21. Inside the design of last9/vertx-opentelemetry v2.3.4. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

From noise to knowledge: How GenAI is revolutionizing log management and analytics

Focusing on GenAI and logs for IT efficiency. Efficiency is everything for managing today’s digital systems. Constantly transforming technology and expanding operations are driving an explosion in data. Consequently, data ingest and storage costs have soared. But it’s not just data storage costs that keep teams behind. The challenge of managing all that observability data forces IT teams to choose between efficiency and the bottom line.