The latest News and Information on Service Reliability Engineering and related technologies.

{unscripted} AI SRE

Sep 27, 2025 By Harness In Harness

Harness AI SRE is a comprehensive incident management system that uses AI to enable teams to detect, respond to, and resolve incidents efficiently. It integrates with various monitoring, alerting, and collaboration tools to provide a seamless incident resolution workflow.

How to Become an SRE Engineer

Sep 27, 2025 By Alexandr Bandurchin In Uptrace

Site Reliability Engineering has emerged as one of the most sought-after careers in tech, combining software engineering expertise with operational excellence. SRE engineers ensure that critical systems remain reliable, scalable, and performant while enabling rapid feature development. With the global SRE job market projected to grow by over 25% in 2025, skilled professionals in this field command competitive salaries and enjoy diverse career opportunities across industries.

Monitor Kubernetes Hosts with OpenTelemetry

Sep 26, 2025 By Anjali Udasi In Last9

It’s 3 AM. API latency just spiked from 200ms to 2s. Alerts are firing, and users are frustrated. You SSH into the first server: top, free -h, iostat — nothing unusual. On to the next host. And the next. That’s how most of us learned to debug. The tools worked, and we got good at using them. But as infrastructure became distributed and dynamic, this approach started to break down. Modern monitoring needs more than SSH and top. It needs unified telemetry.

Key APM Metrics You Must Track

Sep 23, 2025 By Anjali Udasi In Last9

Application Performance Monitoring (APM) helps you understand how your software runs in production. When you track the right metrics, you see how requests move through your system, where slowdowns happen, and how resources are being used. With this knowledge, you can spot issues early and keep your applications reliable for your users. In this blog, we discuss the key APM metrics to monitor, grouped into categories, and why each one matters for performance and user experience.

Analyze alert rules and reduce alert fatigue with Last9 MCP

Sep 23, 2025 By Last9 - Monitoring for AI Native SDLC In Last9

Learn more: https://last9.io/mcp/
Get started: https://last9.io/docs/mcp/

How to Connect Jaeger with Your APM

Sep 22, 2025 By Anjali Udasi In Last9

Microservices make it tough to understand how applications behave end-to-end. Most teams already rely on an Application Performance Monitoring (APM) tool to track system health. But as requests move across many services, you also need distributed tracing. Jaeger gives you that visibility. The real value comes from connecting the two. Instead of running APM and Jaeger in silos, you can combine their strengths (metrics from your APM and traces from Jaeger) to get a clearer view of performance.

AWS Prometheus: Production Patterns That Help You Scale

Sep 19, 2025 By Anjali Udasi In Last9

You've got Prometheus running in one cluster — maybe a dev environment, a single EKS cluster, or a proof-of-concept setup. The configuration is straightforward: node_exporter on a few EC2 instances, some service discovery for pods, and a single Prometheus server scraping everything. Storage is local, retention is 15 days, and you can keep all the default recording rules without worrying about costs.

What is Asynchronous Job Monitoring?

Sep 17, 2025 By Anjali Udasi In Last9

Modern applications don’t process everything inside the request/response path. To keep APIs responsive, time-consuming work like image resizing, payment processing, or data syncs is moved into background queues. Workers then pick up these asynchronous jobs and run them outside the main thread. Asynchronous job monitoring is the practice of tracking these background tasks. Without this visibility, background workers become a blind spot.
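
As a rough illustration of what tracking background tasks can involve, here is a minimal sketch using only the Python standard library. The job names, the `JobRecord` fields, and the `enqueue`/`run_worker` helpers are hypothetical and are not taken from the linked post; a real setup would export these values to a monitoring backend rather than keep them in a list.

```python
import queue
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class JobRecord:
    name: str
    status: str          # "succeeded" or "failed"
    wait_seconds: float  # time spent waiting in the queue
    run_seconds: float   # time spent executing


def enqueue(jobs: queue.Queue, name: str, func: Callable[[], object]) -> None:
    # Store the enqueue timestamp so queue wait time can be measured later.
    jobs.put((name, func, time.monotonic()))


def run_worker(jobs: queue.Queue, records: List[JobRecord]) -> None:
    """Drain the queue, recording an outcome for every job."""
    while True:
        try:
            name, func, enqueued_at = jobs.get_nowait()
        except queue.Empty:
            return
        started = time.monotonic()
        try:
            func()
            status = "succeeded"
        except Exception:
            # A failed job is recorded instead of disappearing silently.
            status = "failed"
        finished = time.monotonic()
        records.append(
            JobRecord(name, status, started - enqueued_at, finished - started)
        )


jobs: queue.Queue = queue.Queue()
records: List[JobRecord] = []
enqueue(jobs, "resize_image", lambda: None)   # hypothetical job that succeeds
enqueue(jobs, "sync_data", lambda: 1 / 0)     # hypothetical job that fails
run_worker(jobs, records)

failure_rate = sum(r.status == "failed" for r in records) / len(records)
```

Recording status, queue wait, and run time per job is one possible choice of signals; it is enough to surface failed jobs and a growing backlog.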

Kubernetes Service Discovery Explained with Practical Examples

Sep 16, 2025 By Faiz Shaikh In Last9

In Kubernetes, applications are constantly changing — new pods start, old ones shut down, workloads shift across nodes. The challenge is making sure that different parts of your system, and even external clients, can still find each other when the actual locations keep moving. That’s what service discovery handles. It provides a stable way for applications to connect and communicate, no matter where they’re running or how often the underlying infrastructure changes.
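
One concrete form of that stable way to connect is the DNS name Kubernetes assigns to a Service, which stays the same while the pods behind it come and go. The sketch below only builds that name; the service and namespace values are hypothetical, and `cluster.local` is assumed as the cluster domain (the common default, which a cluster may override).

```python
def service_dns_name(
    service: str,
    namespace: str = "default",
    cluster_domain: str = "cluster.local",  # assumed default; clusters can override it
) -> str:
    """Return the fully qualified in-cluster DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


# Hypothetical service "payments" in namespace "prod".
payments_host = service_dns_name("payments", "prod")
```

A client inside the cluster would connect to this hostname instead of any individual pod IP.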

What is Database Monitoring? A Guide for Developers, DevOps, and SREs

Sep 15, 2025 By Pavithra Parthiban In Atatus

Databases handle critical operations for applications, from online banking to e-commerce and streaming services. Any slowdown or failure can directly affect application performance and user experience. Database monitoring tracks performance, detects issues, and helps prevent downtime. It also ensures efficient use of resources, maintains security, and supports compliance requirements.
