December 2024

Sponsored Post

What's new in Avantra 25 - AIOps for Cloud ERP

I am pleased to announce that we have released Avantra 25, the next evolution of the Avantra platform. This year we have focused on all things cloud: native support for SAP BTP and SAP S/4HANA Public Cloud Edition, SAP RISE-capable automation templates in our add-in library, and Avantra AIR, our very own cloud-based AI extension for Avantra. There's a lot to like in Avantra 25, so let's dig deeper into the new features. For a complete list of changes, check out our public release notes.

Traceparent and Tracestate Explained: A Guide to Distributed Tracing with Atatus

In modern microservice architectures, requests often span multiple services, making it challenging to monitor and debug performance issues. Distributed tracing provides the ability to follow a request’s journey through these services, identifying performance bottlenecks and dependencies. The W3C trace context standard simplifies this process by introducing two critical headers: traceparent and tracestate.
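To make the two headers concrete, here is a minimal Python sketch of how a service might read them. It assumes the version 00 layout of `traceparent` from the W3C Trace Context specification (version, trace ID, parent ID, and flags, separated by hyphens) and treats `tracestate` as a simple comma-separated list of key=value pairs. The function and class names are illustrative, not part of any particular tracing library, and the sketch skips the finer validation rules of the specification.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# Layout of a version-00 traceparent value:
#   version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" trace-flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    # Simplification: this sketch only accepts version 00.
    if match["version"] != "00":
        return None
    # All-zero trace or parent IDs are not valid.
    if set(match["trace_id"]) == {"0"} or set(match["parent_id"]) == {"0"}:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    sampled = bool(int(match["flags"], 16) & 0x01)
    return TraceParent(match["version"], match["trace_id"], match["parent_id"], sampled)


def parse_tracestate(header: str) -> List[Tuple[str, str]]:
    """Split tracestate into ordered (vendor key, value) pairs."""
    pairs = []
    for member in header.split(","):
        member = member.strip()
        if "=" not in member:
            continue  # skip empty or malformed members
        key, value = member.split("=", 1)
        pairs.append((key, value))
    return pairs


# Example header values taken from the W3C Trace Context specification.
parent = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
state = parse_tracestate("rojo=00f067aa0ba902b7,congo=t61rcWkgMzE")
print(parent)
print(state)
```

In practice, a tracing SDK would normally handle this propagation for you; the sketch only shows what the header values carry.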

Supercharging FerretDB Performance with Coroot: A Success Story

At Coroot, we’re passionate about providing developers with the tools they need to build and maintain high-performing applications. Recently, we had the opportunity to help a team using FerretDB, the open-source document database offering MongoDB compatibility with a PostgreSQL backend, significantly improve their monitoring and performance. This is their story.

Adding a Grafana Dashboard to Your Prometheus Setup

This article is part of a series on setting up an end-to-end monitoring and alerting stack using Prometheus. Continuing our series on setting up Prometheus in a Docker container, we will add a Grafana instance to our Prometheus setup. Please refer to the previous article, where we use Docker Compose to run Prometheus and Alertmanager together, as that forms the basis for running multiple related containers. In this article, we will add a container running Grafana to the same compose file.

Amazon Bedrock vs OpenAI: Guide to Your Best Generative AI Platform

Amazon has heard FinOps practitioners’ cries for new AI tools, and the answer is Titan and AWS Bedrock. These new tools offer familiar generative AI abilities: generating images as you would expect from DALL-E, operating as a Large Language Model (LLM) like ChatGPT, and even transcribing audio to text. But how do these new tools compare to pre-existing ones like Azure’s OpenAI? Most importantly, which of these tools is the best financial investment for your organization?

How Network Configuration Manager complements DevOps and automation practices

Network automation through DevOps practices has revolutionized IT infrastructure and configuration management, enabling faster deployments and improved operational efficiency. However, as networks grow more complex and diverse, solely relying on an in-house DevOps team can leave gaps in critical areas like compliance, security, and scalability. This is where Network Configuration Manager becomes indispensable.

Top 5 Things to Consider When Selecting a Log Analysis Platform

In this blog, we discuss in detail why log analysis techniques are vital for operating and protecting today’s complex IT networks. Log data originating from an organization’s software applications, networks, and security tools makes it possible to understand how the source systems function and to analyze user behavior. It can also reveal situations that may indicate security issues.

Leverage log analytics dashboards for better monitoring

Visuals often communicate better than words, and this is also true for monitoring systems. Dashboards are an essential feature in log monitoring systems, providing great value to those who need to analyze and monitor logs. They help centralize log data in a simple, easy-to-read format, avoid clutter, and allow the team to focus on critical metrics.

WebSocket Application Monitoring: An In-Depth Guide

Real-time applications are everywhere and WebSocket monitoring is the key to running them smoothly. Whether you’re overseeing live chat platforms, online games, or collaborative tools, you want to make sure your WebSocket communications are flawless. This guide breaks down the best practices and tools for monitoring WebSocket applications so you can optimize performance and reliability effortlessly.

Breaking the chains: Exploring the decentralized power of IPFS

Welcome to the universe of IPFS, a protocol and peer-to-peer network that helps users store and retrieve files based on their content rather than their location. It was created by Juan Benet, founder of Protocol Labs, and launched in 2015. IPFS enables its users to store and share content similarly to the way BitTorrent does.
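The idea of addressing data by content rather than by location can be sketched in a few lines of Python. This is only an illustration of the principle, not how IPFS itself is implemented: the `ContentStore` class is hypothetical, it keeps everything in one in-memory dictionary, and it uses a plain SHA-256 hex digest as the key, whereas real IPFS uses its own content identifiers (CIDs) and distributes blocks across peers.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is derived from the data itself."""

    def __init__(self) -> None:
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # The address is the hash of the content, so identical content
        # always maps to the same address, wherever it is stored.
        address = hashlib.sha256(data).hexdigest()
        self._blocks[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._blocks[address]
        # Anyone can verify that the returned bytes match the address.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored content does not match its address")
        return data


store = ContentStore()
address = store.put(b"hello, distributed web")
print(address)
print(store.get(address))
```

Because the address depends only on the bytes, a reader does not need to trust whoever served the file: re-hashing the content is enough to check it.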

Celebrating excellence: ManageEngine OpManager is one of SourceForge and Slashdot's Top Performers of 2024

ManageEngine OpManager has recently been recognized as a Top Performer for Fall 2024 on SourceForge and Slashdot, garnering 1,229 reviews and ratings. This prestigious accolade highlights its exceptional capabilities in network monitoring and management. The recognition is a testament to OpManager’s commitment to delivering robust solutions that cater to the evolving needs of IT professionals.

14 top network monitoring trends in 2025

What's shaping the future of network monitoring in 2025? A window into network monitoring preferences across the world reveals a convergence of business, technology, and societal shifts. With new technologies like generative AI stepping into the spotlight, the question remains: how will it shape network monitoring? And, is Site24x7 ready for the future? Artificial intelligence (AI) and machine learning (ML) are transforming the way we think about network monitoring.

Enhancing your defenses: Key DRA updates from 2024

2024 saw a dramatic rise in sophisticated cyber threats, making robust security posture management more critical than ever. Throughout the year, Site24x7's Digital Risk Analyzer evolved to meet these challenges, empowering enterprises to proactively identify, analyze, and mitigate digital risks. This year-end roundup highlights the key enhancements and features that helped our users stay ahead of the curve in 2024.

Netdata Featured with Multiple "Best Of" Category Badges in 2024

As we are close to the end of this year, we are thrilled to announce that Netdata has been recognized with multiple “Best of” badges from Gartner Digital Markets brands: Capterra, Software Advice, and GetApp, leading software recommendation search engines. This “Best of” badges program is an independent assessment that evaluates user reviews to help buyers identify the highest-rated software companies in specific categories that offer the most popular solutions.

Understanding Buckets in Prometheus: A Comprehensive Guide with Real-Time Examples

Prometheus is an open-source monitoring and alerting toolkit that helps developers and operators track the performance and health of their systems. One of its key features is the ability to use buckets to measure and analyse distributions of data. Buckets are essential for tracking HTTP request durations, database query times, and memory usage, helping to understand system behaviour.
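As a rough illustration of how buckets work, the following Python sketch counts observations into cumulative buckets, mirroring the way Prometheus histograms expose a count per upper bound (the `le` label) plus a final `+Inf` bucket. The `Histogram` class and the bucket boundaries are assumptions made for this example; it does not use the Prometheus client library.

```python
import math
from bisect import bisect_left
from typing import List, Sequence, Tuple


class Histogram:
    """Toy histogram with Prometheus-style cumulative buckets."""

    def __init__(self, upper_bounds: Sequence[float]) -> None:
        # A final +Inf bucket catches every observation.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self._counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # Find the first bucket whose upper bound is >= value
        # (bounds are inclusive, like the "le" label).
        index = bisect_left(self.upper_bounds, value)
        self._counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative(self) -> List[Tuple[float, int]]:
        # Each bucket reports all observations <= its upper bound.
        running = 0
        result = []
        for bound, count in zip(self.upper_bounds, self._counts):
            running += count
            result.append((bound, running))
        return result


# Hypothetical request durations in seconds.
durations = Histogram(upper_bounds=[0.1, 0.5, 1.0])
for seconds in [0.05, 0.1, 0.3, 0.7, 2.0]:
    durations.observe(seconds)

for bound, total in durations.cumulative():
    print(f"le={bound}: {total}")
```

Choosing boundaries close to the latencies you care about matters, because quantile estimates derived from buckets can only be as precise as the bucket layout allows.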

AWS Bedrock Pricing: Your 2024 Guide to Amazon Bedrock Costs

The future is AI. That’s a fact, and all the major cloud corporations are taking notice and investing in generative AI offerings to serve their customers better. Microsoft Azure has invested in OpenAI’s ChatGPT, Google has Vertex AI, and Amazon has created Bedrock. But what exactly is AWS Bedrock? And, most importantly, how much will it cost? Will this generative AI be an easy investment, or will you have to break the budget to squeeze it in?

Anodot Wins Rising Star Technology Partner Award in the 2024 EMEA AWS Partner Awards

Imagine standing on stage at AWS re:Invent, surrounded by the industry’s best, as Anodot is named the 2024 Rising Star Technology Partner. It’s a proud moment for our team and a testament to the innovation driving our platform.

AWS re:Invent 2024: A Week of Innovation and Excitement

AWS re:Invent 2024 brought together a record-breaking 80,000 attendees in Las Vegas to explore the latest innovations in cloud computing, from generative AI to sustainability. In this post, Justin Ryburn shares his key takeaways from the event, highlighting AWS’s vision for the future and the vibrant energy of the conference.

December product updates

As 2024 comes to a close, we want to express our heartfelt gratitude for your support these last 12 months. It’s been a transformative year for StatusGator — with scores of new features shipped, an ever-maturing product, and a huge milestone crossed: 10 years of service. Below we’ll share a roundup of the December updates. These enhancements are designed to make StatusGator even more powerful and user-friendly. We couldn’t have done it without your feedback, so keep it coming.

Lumigo Copilot Beta Demo | AI-Powered Observability in Action

Discover how Lumigo Copilot transforms troubleshooting and observability with the power of AI. In this demo, we’ll showcase Lumigo Copilot in action. Whether you're a senior developer or just starting out, Lumigo Copilot makes debugging smarter, faster, and more intuitive. Try Lumigo Copilot today: lumigo.io. Subscribe for more product demos, tips, and insights on modern observability.

Splunk AppDynamics 24.10 Accelerates Deployment And MTTR

Splunk AppDynamics, now part of the Splunk Observability portfolio, provides critical observability for traditional 3-tier/n-tier applications and helps IT Operations teams quickly discover root causes of issues before end-users even notice. AppDynamics complements Splunk Observability Cloud, which is optimized for observing cloud-native applications by DevOps and engineering teams.

Experience the Future of Troubleshooting with Lumigo Copilot Beta

Troubleshooting complex cloud environments just got a whole lot easier. With Lumigo Copilot Beta, we’re redefining how developers identify and resolve issues in their production environments. We’ve captured it all in an exclusive video demo, showing you exactly how this cutting-edge tool empowers developers to stay in control.

Gartner IT Infrastructure, Operations & Cloud Strategies Conference recap: Reshaping enterprise observability with Next-Gen AIOps

For IT teams, the signal-to-noise ratio isn’t just a technical inconvenience—it’s the tipping point between operational success and systemic failure in today’s modern enterprises. At the Gartner IT Infrastructure, Operations & Cloud Strategies (IOCS) Conference 2024, this critical issue took center stage.

Enterprise-Grade Support in IT Monitoring: Why Organizations Choose ScienceLogic

In today’s complex IT environments, organizations face increasing pressure to maintain visibility across their infrastructure while keeping costs under control. While monitoring solutions built primarily on open-source components can seem attractive, enterprises must carefully consider how these components are supported, maintained, and secured to ensure they meet enterprise requirements.

Decoding devices with DHCP fingerprinting for smart IP address assignment

In today’s dynamic network environments, where countless devices—ranging from laptops and smartphones to IoT sensors and smart appliances—connect and communicate, efficient IP address management is critical. Ensuring each device receives the right configuration not only optimizes network performance but also improves visibility and control. However, identifying these devices accurately can be challenging, given the diversity of operating systems, hardware, and vendors.

Your Guide To Datadog Cost Optimization: 7 Tips For Reducing Spend

As cloud systems become increasingly sophisticated, you want a cloud monitoring platform that helps you identify, isolate, and fix root-cause issues. Meanwhile, engineering leaders are under increasing pressure to reduce technology costs as the global economic outlook remains uncertain. With Datadog, you can observe, monitor, analyze, and report on the health of your infrastructure, applications, and services in any cloud and at scale.

How to support a growing Kubernetes cluster with a small etcd

Etcd plays a critical role in your Kubernetes setup: it stores the ever-changing state of your cluster and its objects, and the API server uses this data to manage cluster resources. As your applications thrive and your Kubernetes clusters see more traffic, etcd handles an increasing amount of data. But etcd’s storage space is limited: the recommended maximum is 8 GiB, and a large and dynamic cluster can easily generate enough data to reach that limit.

Monitor your Pinecone vector databases with Datadog

Pinecone is a vector database that helps users build and deploy generative AI applications at scale. Whether using its serverless architecture or a hosted model, Pinecone allows users to store, search, and retrieve the most meaningful information from their company data with each query, sending only the necessary context to Large Language Models (LLMs). By providing the ability to search and retrieve contextual data, Pinecone enables you to reduce LLM hallucinations and enhance data security.

AWS re:Invent '24: Generative AI Observability, Platform Engineering, and 99.9995% Availability

I attended the Amazon Web Services re:Invent conference. This is AWS's annual user conference, which takes over most of Las Vegas for a week. There’s a lot to do and take in—customer stories galore, new tech, learning different use cases, and all the walking. But you’re here to hear what I learned, so I’ve broken it down into sections. Enjoy!

Comparing Azure NSG and VNet Flow Logs

Azure VNet flow logs significantly improve network observability in Azure. Compared to NSG flow logs, VNet flow logs provide broader traffic visibility, enhanced encryption status monitoring, and simplified logging at the virtual network level, enabling advanced traffic analysis and a more comprehensive solution for modern cloud network management.

Build user trust and foster transparency with StatusIQ's 2024 enhancements

2024 has been a transformative year for StatusIQ, marked by continuous innovation to meet the growing needs of our users. This year’s enhancements aim to simplify incident communication, improve team efficiency, and, most importantly, build trust with your users. From providing deeper insights into status history to enabling seamless platform migrations, these updates reflect our unwavering commitment to empowering teams and stakeholders.

The Contradictions of VDI: Is Your Investment Delivering the Desired Performance?

The promise of Virtualized Desktop Infrastructure (VDI) is deeply compelling - by virtualizing desktops, businesses can simultaneously give employees more flexibility and improve their productivity, while also reducing costs. However, despite significant VDI investment, too many organizations are still unable to realize these promised benefits.

Best practices for cloud-based network monitoring

When cloud adoption grew rapidly in the early 2010s, businesses started facing new challenges. Managing distributed systems, monitoring cloud-hosted applications, and ensuring network performance across global infrastructures became more complex. This shift in how businesses run IT operations creates a clear need for cloud-based network monitoring tools that can give you real-time insights into performance, security, and overall system health.

Reducing Downtime: How Unified Observability Tracks Authentication Bottlenecks

Users demand a seamless and secure login experience. According to a 2023 report by Statista, 66% of users report ditching a website or application due to lag or authentication issues. Typical users expect login to be fast and secure, whether it relies on Single Sign-On (SSO) or Multi-Factor Authentication (MFA).
Sponsored Post

Enhanced Monitoring for Citrix Environments

Citrix has become a cornerstone of modern IT infrastructures, particularly as organizations adapt to remote and hybrid work models. By providing secure, scalable solutions for delivering virtualized applications and desktops, Citrix enables businesses to centralize their IT operations while ensuring employees can access the resources they need from any location. Key solutions, such as Citrix Virtual Apps and Desktops (VAD) and Citrix Application Delivery Controller (ADC), empower organizations to manage virtualized workloads efficiently, regardless of the underlying hardware or cloud environment.

Reflecting on 2024: A year of growth and innovation for CloudSpend

CloudSpend Wrapped 2024. As the holiday spirit fills the air, it’s time to look back at a truly transformative year for ManageEngine CloudSpend. With a suite of groundbreaking features, 2024 saw CloudSpend evolve into a powerful tool for managing multi-cloud costs effectively. Let’s revisit the key milestones and share some exciting glimpses of what lies ahead.

Obkio 2024 Year in Review

As we wrap up 2024, we’re proud to say it’s been our biggest year yet at Obkio. From expanding our team to releasing groundbreaking features, attending industry events, and welcoming customers from all over the world, this year has been a testament to our commitment to growth and innovation. Join us as we look back on all we’ve accomplished in 2024 – and get a sneak peek at what’s in store for an even more exciting 2025!

Optimizing E-commerce Application Performance (APM) During High-Traffic Holidays!

Everyone within an e-commerce business knows that the stakes are high during peak shopping events such as Black Friday, Christmas, and many other holiday seasons. Periods like these are make-or-break for your revenue. Application performance is the key to success. A slow e-commerce website, stalled transactions, or unanticipated downtime invariably leads to frustrated customers, lost sales, and long-term damage to a brand’s reputation.

Troubleshooting CORS Errors in Offsite API Calls

You may have wrestled with a web application attempting to call an offsite web service, such as an OpenTelemetry Collector, and gotten an odd error with the word CORS in it. Or maybe you got a generic thrown error from your fetch statement that states “Error: Failed to fetch” …and you wondered, “What’s the problem, and how can I fix it?” These kinds of errors are called CORS errors, and they can be a bit confusing.

Best practices for monitoring event-driven architectures

Microservices architectures empower individual teams to choose their own programming language, tools, and technologies, resulting in more independence and the ability to develop and release features faster. While there are various types of integration patterns that can facilitate microservice communication, many organizations choose to adopt event-driven architectures (EDAs) because of their scalability, agility, and resilience.

How Autonomic IT Helps Enterprises Meet the Demands of a Digital and Dynamic Business Landscape

Autonomic IT is the pinnacle of IT evolution. Inspired by the human autonomic nervous system, it refers to self-managing IT systems that autonomously monitor, optimize, and resolve issues. By integrating data, advanced AI and machine learning (ML), and automation, Autonomic IT enterprises can predict, prevent, and resolve IT issues more proactively, enhancing efficiency and reliability. However, Autonomic IT is more than just a framework for machines to fix themselves.

Critical Context: Adding Trace Quickview to Logz.io's Explore

Complexity rules the day within the world of data systems and pipelines. A goal for any observability practice is to help reduce complexity and give users and administrators a clear view of what’s happening in any system. This is the path to unified observability, a mature system where monitoring and troubleshooting are streamlined. This has been difficult to achieve for many organizations.

The evolving role of SREs: Balancing reliability, cost, and innovation

A look at the expanding roles of SREs and the new skills needed: cost management and AI. Imagine the CTO walks into your team meeting and drops a bombshell: "We need to cut our cloud costs by 30% this quarter." As the lead SRE, this might cause a strong reaction — isn’t your job about ensuring reliability? When did you become responsible for the company's cloud bill? If you've had a similar experience, you're not alone. The role of site reliability engineers (SREs) is evolving fast.

New improvement: Component filter tags for easier filtering

One of StatusGator’s most important cloud service monitoring features is component filtering. Many services have multiple components such as regions, products, or features and not every component may be relevant to you. Our new component filter tags help you quickly identify how many components of a service you’re currently monitoring. This makes it easier to ensure your notifications are focused on what matters most.

Observability to AIOps: Transforming Anomaly Detection for Modern Enterprises

As businesses increasingly digitize operations, IT systems are evolving into complex, distributed ecosystems. Applications run across multi-cloud environments, microservices power critical processes, and data flows in real time across countless touchpoints. While this transformation drives agility and scalability, it introduces significant challenges: hidden anomalies that can disrupt operations, frustrate users, and damage revenue.

Diving into .NET 9.0, Blazor, and Observability with Coralogix

So, there I was, a newbie to .NET 9.0, Blazor, and Coralogix, standing on the precipice of observability in a world of production bugs and development mysteries. As an Agile enthusiast, I’m well versed in all things “observability” and how it’s a game-changer for root cause analysis, especially in today’s rapid, iterative development cycles. Observability is like getting X-ray vision into your application to understand what’s truly happening based on system outputs.

Jitter vs Latency: Definitions and Differences for Better Network Performance

If you’ve ever experienced choppy audio or video calls, slow website loading, or laggy gaming sessions, chances are you’ve dealt with either latency or jitter issues – or possibly both. These problems plague networks both large and small, from Fortune 500 companies to neighborhood coffee shops offering free WiFi.

How does Amazon VPC work?

Amazon Virtual Private Cloud (VPC) is a commercial cloud computing service that enables users to create a logically isolated section within the AWS Cloud. Users can deploy AWS resources in a self-defined virtual network within this isolated section. In essence, it enables customers to build resources within a private, separated area of the AWS cloud, such as databases, Elastic Compute Cloud (EC2) instances, and other AWS services. AWS offers VPC to enterprises as a way to improve cloud security.

Engineering AI systems with Model Context Protocol

On November 26, 2024, Anthropic released the Model Context Protocol (MCP)—an open standard for data exchange between applications and data sources. MCP simplifies how Large Language Models (LLMs) interact with external tools and data, addressing the challenges developers face when integrating AI into their systems. At Raygun, we’ve been exploring agentic workflows to improve productivity and saw real potential in MCP.

What is a DNS zone transfer? And how does it simplify transferring zone files from primary to secondary servers?

A DNS zone transfer is the process of transferring DNS records and zone file data from the primary server to the secondary server. This updates the secondary server with the current records and zone files so that it can act as a backup during failover scenarios. By copying the primary server’s files to the secondary server, zone transfer keeps network services available when the primary server fails.
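The flow described above can be modeled with a small Python simulation. This is a conceptual sketch only: the classes are hypothetical, records are plain name-to-address strings, and the "copy only when the primary's serial number is higher" rule is an assumption added for illustration, loosely modeled on how zone serial numbers are used. Real servers perform transfers over the DNS protocol itself rather than by copying dictionaries.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Zone:
    serial: int  # version counter, bumped on every change (assumption)
    records: Dict[str, str] = field(default_factory=dict)


class PrimaryServer:
    def __init__(self, zone: Zone) -> None:
        self.zone = zone

    def full_transfer(self) -> Zone:
        # Hand out a copy of the whole zone, like a full zone transfer.
        return Zone(self.zone.serial, dict(self.zone.records))


class SecondaryServer:
    def __init__(self) -> None:
        self.zone = Zone(serial=0)

    def refresh(self, primary: PrimaryServer) -> bool:
        """Copy the zone if the primary has a newer version."""
        if primary.zone.serial > self.zone.serial:
            self.zone = primary.full_transfer()
            return True
        return False

    def resolve(self, name: str) -> Optional[str]:
        return self.zone.records.get(name)


primary = PrimaryServer(Zone(serial=1, records={"www.example.com": "192.0.2.10"}))
secondary = SecondaryServer()

secondary.refresh(primary)  # secondary now holds a copy of the zone
# If the primary becomes unreachable, the secondary can still answer.
print(secondary.resolve("www.example.com"))
```

The point of the copy is that the secondary answers from its own data, so queries keep resolving even while the primary is down.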

Monitoring your Express application using OpenTelemetry

Nodejs is a popular JavaScript runtime environment that executes JavaScript code outside of a web browser. Express is the most popular web framework that sits on top of Nodejs, adding functionality such as middleware and routing. You can monitor your Express application using OpenTelemetry and a tracing backend of your choice.

Monitoring your Nextjs application using OpenTelemetry

Nextjs is a production-ready React framework for building single-page web applications. It enables you to build fast and user-friendly static websites, as well as web applications using Reactjs. Using OpenTelemetry Nextjs libraries, you can set up end-to-end tracing for your Nextjs applications. Nextjs has its own monitoring feature, but it is limited to measuring metrics like Core Web Vitals and real-time analytics of the application.

Implementing OpenTelemetry in a Rust application for performance monitoring

OpenTelemetry can be used to trace Rust applications for performance issues and bugs. OpenTelemetry is an open-source project under the Cloud Native Computing Foundation (CNCF) that aims to standardize the generation and collection of telemetry data. Telemetry data includes logs, metrics, and traces. Rust is a multi-paradigm, general-purpose programming language designed for performance and safety, especially safe concurrency.

How StatusIQ helps executives make informed decisions

Good decisions help leaders keep operations smooth and grow their business. With the right data at the right time, executives can solve problems quickly and take advantage of new opportunities. This blog will explore how StatusIQ helps executives make informed decisions by providing real-time data on system health, optimizing resource allocation, enhancing communication, and improving customer trust.

Black box and white box monitoring, and why modern IT observability needs both

Monitoring is essential for enhancing the reliability, performance, and user experience of all software systems. IT operations can employ two key monitoring strategies to assess system health: black box and white box monitoring. This blog discusses both approaches and highlights how ManageEngine Site24x7, an AI-based IT observability platform, can assist organizations in adopting white box monitoring to improve IT operations.

Grafana LLM plugin updates: choose the LLM models and providers that work best for you

At Grafana Labs, our mission has always been to empower users with the tools they need to build their own observability solutions. Our big tent philosophy embodies this mission by allowing you to choose the tools and technologies that best suit your needs. In this post, we want to share an update to our LLM plugin that reflects this philosophy in action.

Taking Sentry's Rollback from Hack Week Project to Production

If you’re a developer that uses Sentry and you’re reading this in 2024, stop reading and head to rollback.sentry.io to get your very own Rollback! Just a few weeks ago, we released Sentry Rollback — our first ever year-in-review experience. Think Spotify Wrapped, but for recapping your year as a developer on Sentry.

re:Invent Recap Livestream: 2024

Did you miss this year’s re:Invent? Or maybe you were onsite but too busy deep diving on certifications, new products, and networking. Don’t worry—the Datadog team is streaming right to your home on December 17 to recap all of the highlights from the event. Join Andrew Krug from Datadog’s Technical Community along with a host of AWS guests to hear about exciting announcements from AWS re:Invent 2024, Datadog’s latest product launches, and a rundown of the best on-demand sessions that you’ll want to make sure to tune into.

This Month in Datadog: Monitor OpenAI costs, Kubernetes Active Remediation, IaC Security, and more

Datadog is constantly elevating the approach to cloud monitoring and security. This Month in Datadog updates you on our newest product features, announcements, resources, and events. To learn more about Datadog and start a free 14-day trial, visit Cloud Monitoring as a Service | Datadog. This month, we put the Spotlight on Datadog Cloud Cost Management for OpenAI.

Introducing the Inspire Theme: A Fresh Look for Your Status Pages

We’re excited to announce the launch of our new Status Page theme: Inspire. This update brings a modernized design, enhanced customization options, and a sleek dark mode to elevate the way you communicate with your customers. See a live demo of the Inspire theme in action here.

VictoriaMetrics helps IHI Terrasun Win Big in Vegas on $1.2B Clean Energy Project

We’re happy to announce that VictoriaMetrics, an open source time-series database and monitoring solution, and IHI Terrasun Solutions, a leading energy storage system integrator, have partnered on one of North America’s largest clean energy projects! The Gemini Solar + Storage project, which is carefully situated on less than 5,000 acres, is designed to provide clean energy for up to 10% of Nevada’s electricity needs during peak use times.

Complete Python Logging Guide: Best Practices & Implementation

Python's logging system provides powerful tools for application monitoring, debugging, and maintenance. This comprehensive guide covers everything from basic setup to advanced implementation strategies, helping you build robust logging solutions for your Python applications.
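For a sense of what a basic setup looks like, here is a minimal sketch using only the standard library's `logging` module. The helper name `build_logger`, the logger name `payments`, and the format string are illustrative choices, not recommendations taken from the guide itself.

```python
import logging
from typing import Optional, TextIO


def build_logger(name: str, stream: Optional[TextIO] = None,
                 level: int = logging.INFO) -> logging.Logger:
    """Create a named logger with one stream handler and a timestamped format."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Avoid duplicate output through the root logger's handlers.
    logger.propagate = False
    # Only attach a handler once, even if this function is called again.
    if not logger.handlers:
        handler = logging.StreamHandler(stream)  # defaults to stderr
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


log = build_logger("payments")
log.info("service started")
log.debug("this is hidden because the level is INFO")

try:
    1 / 0
except ZeroDivisionError:
    # logger.exception logs at ERROR level and appends the traceback.
    log.exception("calculation failed")
```

Using a named logger per module or component, rather than the root logger, keeps it possible to tune levels and handlers for one part of an application without affecting the rest.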

This Month in Datadog - December 2024

On the December episode of This Month in Datadog, Jeremy Garcia (VP of Technical Community and Open Source) covers Kubernetes Active Remediation, Datadog IaC Security, and a trio of new features for monitoring AWS resources. Later in the episode, Natasha Goel (Product Manager) spotlights Datadog Cloud Cost Management for OpenAI. Also featured is a short recap of Datadog at KubeCon North America and AWS re:Invent 2024.

Maximizing IIoT Impact with Open Data, AI, and Advanced Analytics: A Comprehensive Guide

This tech paper was created by IIoT World and InfluxDB. This post was originally published on IIoT World. The Industrial Internet of Things (IIoT) is revolutionizing industries like manufacturing, energy, and logistics by creating more intelligent, interconnected systems that elevate productivity and efficiency. With IIoT, machines, systems, and sensors are linked in real-time, streamlining industrial automation and making predictive maintenance a reality—all while reducing downtime and costs.

2024 Recap: Advancing Holistic Citrix Monitoring with SCOM and SquaredUp

As 2024 comes to a close, we at GripMatix are proud to reflect on an impactful year full of new solutions, updates, and enhancements for Citrix monitoring. With a focus on improving visibility, efficiency, and performance monitoring through Microsoft SCOM, we’ve delivered tools and features to address the evolving needs of Citrix administrators. Here’s a look back at what we achieved this year.

Optimizing Microsoft Teams Monitoring Insights

This whitepaper explores how proactively monitoring Microsoft Teams using the NiCE Active 365 Management Pack for Microsoft System Center Operations Manager enhances collaboration and communication in IT environments. The whitepaper highlights the importance of monitoring critical metrics, such as call quality, user activity, and service performance, to ensure seamless operations.

AutoCon2: The Network of the Future is Automated

AutoCon2 gathered network automation professionals in Denver for three days of insights, workshops, and discussions on the future of automation. Justin Ryburn, Kentik Field CTO, shares his key takeaways, highlights, and why this conference stands out as a top destination for the network automation community.

From downtime to uptime: The importance of server health monitoring for enterprises

What if your servers suddenly crashed during your biggest sales event of the year? For an enterprise IT manager, this isn't just a hypothetical—it's a disaster in the making. Server downtime doesn’t just cause frustration; it disrupts operations, erodes customer trust, damages your brand, and results in significant revenue loss. As businesses become increasingly reliant on complex IT infrastructures, ensuring server health is no longer optional—it’s essential.

MetricFire add-on: Show Sentry Errors in Annotations

The solution: We can use Sentry to track specific errors that occur on production and Hosted Graphite's Sentry webhook add-on to add annotations to our system performance graphs. This way, we can correlate when a specific error occurs with our system usage spikes. Sentry is an application that alerts you when an app gets an error. It can also alert you to specific mistakes so you can see when and where something broke.

AI Strategies for Software Engineering Career Growth

Space.com sums up the Big Bang as our universe starting “with an infinitely hot and dense single point that inflated and stretched—first at unimaginable speeds, and then at a more measurable rate to the still-expanding cosmos that we know today,” and that’s kind of how I like to think about November 2022 for junior developers.

What's New in DX NetOps 24.3

With today’s dynamic technology landscape, network teams face increasing demands for agility, resilience, and efficiency. Broadcom's latest release, DX NetOps 24.3, is designed to meet these challenges. With features that enhance visibility, streamline operations, and simplify maintenance, the new version equips organizations with the capabilities to stay ahead in managing modern, hybrid networks. Let’s review key innovations available in the latest version of DX NetOps.

Enhance Network Observability with SystemEDGE for DX NetOps

In the increasingly complex network infrastructure stack, achieving complete visibility across every layer is no longer optional—it’s necessary. Network operations teams seek solutions that offer seamless observability across diverse infrastructures while minimizing operational costs. Enter SystemEDGE, a robust monitoring tool designed to amplify observability within the DX NetOps ecosystem.

How Do You Conduct a Reputation Audit? A Step-by-Step Guide for Brands and Individuals

In the interconnected and globalized economy, the potential for spreading a personal or business brand is limitless. Unfortunately, this unprecedented reach for growing brands also comes with equally large opportunities for damage from competitors, unhappy customers or former staff, reviews, social media content, and more. This makes reputation monitoring an essential part of operations for anyone with a sizable or growing presence for their personal or business brand.

Turbo360: Beyond the Rebrand - A Year of Growth and Transformation

It’s the end of another year, and we’re sitting down and scrolling through memories that defined our 2024 journey at Turbo360. We weren’t just building a product; we were creating something that would change how businesses see cloud management. This isn’t just another corporate year-end recap – it’s a story of growth, and the incredible community that’s been riding alongside us.

DBaaS Explained

Managing databases is a real pain point for businesses. It is time-consuming and complicated, and it often distracts resources away from main operations. This is where database as a service (DBaaS) comes into play. DBaaS is revolutionizing the way organizations manage their data—tremendously faster and in a much more secure manner. It is a cloud-native solution that manages your database problems. You no longer have to build and maintain your own database infrastructure.

How to securely connect Grafana to Google BigQuery using Workload Identity Federation

Umesh Pawar is a Senior Cloud Engineer at Searce, and is also the co-organizer of the Grafana and Friends Delhi Group. Umesh has been focused on infrastructure and app modernization, as well as observability solutions including the Grafana LGTM Stack, for the past two years. With the Google BigQuery data source plugin for Grafana, you can easily query and visualize data from BigQuery directly in Grafana.

Capacity Management: Debugging Exceeded Rate Limits

Snuba, the primary storage and query service for event data that powers Sentry in production, has historically been doing rate limiting under the hood, making it hard to discover and increasing time to resolve customer support requests. This is not something you’d know the specifics of unless you were deep in the Snuba code. But as we triage support questions from customers, one issue tends to pop up: RateLimitExceeded. You got tired of not getting query results.

5 Types of Network Topology and How to Choose the Right One

Successful businesses operate like well-oiled machines. Maintaining an effective network is the key to achieving this smooth operation. After all, the right network delivers efficiency and scalability and protects against cybersecurity threats. That said, building a network that meets your unique business needs can be challenging. To do so, you need to pick the correct type of network topology. But what is network topology, and what options are out there? Let's dive in.

Cribl Stream: Up To 47x More Efficient vs OpenTelemetry Collector

Let me set the record straight before anyone accuses me of bias or not being an OpenTelemetry supporter. Cribl loves OpenTelemetry! We’ve written lots of blogs about it; we have vendor-specific OpenTelemetry Destinations (with more to come!), and we support automatic batch parsing for easier data manipulation and re-batching for network transport efficiency of logs, metrics, and traces.

12 Days of Christmas Updates | RECAP

Day 1 We've added a filter called “host” to Real User Monitoring that separates events by URL, letting you compare performance and user experience across your websites. Day 2 Now you can delete all source maps from previous builds with a single API call. Day 3 Our Azure DevOps integration now supports multiple projects, so you can keep your source code and work items in different projects!

How MSPs Can Leverage AI to Increase Efficiencies and Increase Margins

The Managed Service Provider (MSP) industry is highly competitive. The growing demand for IT management and support has led to a proliferation of MSPs, ranging from small to established providers. This saturation intensifies pressure on profit margins and heightens expectations for delivering faster, more efficient services. With many MSPs competing for business, companies must find ways to differentiate themselves to attract and retain clients. At ScienceLogic, we know that AI holds the key to success.

How to Maximize Customer Retention with Grafana Cloud SLOs | Demo | Alert Management | Incidents

In this video, Mark Covelo, a Solutions Engineer at Grafana Labs, demonstrates how Grafana Cloud SLO empowers organizations to improve customer retention by simplifying the adoption of Service Level Objectives (SLOs). SLOs provide critical insights into the customer experience, enabling teams to prioritize and resolve user-impacting issues effectively.

How to Control Observability Costs with Grafana Cloud | Demo | Adaptive Telemetry | Loki | Profiling

In this video, Grafana Labs demonstrates how Grafana Cloud addresses the challenges of rising observability costs faced by organizations worldwide. As observability costs grow and logging architectures become increasingly resource-intensive, teams are forced to make difficult decisions about coverage, often leaving critical blind spots. Grafana Cloud offers a cost-effective, end-to-end observability solution that eliminates the need for compromise on efficiency or performance.

NiCE Oracle Management Pack | 5 Minutes Explainer Video

This short video will give a quick overview of the main features of the NiCE Oracle Database Management Pack, such as Discovery, Monitors, Diagnostic Tasks, Performance and Logfile Monitoring, Reporting, and, of course, Security aspects of advanced Oracle database monitoring.

How To Decide Between Hosting Your Own Status Page Versus Using a Managed One

A status page forms a key part of your incident communication strategy. When it comes to setting up a status page, you have two options: We will examine the pros and cons of each option along these dimensions: For 1, if you choose a self-managed, open-source or custom solution, it's in your control. For a managed solution, you are limited by the provider's feature set. For 2, if you choose a self-managed solution, your team is responsible for the quality of the service.
Sponsored Post

The 2025 Observability Survey by ManageEngine

While centralized visibility, faster MTTR, and improved business continuity represent one side of the coin, the other side reveals unpredictable licensing costs, data management toil, and tool sprawl. To gain a holistic perspective of this observability landscape and help the IT community with ground-level insights, we’re surveying IT professionals—CIOs, CTOs, IT managers, IT administrators, software engineers, and more.

Best uptime monitoring tools in 2025 (28 analyzed, 5 top picks)

Getting that message from a customer — "Your site is down!" — feels like a punch to the gut. Manual checks and basic scripts leave too much to chance. When every minute offline costs you money and frustrated customers, you need reliable uptime monitoring tools. But the market offers dozens of options, which can make choosing the right one challenging. This guide cuts straight to what works.

How to Monitor Firewall Performance: Tackling Firewall Overload

Is your network running slower than usual? Do you notice strange delays or unexpected drops in performance? Do not rush to blame your Internet connection – it could be your firewall. Think of your firewall as a gate for your network. If the gate is stuck, overcrowded, or not working properly, everything behind it suffers. The result? Sluggish speeds, frustrated users, and potential security risks. This article is here to help.

Unlocking the Potential of Private Location Monitoring with Secure Vault

At Uptime.com, we’re committed to delivering innovative solutions that enhance the security and reliability of your website monitoring experience. That’s why we’re thrilled to announce a significant update to our Private Location Monitoring (PLM) solution: natively integrating with Uptime.com’s Secure Vault.

Citrix Troubleshooting Masterclass - Automating Citrix Health Checks

Citrix Health Checks in half the time and at a fraction of the typical cost! Conducting Citrix Health Checks is critical to keeping your environment optimized for the best user experience, but many IT teams face several hurdles to conducting health checks: Time: manual and time-consuming. Resources: not enough Citrix experts. Finances: consultants are expensive and require continued engagement.

HEAL AIOps and Chatbot Solve the Alert Flood Crisis

Every IT environment relies on multiple monitoring tools to ensure smooth and uninterrupted operations across various systems: network, databases, servers, applications, and more. These tools constantly scan for any performance anomalies to keep everything running smoothly. However, when there’s a spike in performance metrics (such as CPU usage, network traffic, or database activity), each of these monitoring tools triggers its own alert for what might be the same underlying issue.

Understanding gRPC: A Modern Approach to High-Performance APIs

With systems more interconnected than ever, the ability to communicate quickly and efficiently has become crucial today. This is where gRPC, an open-source framework by Google, comes in to transform the way APIs are designed and utilized. In this blog, we will explore what gRPC is, how it works, how it differs from existing protocols like REST, and the best practices for realizing its full potential.

Measure What Matters

Have you ever had an alert go off that you immediately ignore? It’s a nuisance alert—not actionable—but you keep it around just in case. Or maybe you’ve looked at a trace waterfall and wondered what exactly happened during a gap that just doesn’t drill down deep enough to explain what’s going on. Do you know the feeling where you have just enough information to monitor what’s going on in your systems, but not quite enough to put your mind at ease?

Top 10 Kubernetes Alternatives to Consider in 2025

Organizations exploring Kubernetes alternatives often face a critical decision when choosing the right container orchestration solution. While Kubernetes has established itself as the industry standard, companies are increasingly seeking alternatives that better align with their deployment needs, team expertise, and operational requirements. This comprehensive guide examines the top alternatives to Kubernetes, helping you make an informed decision for your 2025 container strategy.

2025 observability predictions and trends from Grafana Labs

From AI to eBPF, 2024 reshaped the observability landscape. As we peer into 2025, Grafana Labs’ experts predict another year of innovation that will redefine how teams understand and optimize their systems, from profiling to platform engineering. Their insights align with what the community is saying, according to early responses from our third annual Observability Survey. Do you agree or disagree with the trends our team believes will transform the world of observability next year?

From Gartner IOCS 2024 Conference: AI, Observability Data, and Telemetry Pipelines

Last week, I attended one of the last conferences of the year with team Mezmo: the Gartner IT Infrastructure, Operations & Cloud Strategies Conference in Las Vegas. Not surprisingly, there were over 20 sessions covering observability and how it is becoming increasingly critical in the new complex distributed computing environment. Of course, there were many sessions, including all the keynotes, that addressed the advent and impact of AI on IT operations and observability.

Using server-side caching to speed up your applications, save on infra costs, and deliver better UX

If you’ve ever been floored by a sub-100ms response time, you’ve likely got caching to thank. Caching is the unsung hero of performance, shaving precious milliseconds off your application’s response time by storing frequently accessed data, avoiding yet another round-trip request to the database or API. Let’s break down how caching works and explore a few common strategies.

How good is GitHub Copilot at generating Playwright code?

People keep asking us here at Checkly if and how AI can help create solid and maintainable Playwright tests. To answer all these questions, we started by looking at ChatGPT and Claude to conclude that AI tools have the potential to help with test generation but that "normal AI consumer tools" aren't code-focused enough. High-quality results require too complex prompts to be a maintainable solution.

The Next Generation of AI-Powered Observability

AI is changing our world, and its impact on observability is no different. This article discusses some of the components of a good observability platform, how AI is well-positioned to revolutionize observability, and how Lumigo Copilot Beta will provide substantial value to customers and partners.

Actual cost of IT downtime: A Guide

In our recent blog, we spoke in detail about how system availability trends changed in 2024. Observations over the years show that even public clouds and established data centers aren't safe from downtime threats. Now, let's look at the actual cost of IT infrastructure downtime to better understand the importance of comprehensive IT observability. Today, we will cover the allied costs that come along with downtime.

Your monitoring made smarter: Site24x7's 2024 updates

2024 has been a year of progress and possibilities at Site24x7. Our mission? To simplify IT monitoring and help you focus on running your business uninterrupted. Whether it’s cutting through alert noise, diving into performance trends, or customizing dashboards to match your workflow, we’ve rolled out features that deliver exactly what you need for smarter monitoring. Let’s explore how these updates have delivered a better monitoring experience this year.

Understanding Docker Networking

This series will guide you through the most crucial container networking concepts. You don't need to be a Docker expert to comprehend the ideas introduced here, though a basic understanding of networking, Docker, and Kubernetes is required. Docker is a tool designed to create, build, and run isolated environments inside containers. It's widely used to containerize applications to run inside lightweight containers.

23 DevOps Tools to Watch in 2025

Combining “development” and “operations,” DevOps stresses a team approach to the software development lifecycle (SDLC). Development and operations teams used to function separately, which led to inefficiencies and increased the possibility of deployment mistakes. DevOps bridges this gap by integrating techniques and tools that ensure faster and more consistent software delivery, enhance team collaboration, and simplify operations.

Error tracking: Challenges and best practices

For small- to mid-sized businesses (SMBs) and mid-market enterprises, ensuring application reliability is critical to maintain customer trust and business continuity. Error tracking, a key aspect of observability maturity, is a powerful tool to proactively identify and resolve application issues. Let's explore the challenges of error tracking and best practices to implement effective solutions, including how Site24x7 can simplify the process.

Common Oracle Cloud Infrastructure (OCI) monitoring challenges

Oracle Cloud Infrastructure (OCI) provides a robust, versatile platform for modern cloud deployments, catering to businesses with diverse needs, like multi-region scalability, high customization, and hybrid cloud integration. However, the complexity of its architecture and the sheer volume of data generated can present unique challenges in effectively monitoring it.

Overcoming Performance Issues: Real-World Solutions to Keep Your Graylog System Running Smoothly

Are you experiencing performance issues with your Graylog instance? Are late-night alerts and unexplained slowdowns keeping you up at night? You're not alone if you’re dealing with license limit violations without a clear cause. In this session, we’ll share our experiences with these common Graylog challenges and the practical solutions we’ve developed to overcome them.

New Microsoft ILogger integration with Raygun

That’s a wrap on Raygun’s 12 Days of Christmas 2024! Over the past two weeks, we’ve rolled out daily updates featuring bug fixes and feature improvements inspired by your feedback. These small but mighty changes are all about making Raygun faster, smoother, and easier to use. Thanks for helping us level up—your input makes all the difference. Our special thanks to Blair from New Zealand who suggested this great idea!

Managing Long-Running Queries in MySQL: Best Practices and Strategies

Long-running queries in MySQL can significantly impact the performance and availability of your database. They can consume server resources, lock tables, and block other queries, leading to cascading performance issues. In this blog, we will explore why long-running queries occur, how to detect them, and best practices for managing and optimizing them.

NiCE Oracle Management Pack 5.4 Release Webinar 2024Q4

Discover Smarter Oracle Monitoring with the NiCE Oracle Management Pack 5.4. We’re excited to invite you to our upcoming webinar showcasing the powerful new features of Oracle Management Pack 5.4. Learn how this latest release enhances your Oracle monitoring experience on Microsoft SCOM and Azure Monitor SCOM MI, delivering smarter, more flexible solutions for your environment.

Simplifying service selection: Descriptions for all services now available

At StatusGator, we’re always looking for ways to improve your experience. One common challenge we’ve noticed is that users sometimes struggle to distinguish between services with similar names or unrecognizable logos when setting up Service Monitors for the aggregated status page. To address this, we’ve introduced short service descriptions to help you quickly understand the purpose of each service. This enhancement is now available in the following areas of the platform.

PromQL vector matching: what it is and how it affects your Prometheus queries

Dawid Dębowski is a software engineer at G2A.COM and a Grafana Champion. Holding an MS in Computer Science, Dawid’s main fields of interest related to observability are PromQL and data visualizations using Grafana. Have you ever created an awesome query in PromQL, expecting it to return the exact results you’re looking for, only to receive the “No data” response when you run it? If so, you might have fallen into the trap of PromQL vector matching.

Grafana Labs: Top 10 moments of 2024

2024 was a year of making connections. The open source community gathered in person for GrafanaCON for the first time in five years — meeting in Amsterdam to celebrate Grafana 11, Loki 3.0, a new open source project (cue Grafana Alloy), and more. TailCtrl, an early-stage company that specializes in adaptive trace sampling, joined Grafana Labs to advance our Adaptive Telemetry story (welcome, founder Sean Porter!).

Full-Stack Observability with OpenTelemetry and DX Operational Observability

DX Operational Observability (DX O2) from Broadcom supports ingestion and retention of OpenTelemetry (OTel) data. Teams who have instrumented applications with OpenTelemetry SDKs and APIs can now ingest telemetry into DX O2 using the OpenTelemetry Collector, a core component of OpenTelemetry, and the OTel Collector Exporter, which is now available through early access in DX O2.

AI Log Analysis - Shaping the Future of Observability

As digital applications and infrastructures grow increasingly complex, managing and understanding log data has become vital to achieving practical observability, enabling organizations to detect, diagnose, and prevent issues across their systems. However, traditional log analysis methods often struggle with the volume and complexity of modern log data in cloud-native environments.

12 Ways We Sleighed Innovation This Year

As we wrap up an incredible year, it’s the perfect time to celebrate Cribl’s progress and innovation in 2024! This year brought many exciting features designed to solve real-world problems and make life easier for our customers. In the spirit of reflection and festivity, I’ll highlight twelve game-changing product features, releases, and enhancements, each a testament to listening, learning, and delivering value to you, our users.

What is API Monitoring? How It Works, Benefits, & Best Practices

API Monitoring is the process of continuously observing and testing APIs to ensure they perform as expected, maintain uptime, and deliver the desired functionality. This includes tracking metrics such as API availability, uptime, latency, and response times. Whether you’re dealing with a REST API, a web API, or a microservices architecture, it’s important to understand that monitoring is essential for detecting issues before they impact end-users.

Amazon Cognito outage: How StatusGator notified customers 30 minutes before Amazon did

On December 12, 2024, Amazon Cognito experienced a significant outage in the US-EAST-1 (N. Virginia) region, impacting authentication for numerous applications. This operational issue, caused by a configuration change deployment, led to widespread “TooManyRequestsException” errors for several hours. Many Amazon Cognito users were left scrambling to figure out why their application was down, why users couldn't authenticate, and how to get back up and running.

Balancing Standardization & Customization: Tailoring Security Monitoring to Your Unique Environment

So you’ve gone ahead and ingested every log you can think of and built a plethora of detections in line with frameworks and best practices. You may have even dabbled in custom alerts built from your own internal assessments and findings. Or maybe it’s the opposite; you’re still early in your journey toward security maturity or logging new or custom applications without much guidance. It can be hard to feel truly comfortable with your environment’s security in both situations. Standards are good but can be too noisy and restrictive in some places and too quiet or permissive in others.

Passwordless Authentication: Its Role in IT Service Management and Observability

Efficiency and security are critical to observability and IT service management (ITSM) in the digital era. Passwordless authentication is revolutionizing how businesses carry out these crucial functions by providing a seamless yet incredibly safe approach to access management. The integration of these technologies is essential for enhancing cybersecurity and streamlining processes in increasingly complex IT systems.

How to Identify GDPR Compliance Gaps to Protect Your Business

With the introduction of the General Data Protection Regulation (GDPR) in 2018, businesses across Europe and beyond have faced the complex task of ensuring compliance. The regulation was designed to provide individuals greater control over their personal data, thereby imposing stringent obligations on organizations that handle such data. Failing to comply can lead to hefty fines, reputational damage, and loss of customer trust. That's why businesses must proactively identify and close compliance gaps to mitigate these risks effectively. Let's dive into it.

How to design apps with Docker containers?

Do you want to streamline your app development process to make it efficient, scalable, and reliable? Building and deploying applications without the right tools quickly becomes complex and resource-intensive. Docker containers address these challenges by providing effective solutions. The 2023 Stack Overflow Developer Survey reveals that 42% of developers rely on Docker, demonstrating its pivotal role in modern workflows. Similarly, Datadog states that over 25% of organizations use Docker in production, which underscores its widespread adoption.

Top tips: Must-know holiday hacks for IT admins

Top tips is a weekly column where we highlight what’s trending in the tech world and list ways to explore these trends. This week, we explore ways in which IT admins can optimize the IT infrastructure during the holidays while leaving room for enjoyment. December is here, and the holiday spirit is in the air. While you prune your Christmas tree at the start of the holiday season, your IT infrastructure requires consistent pruning throughout the year by IT admins.

New option to reverse stack traces in Crash Reporting

This enhancement is part of Raygun’s 12 Days of Christmas 2024. Over the past few weeks, we’ve shared daily updates on bug fixes and feature improvements inspired by feedback from you, our customers. These are the small but impactful changes you’ve asked for, designed to make Raygun faster and easier to use. Merry Christmas and thanks for following along—we’re excited to keep enhancing the tools you rely on!

Our team's learnings from KubeCon: Use Exemplars, Configuring OTel, and OTTL cookbook

A few weeks ago, members of Mezmo were at KubeCon and attended several sessions. You can see a post with my recap and session highlights. Today, though, I’m going to discuss three sessions that my colleagues found interesting for our peers in Observability.

Make NetFlow Flow Without Breaking The Network

Ever wondered how many NetFlow exporters or edge routers you have configured on your core switches? What if I told you that every exporter uses ~0.2% bandwidth in overhead? While that may not seem like much (and it has been a few years since most network engineers were worried about CPU overhead for NetFlow exports), older hardware and network OS versions may be more sensitive to having multiple flow exporters configured.

Scaling Observability on a Budget with Cribl for State, Local, and Education

Over the past year, I’ve noticed some interesting trends in my work with state and local governments. Across my conversations with organizations in this space, there’s a common thread: teams are getting creative about maximizing their limited resources. With budgets either flat or shrinking and operational demands increasing, these teams face tough choices. They’re being asked to maintain or improve services while working with the same, or in some cases, fewer resources than before.

Indicators of Compromise (IoCs): An Introductory Guide

To confirm cyberattack occurrences and build or enhance cyber-defense strategies, threat intelligence teams use a lot of information, including Indicators of Compromise (IoCs). These IoCs are actually forensic data that are critical in: The relevance of IoCs cannot be downplayed, but they're not all that’s needed in building an effective cybersecurity strategy. In this article, we’ll explore indicators of compromise, their types, and their relevance to threat intelligence teams.

Introduction to the OpenTelemetry Sum Connector

When you have a piece of data tucked into your logs or span tags, how do you dig for that bounty of insight today? Commonly this sort of data will be numeric, like a purchase total or number of units. Wouldn’t it be nice to easily turn that data into a metric timeseries? The Sum Connector in OpenTelemetry does just that, allowing you to create sums from attributes attached to logs, spans, span events, and even data points!

What Is Cloud Infrastructure?

We all know that testing new ideas on physical IT infrastructure requires a massive upfront cost. That's why businesses adopt cloud infrastructure setups. These setups offer on-demand resources, which allow you to start new projects and pay for only what you use. This eliminates the need for expensive hardware and maintenance, enabling flexibility that organizations require.

AWS EKS: Architecture and Monitoring

Amazon Elastic Kubernetes Service (EKS) is a managed service ideal for large clusters of nodes running heavy and variable workloads. Because of how account permissions work in AWS, EKS's architecture is unusual and creates slight differences in your monitoring strategy. Overall, it's still the same Kubernetes you know and love.

AWS microservices overview

With the nearly unmatched reliability and scalability offered by the 12-factor application design pattern, microservice-based designs have become a fundamental architectural pattern for modern applications. A whole industry of cloud providers has sprung up to offer management of the sophisticated middleware and infrastructure services that make this possible. Amazon Web Services (AWS) is among the largest of them.

Monitoring Security Vulnerabilities in Your Cloud Vendors

If you manage applications running on cloud platforms, you likely depend on multiple cloud vendors and services. These could be infrastructure providers like AWS, GCP or Azure. A vulnerability in any of these services could potentially impact your applications and your users. A cloud platform has many moving parts, many of which are dependent on other third-party providers.

Auvik Wrapped 2024

It’s a wrap! Auvik Wrapped is here to unravel all the amazing things we accomplished together in 2024. From keeping networks smooth to celebrating every win, we couldn’t have done it without YOU—our incredible customers and partners. This video is our highlight reel, our “thank-you note,” and a reminder that the magic happens when we connect. So grab some popcorn (and maybe your favorite IT pun), and let’s take a look back at what we built, fixed, and optimized together.

Reflecting on 2024: Advancing monitoring solutions for DevOps

As we approach the close of 2024, it's the perfect time to reflect on the remarkable progress and innovation at Site24x7. Our commitment to empowering businesses with robust monitoring solutions for DevOps and IT operations has led to significant advancements in application performance monitoring (APM), logs, databases, and plugins. Here's a recap of the year's milestones showcasing how they enhance your IT operations.

Analyze This! Notes from the Gartner IOCS Conference

There was a lot going on at November’s Gartner Infrastructure, Operations and Cloud Strategies Conference, but the central theme was inevitably the transformational impact of the AI revolution. IOCS is a major event bringing together leading vendors and thousands of practitioners for a mix of vendor-led sessions, expert presentations, keynotes, roundtables and one-to-one consultations.

Meta's meltdown: How we knew before they did (And you could, too!)

On December 11, 2024, millions of users around the globe experienced disruptions across Meta’s core platforms: Facebook, Instagram, and WhatsApp. Reports of connectivity issues and outages began flooding social media and third-party monitoring platforms as users scrambled to understand what was happening. While Meta issued a statement later in the evening attributing the outage to unspecified “technical issues,” the delayed acknowledgment left countless businesses and users in the dark.

Datadog Database Monitoring: Improve Database and Application Performance

Datadog Database Monitoring unifies query, application, and database telemetry in one platform, enabling teams to easily identify bottlenecks, understand database load, optimize query performance, uncover costly queries, and correlate database and application telemetry.

The Journey to Autonomic IT: How AI Advisors, not AI Assistants, Can Get You There

Today’s IT teams face unprecedented challenges as they manage increasingly complex hybrid and multi-cloud environments and vast amounts of data. The pressure to maintain uptime, optimize performance, and ensure security – all while balancing limited resources – has become a daunting task for even the most seasoned professionals. So how can these organizations stay ahead of the curve?

How MSPs can reduce MTTR and cloud costs with AI-powered observability

The scene is familiar to any IT operations professional: the dreaded 3 AM call, multiple monitoring tools showing conflicting status indicators, and teams pointing fingers instead of solving problems. For managed service providers (MSPs) supporting hundreds or thousands of customers, this challenge multiplies exponentially. But at AWS re:Invent 2024, Synoptek’s team revealed how they’ve fundamentally transformed this reality for their 1,200+ customer base through AI-powered observability.

New API endpoint to add comments to error groups

This enhancement is part of Raygun’s 12 Days of Christmas 2024. Over the next few weeks, we’ll share daily updates on bug fixes and feature improvements inspired by feedback from you, our customers. These are the small but impactful changes you’ve asked for, designed to make Raygun faster and easier to use. Check back tomorrow for the next update and see how we’re leveling up your experience one day at a time! Our special thanks to Gwilym from the U.K.

How your favorite apps use protocols: A look at real-world scenarios

Ever wondered how computers and servers talk to each other without descending into chaos? It’s all thanks to network protocols—the unsung heroes of the digital world. These nifty little rules tell devices how to format, send, and receive data, ensuring that even the most mismatched tech can have a civil conversation. This blog will explore different network protocols and how popular apps use them to ensure smooth performance and secure communication.

Secure AIOps: Do I really need the SAP transport? - Q&A

During customer implementations, we are often asked about the need for Avantra SAP transports. This post addresses many of the common questions we receive and explains the rationale behind our platform design decisions. Our choices are guided by security-first principles, SAP best practices, and over 23 years of accumulated expertise. These principles ensure a reliable, secure, and effective solution for our customers.

Network compliance and automation, IPAM, Cisco ACI monitoring, and more - key achievements in network monitoring: 2024

At Site24x7, we’ve always been about simplifying the complex and empowering IT teams to do more with less. This year was no different; we rolled out a host of new features and enhancements designed to help you manage your networks with greater confidence and ease. From streamlining compliance processes to enhancing visibility and automation, 2024 has been all about addressing your most pressing network challenges.

Why website monitoring is essential for building digital trust

Your website: it's where your customers connect with you. It's the digital embodiment of your brand, the 24/7 ambassador communicating your value and building crucial relationships. But what if that vital communication channel breaks down? Slowdowns, outages, and especially security breaches can instantly erode customer trust, inflicting lasting damage on your reputation and revenue.

Observability in the Age of AI

This post was written by Charity Majors and Phillip Carter. In May of 2023, we released the Honeycomb Query Assistant, an LLM-backed feature that lets engineers use natural language to generate and execute queries against their telemetry data. Instead of having to master a domain-specific query language, you can simply type in things like “slow endpoints by status code” and the Query Assistant will generate a relevant Honeycomb query for you to iterate on.

MongoDB vs. MySQL

Choosing a database is no easy feat. You must consider your organization’s current requirements and anticipate its future needs. Also, there are plenty of databases to choose from, and each type has its pros, cons, and use cases. To help you decide, we’re diving into a MongoDB vs. MySQL comparison in this article. We’ll review each database to help you decide which one might meet your needs.

Open source at Grafana Labs: 2024 year in review

Open source has always been the bedrock for everything we build here at Grafana Labs, going all the way back to Grafana creator Torkel Ödegaard’s first commit in December 2013. Ten years after Grafana Labs was founded, open source continued to be our driving force as we worked to develop and evolve our core OSS tools and technologies in 2024.

Honeybadger and ilert: smart incident response

We're thrilled to announce a native integration with ilert, combining Honeybadger's full-stack application monitoring with ilert's real-time alert routing and on-call management platform. ilert handles alert routing, escalations, and on-call scheduling, ensuring critical issues always reach the right person at the right time.

From Dev to Prod: Debugging in Next.js

Debugging. It’s a critical skill for all developers. And when you’re building a dynamic, high-performance application with Next.js, Chrome DevTools, and console.log() aren’t always enough. There are more effective and structured ways to debug Next.js apps as they scale. You will also find practical tips from our Next.js debugging workshop sprinkled throughout. Also, while this guide is focused primarily on Next.js, there is a similar guide for debugging React apps here.

Opslogix explores: How to bridge the gap between SCOM and Grafana with a SCOM Prometheus Exporter

As an observability architect, I have seen firsthand the power and importance of a robust monitoring solution. For infrastructure monitoring, System Center Operations Manager (SCOM) stands tall. It is widely adopted and excels at monitoring the health and performance of infrastructure. However, as the need for advanced observability grows, such as tracking application logs and tracing code paths, SCOM's capabilities can fall short.

Cloud Status Third-Party Monitoring Gets Upgraded!

At Uptime.com, we’re committed to helping you monitor and manage the uptime and reliability of your websites and critical infrastructure. Based on your feedback, we’ve enhanced Cloud Status to deliver even more powerful insights into third-party dependencies and improve your experience. Here’s what’s new and what’s coming next!

The Year in Internet Analysis: 2024

Join Doug Madory, Kentik's Director of Internet Analysis, for an in-depth look at "The Year in Internet Analysis: 2024." This webinar replay explores key developments in BGP security, RPKI ROV adoption, and the evolving landscape of routing security. Discover insights into major submarine cable incidents, including their impacts and recovery, as well as an overview of Kentik's new Cloud Latency Map tool. Doug shares his expert perspectives on Internet trends, resilience, and what lies ahead in 2025.

Troubleshooting SD-WAN with Kentik Journeys AI

Discover how Kentik Journeys simplifies SD-WAN troubleshooting with the power of AI. In this video, we walk through identifying and resolving a network issue impacting a business application using a Postgres database. See how Kentik's conversational interface streamlines iterative network analysis, offering real-time insights into traffic patterns, device metrics, and routing behaviors. Learn how Kentik Journeys empowers teams to diagnose root causes quickly and collaborate effectively.

What is Performance Engineering?

Performance engineering transforms how organizations build and optimize software systems. System delays and performance issues directly impact revenue, user satisfaction, and business success. This guide covers performance engineering fundamentals, implementation approaches, and advanced strategies for building high-performing systems.

New private status ingestion integrations: Meraki, Neat Pulse, AT&T

Managing the reliability and uptime of critical services is a cornerstone of smooth business operations. While public cloud status pages provide general updates, they often fall short in reflecting the true status of your specific hosted tenants. Enter Private status ingestion, a powerful feature available exclusively on our Enterprise plan.

Increase visibility into network incidents using moovingon.ai and Datadog

moovingon.ai is a platform that consolidates alerts, incidents, audits, runbooks, and other resources for 24/7 network operations center (NOC) engineering teams. These teams often have to work collaboratively to maintain uptime for mission-critical cloud infrastructure and applications and need specialized resources to facilitate investigations in the event of an issue.

How to Protect Your Security Cameras From a Cyberattack

Security cameras are a crucial part of keeping homes and businesses safe. They offer peace of mind, capturing everything from mundane moments to critical security events. But here's the thing: these cameras, especially when connected to the internet, can be vulnerable to cyberattacks. Hackers love a good weak spot, and unfortunately, poorly secured cameras often fit the bill.

New copy stack trace button for Crash Reporting

This enhancement is part of Raygun’s 12 Days of Christmas 2024. Over the next few weeks, we’ll share daily updates on bug fixes and feature improvements inspired by feedback from you, our customers. These are the small but impactful changes you’ve asked for, designed to make Raygun faster and easier to use. Check back tomorrow for the next update and see how we’re leveling up your experience one day at a time! Our special thanks to Peter from the U.K. who suggested this great idea!

Cross-browser testing: Best practices for a seamless user experience

A wide variety of browsers are available in the market, with their usage varying significantly by device and region. Ensuring that your web applications work seamlessly across different browsers, devices, and versions is essential for providing a consistent and reliable user experience. This is where cross-browser testing comes in.

The quest for the four nines: Achieving 99.99% uptime with advanced website monitoring

In an age of instant access, your website is crucial for meeting customer expectations. Downtime, however fleeting, translates directly into lost revenue, irreparable reputational damage, and the erosion of hard-earned customer trust. For enterprises, achieving near-perfect uptime, the coveted "four nines" (99.99% availability), is no longer a luxury; it's a business imperative. This translates to a maximum permissible downtime of just 52 minutes and 36 seconds per year.
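The 52 minutes and 36 seconds figure follows from simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes an average year of 365.25 days, which is the convention that reproduces the figure quoted above.

```python
def downtime_budget_seconds(availability_pct: float, days_per_year: float = 365.25) -> float:
    """Return the allowed downtime per year, in seconds, for an availability target."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * days_per_year * 24 * 60 * 60


def format_budget(seconds: float) -> str:
    """Format a number of seconds as whole minutes and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} min {secs} s"


# Four nines: 0.01% of a 365.25-day year.
print(format_budget(downtime_budget_seconds(99.99)))  # 52 min 36 s
```

With a 365-day year the same calculation gives a slightly smaller budget, so it is worth stating which convention an SLA uses.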

Analyzing user behavior to optimize user experience

Every interaction a user has with your application, whether positive or negative, directly impacts your business outcomes. But how can you uncover what truly shapes these interactions? The answer lies in analyzing user behavior to identify opportunities to optimize the user experience and to improve performance where it affects users.

How to Monitor Your App's Performance with .NET Benchmarking

Benchmarking is essential in application development, especially if you aim to scale up your app. Benchmarking enables you to evaluate your application's resource consumption, which helps you identify potential updates to speed up performance. Even if you are not scaling, you still need optimal application performance to enhance the user experience and reduce memory and processing costs.

Cut Azure costs with AI-powered recommendations from LogicMonitor Cost Optimization

Managing Azure costs while ensuring performance and scalability can be a complex, resource-intensive process, especially when cost management tools are pieced together and used separately from monitoring solutions. To address this challenge, LogicMonitor is excited to announce cost-savings recommendations for Microsoft Azure compute and storage resources, simplifying cloud cost management with integrated insights.

Smarter email notifications: Better control and helpful tips

We’re pleased to introduce a meaningful update to our email notifications, designed to make managing alerts simpler and more intuitive. These changes help you stay focused on the notifications that matter most while giving you more control over how you monitor services.

Complete Guide to Azure IoT Hub: Pricing, Features & More

Azure IoT Hub enables you to monitor on-prem devices down to the smallest temperature change and react accordingly from cloud device commands. But how does it work? What other features does Azure IoT have? And, most importantly, how much does it cost? Learn all you need to know about Azure IoT Hub from our expert guide.

What Is Full Stack Observability? Best Observability Solutions

Full stack observability (FSO) includes the ability to measure and monitor all layers of business infrastructure, security, and applications, from the underlying hardware and network performance to the user-facing software. As businesses shift from traditional, monolithic systems to more complex environments involving on-premises (on-prem) and cloud infrastructure, there comes a critical need for holistic observability.

Hybrid Cloud Architecture Explained

As organizations transform and modernize their digital operations, the choice of infrastructure isn’t always clean cut. Hybrid cloud architecture is an increasingly popular approach to IT infrastructure, allowing organizations to take advantage of the best features of cloud and on-premises solutions. This approach enables enterprises to optimize workload placement for performance and cost, match applications to ideal environments, and strategically distribute critical assets.

12 Days of Christmas Updates | Day Ten: New copy stack trace button

New copy stack trace button for Crash Reporting. Copy your stack trace with a single click, no manual highlighting needed. If you're a developer you know quick wins for customers aren't always quick builds. Improving the developer experience is always top of mind for our team so we think this is an awesome win! Happy coding from the team at Raygun.

Grafana Cloud in 2024: Year in review

Throughout 2024, we made a ton of updates to Grafana Cloud, our fully managed, cloud-hosted observability platform powered by the Grafana LGTM (Loki for logs, Grafana for visualization, Tempo for traces, Mimir for metrics) Stack. And, looking back, most of those updates were made with the same three goals in mind: to make Grafana Cloud more efficient, more intelligent, and easier to use, including for those just starting out on their observability journey.

Building the Sentry Unreal Engine SDK with GitHub Actions

Ensuring a seamless player experience is critical for game developers, and yet unanticipated crashes and performance issues continue to harm games’ reputations and disrupt player engagement. To address this, developers need proactive error monitoring across multiple platforms. Luckily, Sentry offers a robust SDK designed specifically for Unreal Engine to help developers debug and maintain performance effectively.

Elastic vs Sumo Logic: Build vs buy the right logging platform

When it comes to logging tools, organizations often face a classic tech dilemma: build vs. buy. Should you invest in a robust, ready-to-use SaaS solution like Sumo Logic or dive into the customization rabbit hole with a PaaS option like Elastic? It's a debate as old as time—well, as old as software, anyway. Let's break it down in a way that actually makes sense, and hopefully, it’ll spark less drama than the pineapple-on-pizza debate.

Break down barriers to log collection with Sumo Logic's Universal Connector

Today’s dynamic multi-cloud ecosystems receive logs from countless sources. Relying on custom collectors and integrations can lead to tool sprawl, pipeline breakdowns, and time-consuming maintenance. Enter Sumo Logic’s Universal Connector, your streamlined solution for collecting logs from any source. With seamless API integrations, Universal Connector simplifies log collection and eliminates the overhead of building custom pipelines.

SSL Monitoring, Trust, and McLOVIN

The recent ServiceNow Secure Sockets Layer (SSL) certificate error disrupted operations for hundreds of organizations causing widespread connectivity failures. IT operations stalled, developers hit roadblocks, and businesses across industries felt the impact. The culprit? An expired SSL certificate. While these disruptions highlight the importance of SSL monitoring, they point to a deeper issue: trust.

Christmas Holiday Website Monitoring to Ensure Peak Performance

The holiday season is a magical time of year, but for businesses, it’s also one of the busiest. As Christmas approaches, online traffic surges with shoppers hunting for gifts, deals, and last-minute purchases. While the increase in visitors could be great for sales, it also puts immense pressure on your website.

Shopify Status in 2024: Unveiling Patterns, Trends, and How to Stay Ahead

Note: The data presented in this analysis is based on information collected from January to November 2024 and may contain errors or omissions. As developers and organizations rely heavily on Shopify for managing online stores, understanding the platform’s reliability is essential. Monitoring the Shopify status page is crucial to staying informed about any disruptions.

New URL tester for Real User Monitoring's path segment rules

This enhancement is part of Raygun’s 12 Days of Christmas 2024. Over the next few weeks, we’ll share daily updates on bug fixes and feature improvements inspired by feedback from you, our customers. These are the small but impactful changes you’ve asked for, designed to make Raygun faster and easier to use. Check back tomorrow for the next update and see how we’re leveling up your experience one day at a time!

The ultimate APM playbook: Master challenges and implement best practices

IT organizations are adopting advanced technologies to keep pace with emerging business opportunities and trends. These technologies complicate the infrastructure and make it difficult for administrators to understand the underlying operations and transactions. Many of them struggle to leverage APM efficiently due to challenges like partial visibility, alert noise, scalability, delays in escalations, and much more.

What is O11y? Guide to Modern Observability

Distributed architectures with microservices, cloud-native components, and service meshes make traditional monitoring methods inadequate for system analysis. O11y (observability) implements advanced telemetry frameworks for deep system introspection through metrics, traces, and logs collection. This programmatic approach enables real-time debugging, performance optimization, and architectural decisions across distributed environments.

Logrotate: Choosing Between Size-Based and Time-Based Log Rotation

Managing log files effectively is crucial for ensuring a well-performing, reliable system. Logrotate, a popular log management tool, provides a flexible way to automatically rotate, compress, and remove old logs. Among its many configurations, two common approaches to trigger log rotation are size-based and time-based rotation. In this blog, we will explore the differences between these methods, compare their use cases, and help you decide which approach (or combination) suits your needs best.

SecOps Standardization Processor

Learn how to standardize data being routed to Google SecOps. About observIQ: observIQ brings clarity and control to our customers' existing observability chaos. How? Through an observability pipeline: a fast, powerful and intuitive orchestration engine built for the modern observability team. Our product is designed to help teams significantly reduce cost, simplify collection, and standardize their observability data.

Automate Configuration Policy Adherence to Boost Service Levels and Compliance

Ensuring continuous network connectivity keeps getting more critical—but costly outages keep happening. This post looks at a key culprit behind many network outages: network configuration errors. We outline the key requirements for streamlining and automating configuration management, and detail how DX NetOps can help.

Out of box Infrastructure Monitoring native to OpenTelemetry

Infrastructure monitoring module based on OpenTelemetry. This is our first release with the infra monitoring module. If you need any clarification or find something missing, feel free to raise a GitHub issue with the label documentation or reach out to us at the community Slack channel.

The Hidden Costs of Hybrid IT: How to Close the Observability Gap

Hybrid IT environments are more complex than ever, and 76% of organizations struggle with ongoing cloud operational management. Why? Because most monitoring tools force you to compromise—leaving critical gaps in your observability strategy. The consequences? Slow issue resolution, missed SLAs, and a damaged customer experience that hits your bottom line. SolarWinds is here to help. Learn how we’re laser-focused on closing the hybrid observability gap and empowering you to maximize performance, minimize downtime, and protect your future growth.

Missing indexes are slowing down your database - here's how to find and fix them with Sentry

Slow database queries drag down performance for both developers and users. They waste resources, slow down testing, and frustrate customers with laggy experiences. But often, there’s a surprisingly simple fix: indexing. Here’s how indexing works and when to use it, regardless of your schema.
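To make the effect of an index concrete, here is a minimal sketch using an in-memory SQLite database from Python's standard library. It is not taken from the Sentry article: the `orders` table, its columns, and the index name are hypothetical, and other databases report query plans differently.

```python
import sqlite3

# Hypothetical table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 1.5) for i in range(10_000)],
)

QUERY = "SELECT total FROM orders WHERE customer_id = ?"


def query_plan(connection, sql, params):
    """Return SQLite's query plan for a statement as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


# Without an index, SQLite has to scan every row of the table.
plan_before = query_plan(conn, QUERY, (42,))

# After indexing the filtered column, the planner can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")
plan_after = query_plan(conn, QUERY, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact plan wording varies between SQLite versions, but the first plan should describe a full table scan and the second a search that uses the new index.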

Understanding Develocity Build Data with Honeycomb

This post was written by David Chang, Staff Software Engineer at Pinterest, and originally posted on the Pinterest engineering blog on Medium. Develocity, formerly known as Gradle Enterprise, is a powerful tool that speeds up local and CI build time, helps troubleshoot your builds, and analyzes your data. At Pinterest, we have a dedicated team, Mobile Builds, and we ensure that developers can build fast and often. This enables developers to be more productive by getting faster feedback on their code.

Top 10 Website Monitoring Tools for 2024 (Free & Paid)

We all know that a website’s performance can make or break your business. Whether you’re running an e-commerce store, SaaS platform, or content-rich website, ensuring that your site is always up and running is non-negotiable. A single instance of downtime can result in lost revenue, decreased customer trust, and a tarnished reputation. Website monitoring tools are essential for proactively identifying issues, ensuring optimal performance, and providing a seamless experience for your users.

Getting the Most Out of Python with SolarWinds Loggly

An audit and error trail is one of the core pillars of a well-designed software application, regardless of the programming language used to build it. This trail typically comes in the form of logging. When your application produces useful, rich logs, you are better equipped to successfully maintain a production-grade system and troubleshoot any issues that might arise. When it comes to distributed Python applications, having correlated logs for each system is important for debugging.

Charting the course in multi-cloud monitoring: Key moments from 2024

'Why settle for partly cloudy?' That's why Site24x7 went all in on making multi-cloud monitoring smarter, faster, and a lot less stressful. IT teams faced the usual chaos—expired certificates, rogue servers, and surprise bottlenecks—but Site24x7 stepped in with a toolkit so sharp, it could cut through latency like butter.

Enhancing Alerts with AI: Leveraging Amazon Bedrock and LLM's for Graylog

In this talk, we’ll explore the cutting-edge work InfusionPoints has done to process and enrich alerts from Graylog using Amazon Bedrock and advanced Large Language Models (LLMs) from Amazon Titan and Anthropic. Discover how we’ve harnessed the power of AI to elevate the accuracy, relevance, and actionable insights of our security alerts, transforming how we respond to potential threats.

Reducing Risk by Prioritizing Use Case Development

The session is really about customers spending their resources wisely, prioritizing use case development based on blind spots, weaknesses, or maybe even just plain audit findings. We have all been guilty in the past of spending a lot of time building clever use cases just for them to never fire or not work out the way we’d hoped; this talk aims to highlight this issue and teach users to focus their resources and build a development strategy as they would for any other internal process.

About us - Sumo Logic

A log on its own is pretty simple, but they're rarely alone. Your digital applications, infrastructure and AI keep adding another, and another, and another… For some teams, this exponential data is overwhelming, causing friction, bottlenecks, and even tuning it all out. But at Sumo Logic, we’re FUELED by the atomic level of logs. The Sumo Logic Log Analytics Platform ingests each and every bit of this structured and unstructured “data exhaust,” transforming it into critical fuel for context-driven insights into your performance, availability, security status, and threats.

Is Your Telemetry Data Strategy Ready for the Next Decade?

What worked for the last 10 years won’t work for the next 10. IT and Security teams face three big challenges with telemetry data: Volume: Telemetry data is growing at a 28% CAGR, while budgets remain flat. Compliance requirements demand retaining massive datasets, straining both storage and costs. Variety: Logs, metrics, traces, configs—telemetry data comes in all shapes and sizes, making it difficult for traditional analytics tools to handle. Your tech needs to manage this complexity seamlessly.

Common Pitfalls to Avoid in Observability Practices

In modern IT systems, most businesses adopt new tools and technologies to stay ahead of competitors. These new technologies are resulting in the proliferation of distributed IT systems. For instance, some enterprises implement cloud computing, edge computing, or microservices architecture, contributing to complex distributed systems across organizations.

What is Network Flapping? Causes, Fixes, and Explanations

Network flapping is the rapid fluctuation of network routes or interfaces between an up (active) and down (inactive) state. This constant change disturbs the network’s stability by forcing routers and switches to repeatedly calculate the best paths for data transmission. Route flapping is a specific type of network flapping in which the route information advertised by routers changes frequently within a short period.
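One simple way to recognize flapping from monitoring samples is to count up/down transitions inside a sliding time window. The sketch below assumes exactly that: the `FlapDetector` class and its thresholds are hypothetical, and it is not how routers themselves handle flapping (routing protocols use their own mechanisms, such as BGP route flap damping).

```python
from collections import deque


class FlapDetector:
    """Flag an interface as flapping when it changes state too often in a window."""

    def __init__(self, window_seconds: float = 60.0, max_transitions: int = 5):
        self.window_seconds = window_seconds    # hypothetical window length
        self.max_transitions = max_transitions  # hypothetical threshold
        self._last_state = None
        self._transitions = deque()             # timestamps of state changes

    def observe(self, timestamp: float, is_up: bool) -> bool:
        """Record one state sample; return True if the interface is flapping."""
        if self._last_state is not None and is_up != self._last_state:
            self._transitions.append(timestamp)
        self._last_state = is_up
        # Forget transitions that have fallen out of the sliding window.
        while self._transitions and timestamp - self._transitions[0] > self.window_seconds:
            self._transitions.popleft()
        return len(self._transitions) >= self.max_transitions


detector = FlapDetector(window_seconds=60, max_transitions=5)
states = [True, False, True, False, True, False]  # state toggles every 5 seconds
for t, state in zip(range(0, 30, 5), states):
    print(t, "up" if state else "down", "flapping:", detector.observe(t, state))
```

A stable interface never accumulates transitions, so it is never flagged, and an interface that stops toggling drops out of the flapping state once the window has passed.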

Best Practices for Troubleshooting a Windows Server Upgrade

To upgrade, or not to upgrade. While that may not have been the question that Hamlet asked, it’s one you might be asking. You already made the mistake of asking Reddit, “should I do an in-place upgrade,” and, as expected, people had Big Opinions. A Windows Server Feature Update offers benefits, like performance and analytics. On the other hand, if you have problems, then your attempts can lead to business downtime and service disruption.

Highlights from AWS re:Invent 2024

Whether or not you made the journey to this year’s AWS re:Invent, there’s always a variety of great announcements lost amid an action-packed week of keynotes, breakouts, expo hall demos, and networking sessions. No need to worry—we’re always happy to be a big part of the re:Invent experience and share our observations with you. You can also join us on December 17, 2024, for a re:Invent re:Cap livestream by registering here.

Lessons from Building an AI Copilot

Artificial intelligence is reshaping industries at an unprecedented pace. AI has found its way into almost every vertical, from writing code to diagnosing illnesses, promising efficiency and innovation. The idea of an AI Copilot—a tool that acts as your assistant to tackle complex tasks—is particularly exciting. In our space, observability, the possibilities seemed endless. We asked ourselves how AI could simplify troubleshooting in microservices.

Raygun's Christmas Bug Bashapalooza

This enhancement is part of Raygun’s 12 Days of Christmas 2024. Over the next few weeks, we’ll share daily updates on bug fixes and feature improvements inspired by feedback from you, our customers. These are the small but impactful changes you’ve asked for, designed to make Raygun faster and easier to use. Check back tomorrow for the next update and see how we’re leveling up your experience one day at a time!

New NOT operators for Raygun Alerting filters

This enhancement is part of Raygun’s 12 Days of Christmas 2024. Over the next few weeks, we’ll share daily updates on bug fixes and feature improvements inspired by feedback from you, our customers. These are the small but impactful changes you’ve asked for, designed to make Raygun faster and easier to use. Check back tomorrow for the next update and see how we’re leveling up your experience one day at a time!

Summarizing SRE/Ops Podcasts Using an LLM

There are plenty of good SRE/Ops related podcasts out there. I follow a few of them and listen to episodes whose titles sound interesting. The problem with podcasts is that some episodes focus on one topic, and other episodes deal with a host of topics. In between there is filler and things that are not relevant to the topic but are necessary to carry on a conversation. Spending 30-60 minutes listening to podcasts is not always a great use of time.

The Hidden Costs of Ignoring Your Online Reputation

Your online reputation is more than just what people say about you on the internet. It shapes the way customers, partners, and even future employees see your business. Ignoring it can cost more than you think. Whether you're a small local shop or a multinational company, your online presence can make or break you. Think about it. When was the last time you made a purchase without looking at reviews or checking the company's reputation? Studies show that 91% of consumers read online reviews before making a decision. If they come across negative feedback, most will simply choose a competitor.

How To Effectively Manage Remote Operations

With most businesses now operating remotely, managers face a distinct set of challenges. Effective management of remote operations requires clear communication, the right technology, and a deep understanding of team dynamics. This article looks at some key strategies for successfully managing remote teams and ensuring operational efficiency, even when your workforce is spread across different locations.

New API endpoints for deployments

This enhancement is part of Raygun’s 12 Days of Christmas 2024. Over the next few weeks, we’ll share daily updates on bug fixes and feature improvements inspired by feedback from you, our customers. These are the small but impactful changes you’ve asked for, designed to make Raygun faster and easier to use. Check back tomorrow for the next update and see how we’re leveling up your experience one day at a time! Our special thanks to Andrew from the U.K.

How vmstorage Turns Raw Metrics into Organized History

vmstorage is the component in VictoriaMetrics that handles long-term storage of monitoring data. It receives data from vminsert, organizes the data into efficient storage structures, and manages how long data is kept. Before vminsert even sees the data, agents are out there collecting it. These agents gather metrics from different sources, hold onto the data briefly, and then send it over to vminsert in batches.
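
The "hold briefly, then send in batches" pattern described above can be sketched roughly as follows. This is a generic illustration only, not how VictoriaMetrics agents are actually implemented; the `BatchBuffer` class, its `max_size` parameter, and the `send` callback are hypothetical names chosen for the example.

```python
class BatchBuffer:
    """Hypothetical in-memory buffer that forwards samples in fixed-size batches."""

    def __init__(self, max_size, send):
        self.max_size = max_size  # number of samples that triggers a flush
        self.send = send          # callback that receives one batch (a list)
        self.pending = []

    def add(self, sample):
        # Hold the sample briefly; flush once the batch is full.
        self.pending.append(sample)
        if len(self.pending) >= self.max_size:
            self.flush()

    def flush(self):
        # Send whatever is pending as one batch, then start a new one.
        if self.pending:
            self.send(list(self.pending))
            self.pending.clear()


# Stand-in for the receiving side: collect batches in a list.
sent_batches = []
buffer = BatchBuffer(max_size=3, send=sent_batches.append)

for value in range(7):
    buffer.add(("cpu_usage", value))
buffer.flush()  # send the final partial batch

print([len(batch) for batch in sent_batches])
```

A real agent would typically also flush on a timer so that samples do not wait indefinitely when traffic is low; that is left out here to keep the sketch short.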

How InfluxData Enhances Performance and Reliability in the Aerospace Industry

The stakes are high in Aerospace manufacturing and operations. Aerospace systems are highly complex and require extremely precise engineering—every part of an aircraft or spacecraft must work together flawlessly, and error tolerance is minuscule. Ensuring that all components work perfectly under various conditions (pressure, temperature, vibration) is vital. The cost of building and operating aerospace systems is enormous.

Flowmon - AI-Powered Cybersecurity Platform

Today's primary cybersecurity challenge is event overload. With a flood of alerts coming from numerous systems, analysts struggle to prioritize and investigate effectively. This not only delays responses to genuine threats, but also leaves organizations more vulnerable. For Progress Flowmon, accuracy and rapid response are essential. Flowmon is an AI-driven network security analyst that works alongside your team, monitoring your network 24/7.

Power Up Your Alarms! Enriched UIM Alarms for Added Intelligence

An often-overlooked, powerful feature of DX UIM (Unified Infrastructure Management) is the Alarm Enrichment probe. Deployed on the Primary Hub as part of the standard installation, this feature has significant, often untapped potential to enhance the effectiveness of alarms generated by DX UIM.

The why and how of network availability monitoring

You might be familiar with the following scenario: You have a monitor displaying 20 open applications to oversee multiple networks or various aspects of your network infrastructure. Your inbox is steadily filling up with emails—many of which you can't seem to open and respond to in a timely manner. Outstanding tasks are accumulating, all due to an unexpected outage in a data center. If this resonates with you, it's likely that you are a network administrator or someone who works closely with them.

Top AWS monitoring best practices

AWS powers countless businesses with its vast services and unmatched scalability, but managing such a dynamic environment comes with challenges. Effective monitoring isn’t an option—it’s essential for ensuring performance, controlling costs, and maintaining compliance. Without a strategic approach, issues can escalate quickly, impacting customer experiences and business outcomes.

Leveraging AWS Private Image Build for a Compliant Cribl Deployment

In today’s data-driven world, ensuring the security and compliance of your data pipelines is paramount. Cribl Stream and Cribl Edge offer powerful telemetry data management and enrichment solutions. However, deploying these tools within your environment often requires careful consideration of security and compliance standards.

The Leading Synthetic Monitoring Tools

Synthetic monitoring has become a staple of accurate and effective performance testing, and that is only going to continue in the coming years, largely because of the advantages it offers organizations. With synthetic monitoring, your organization can identify performance issues before they affect real users. By continuously simulating user interactions, your team can highlight and rectify performance bottlenecks and infrastructure issues in real time.

Top 5 outages detected by StatusGator in November 2024

StatusGator continues to demonstrate its value by providing early warning alerts for service disruptions, often detecting issues before official acknowledgment. Below, we highlight key incidents from November 2024 where StatusGator’s monitoring helped users stay ahead.

Performing for the holidays: Look beyond uptime for season sales success

With the holiday shopping season in full swing, poor web performance can have a big impact on revenue. There’s intense competition for online shoppers, and customers will quickly bounce to another site instead of slogging through a bad experience. The best way to track and achieve your web performance goals is through experience-based SLOs (Experience Level Objectives, or XLOs).

Catching Flaky Tests Before It's Too Late

This is a guest post from Artem Zakharchenko, creator of MSWJS, an API mocking library for JavaScript. He also writes about testing for EpicWeb and on his personal blog. Test flakiness is a big issue. Not only can it be a colossal time investment to detect and fix, but it hurts perhaps the biggest value you get from your tests: their trustworthiness. A test you cannot trust is a useless test. Time spent maintaining a useless test is time wasted; time that could have been spent building.

Optimizing ClickHouse Performance: Diagnosing and Resolving Common Bottlenecks

ClickHouse, a columnar database designed for high-performance real-time analytics, is excellent at handling large datasets with speed and efficiency. However, performance issues can occur due to factors like unoptimized queries, resource contention, or improper configuration. As data and query complexity grow, keeping ClickHouse fast can be challenging. This blog will explore common bottlenecks, how to diagnose and resolve them, and include a Python script for automating diagnostics. Let's get started!

Top 8 Docker Alternatives to Consider in 2025

Containerization platforms have evolved beyond Docker's initial implementation, offering specialized solutions for diverse enterprise requirements. Modern container runtimes focus on enhanced security models, optimized resource utilization, and seamless integration with cloud-native architectures. This analysis examines key alternatives that address Docker's technical limitations and provide advanced features for production workloads.

Step by Step Guide to Monitoring Apache Spark with MetricFire

Apache Spark is a powerful tool for processing and analyzing large datasets quickly, whether you're cleaning data for a report, running machine learning models, or analyzing real-time data streams. It's widely used for everything from building big data pipelines to crunching numbers for advanced analytics, thanks to its speed and ability to scale across clusters.

Super-charging your Cloud Operating Model with Turbo360

Efficient cloud management starts with visibility. Turbo360, an advanced Azure monitoring and cost management platform, not only provides this visibility but also transforms it into actionable insights, empowering organizations to optimize their Microsoft cloud investments. At its core, Turbo360 blends platform, solution, and infrastructure monitoring with cost management, offering a unified solution for complex Azure environments.

Grafana Loki Query Best Practices with LogQL (Loki Community Call December 2024)

In this December's Loki Community Call, Cyril Tovena, Senior Principal Engineer and LogQL guru, walks us through a Grafana Loki query tutorial with LogQL, the Log Query Language used for Loki. He talks about the key "Dos and Don'ts" of LogQL, offering practical tips to help you write better queries, boost performance, and sidestep common mistakes. Whether you’re tuning up your current setup or just diving into LogQL, Cyril’s got you covered.

Real-time Windows Server Monitoring - From insights to Action

Are you ready to monitor your Windows machines? In this webinar, we’ll guide you through the latest strategies for real-time observability, system and infrastructure optimization. Featuring a hands-on live demo and insights from industry experts, this session will be full of actionable techniques to help you gain deeper visibility into your Windows infrastructure, troubleshoot faster, and improve performance with ease.

Debug Faster & Smarter with Session Replay

As developers, we know that debugging can be a time-consuming process. Hunting down elusive bugs or trying to reproduce an issue based on vague user reports can turn a simple fix into an hours-long journey. While leveraging logs, metrics, and tracing to reproduce locally or try to understand what happened can help us identify a root cause, we’re often missing a critical component to truly being able to understand the impact on our users.

5 Essential Innovations Revolutionizing Pool Services Today

Innovation has turned the pool service industry into a thriving tech hub. Today, managing pools isn't just about chlorine and filters anymore. Technology transforms everything from scheduling maintenance to ensuring safety. Smart tools make processes smoother, save time, and improve customer satisfaction. Let's look at the cutting-edge advancements that are changing how businesses operate in this space.

Release Alert | Oracle Management Pack 5.4!

We’re excited to announce the official release of the NiCE Oracle Management Pack 5.4. The new release comes with tons of new features to make your life as a Microsoft SCOM admin easier. Learn how this latest release enhances your Oracle monitoring experience on Microsoft SCOM and Azure Monitor SCOM MI, delivering smarter, more flexible solutions for your environment.

Chargeback: A vital practice that often goes untapped in cloud cost management

Businesses are empowered to scale at unprecedented levels. However, with this growth comes the challenge of controlling cloud costs. Among the various financial considerations, one practice often slips through the cracks: chargeback. While this might sound like just another accounting term, having a chargeback model is a cornerstone of efficient cloud cost management, yet it often remains untracked or underutilized.

How Icinga Powers Worldline's Global Payment Solutions

We’re proud of our many customers and users around the globe that trust Icinga for critical IT infrastructure monitoring. That’s why we’re now showcasing some of these enterprises with their success stories. These are stories from companies and organizations just like yours, of any size and from many different industries. Some of them are our long-standing customers; others have recently benefited from migrating to Icinga from another solution.

Grafana 11.4 | Support for OpenSearch PPL and SQL queries in the AWS CloudWatch Data Source Plugin

In this video, Ida, a software engineer from the AWS Data Sources squad, introduces an exciting feature in the CloudWatch data source plugin. With OpenSearch SQL and OpenSearch PPL now supported, you can leverage familiar query languages to explore and visualize your AWS CloudWatch data alongside the existing Logs Insights query language.

Lightrun AI Autonomous Debugger

This video showcases how, with the Lightrun developer observability platform, developers can leverage the AI debugger within the platform plugin to swiftly identify critical code-level issues through automated hypotheses and insertion of debugging actions at runtime (Lightrun dynamic logs, virtual breakpoints (Snapshots), and more). That helps reduce MTTR to mere minutes.

How I reduced an API call from >5 seconds to under 100ms

Given that 100% of the databases I have interacted with in my professional career have been SQL databases, my data-based mental model (please enjoy my pun) has always defaulted to a relational one. However, when spinning up a tiny side project in 2020 (a bot to provide interactivity to my Twitch stream), my data-storing requirements didn’t call for a relational model at the time, so I chose a NoSQL solution: MongoDB.

Grafana 11.4 release: Introducing support for OpenSearch PPL and OpenSearch SQL in the AWS CloudWatch data source plugin

Holidays came early for AWS users: Grafana 11.4 introduces support for two new query languages in the AWS CloudWatch data source plugin. Announced during AWS re:Invent, AWS CloudWatch Logs expanded its querying capabilities with the addition of OpenSearch Piped Processing Language (PPL) and OpenSearch SQL. In Grafana 11.4, the AWS CloudWatch data source plugin has been updated to offer the same functionality and the same flexibility.

SD-WAN Performance: Don't Trust, Validate. Here's How

Across regions and industries, organizations are continuing to expand their use of SD-WAN technologies. This move is happening for good reason. With SD-WAN, organizations are realizing reduced costs, improved communication security, and enhanced flexibility. One of the ways SD-WAN delivers these benefits is through an intelligent layer of abstraction that manages network traffic and dynamically controls the flow of data.

Are Our Networks Ready for AI?

With all the hype surrounding AI, it’s critical to focus on building resilient networks capable of handling the performance demands that AI will introduce. As I often say, if your network observability solution isn’t detecting packet loss, neither will your AI engine. When you ask, “What’s the status of our global network health this morning?” a flawed or incomplete response could jeopardize critical decisions.

Strengthen your server security: Site24x7's approach to patching and monitoring

Security is at the forefront of every decision made in the tech space. A key tool in the fight against server attacks is patching. In this article, let us learn how Windows-reliant enterprises secure their servers and VM instances: with patches and Site24x7.

How to Handle and Troubleshoot the SIP 400 Bad Request Error

If you’re running into the dreaded SIP 400 – Bad Request Error, it might feel like you’re hitting a brick wall. But fear not! This type of error usually indicates that something is off with your SIP request, and it can be fixed pretty easily if you know where to look. Sometimes it’s due to typos, missing information, or formatting issues that cause a SIP server to not understand your request.

Smarter apps, deeper insights - A look back at 2024

It’s hard to overstate how fast technology is driving business evolution — and Microsoft 365 and Microsoft Teams were once again a big part of that story in 2024. At Martello, we evolved right in step, adding new capabilities to Vantage DX and shedding new light on what enterprises need to manage the Microsoft digital experience.

Automatically group events and reduce noise with AI-powered Intelligent Correlation

When you have a complex IT environment with many disparate tools, data sources, and teams, alert noise becomes overwhelming. This can delay incident response and cause missed alerts, ultimately leading to critical incidents and outages. Datadog Event Management’s Event Correlation groups and deduplicates events and alerts, reducing noise and helping response teams act on alerts faster.

Troubleshoot infrastructure changes faster with Recent Changes in the Resource Catalog

Organizations often struggle to maintain visibility and control over their distributed cloud infrastructure, where changes in a single resource can have cascading effects throughout the system and potentially cause disruptions. In these environments, infrastructure changes that lead to incidents are often hard to troubleshoot—especially when teams are using disparate tools with siloed data—leading to longer resolution times, more downtime, and negative business outcomes.

Six Enterprise AI Predictions for 2025

In many ways, the upcoming year is shaping up to be one of opportunity and innovation as IT leaders see more benefits and options around AI than ever before in running the enterprise. By the same token, this progress is creating new complexities and choices for organizations to navigate. Through conversations with ScienceLogic customers, leading industry analysts, partner companies and key executives, several AI-related themes have emerged moving into 2025.
Sponsored Post

Understanding Network Traffic Flow and Segment Analysis

With every webpage loaded, email sent, or video streamed, network traffic takes a complex journey across multiple infrastructure nodes. From the device to the destination, data packets travel across various gateways and networks, through routers, switches, and service providers along the way. Understanding the network traffic paths and segments along the journey reveals much about performance, latency, congestion, and possibly even bottlenecks. In this article, we'll delve into the interconnected, intricate routes network traffic takes.
Sponsored Post

3 Raygun tips to keep your e-commerce site in peak shape this holiday season

The holiday season is here, and with it comes the yearly surge in online shopping traffic. If you're a software developer working for an e-commerce company, you're likely gearing up for the busiest time of the year. Keeping your application smooth, fast, and error-free during this high-pressure period is essential to ensuring a stellar customer experience. Luckily, if you're already using Raygun, you've got the tools to stay on top of it all! To help you be prepared, we've compiled three practical tips for using Raygun to keep your site running like clockwork, no matter how many shoppers hit your platform.

Auto-provisioning support for SAML SSO

This enhancement is part of Raygun’s 12 Days of Christmas 2024. Over the next few weeks, we’ll share daily updates on bug fixes and feature improvements inspired by feedback from you, our customers. These are the small but impactful changes you’ve asked for, designed to make Raygun faster and easier to use. Check back tomorrow for the next update and see how we’re leveling up your experience one day at a time! Our special thanks to Airton from Brazil who suggested this great idea!

Enhancing IT efficiency with network configuration management in OpManager Plus

Network configuration and change management (NCCM) is indispensable for maintaining healthy, secure infrastructures. Like other aspects of an infrastructure, network configurations are never static; they evolve, and proactively maintaining a watch on this dynamic environment is a daunting task.

.NET error grouper V8

This enhancement is part of Raygun’s 12 Days of Christmas 2024. Over the next few weeks, we’ll share daily updates on bug fixes and feature improvements inspired by feedback from you, our customers. These are the small but impactful changes you’ve asked for, designed to make Raygun faster and easier to use. Check back tomorrow for the next update and see how we’re leveling up your experience one day at a time! Our special thanks to Isak from Sweden who suggested this great idea!

The Complete Podman vs Docker Analysis: Features, Performance & Security

Choosing the right container engine for your infrastructure stack is a critical architectural decision. While both Podman and Docker implement OCI (Open Container Initiative) standards, their fundamental approaches to container management and runtime architecture create distinct operational characteristics.

Splunk Platform Use Cases, Written Just for You

If you're a Splunk customer, chances are high that you use either Splunk Enterprise or Splunk Cloud Platform on a daily basis. With powerful dashboards, scalable indexes, and data streaming, these core products give you immense data analysis powers and actionable insights. And that's something everybody wants! But you aren't everybody. You're uniquely you - a specific customer working in a specific industry with specific use cases.

Icinga Notifications Web: Desktop Notifications

We recently released the beta version of our Notification Web Module, which includes a cool feature that is not yet known to everyone. We named it Desktop Notifications (Browser Push Notifications). With this feature enabled, your browser can send you instant notifications based on your configured event rules—provided you’re logged into Icinga Web.

Introducing Warm Tier: Cost-Efficient Log Storage to Simplify Observability

These days, one of the most important decisions that organizations can make as it relates to their observability strategy is: “How much data do we want to retain in Hot storage to ensure we have everything needed for real time analysis — without running up associated costs?”

10 best practices to optimize single-page applications (SPAs)

Since they were introduced over two decades ago, single-page applications (SPAs) have transformed web experiences, offering fast and fluid interactions akin to native apps. With dynamic updates on a single HTML page, users can interact with a web app without waiting for page reloads. Compared to non-dynamic web pages, SPAs reduce server load and improve interaction speeds. However, SPAs also present unique challenges for optimization, especially as they grow more complex.

5 key OCI services you must monitor for optimal cloud performance

Effectively monitoring Oracle Cloud Infrastructure (OCI) services is crucial for maintaining smooth operations, cost efficiency, and robust security. OCI offers a suite of services, each vital to powering applications and workloads. Understanding what to monitor and why is key to preventing downtime, managing costs, and optimizing performance.

Cribl: Empowering Data Freedom with Open Standards and Unmatched Flexibility

If you are familiar with Cribl’s solutions, you know that we offer our customers choice and control over their data. The entire company is built on the idea that we want to help you get your data from anywhere to anywhere using open standards and open data formats. It is your data, and you have full control over what you collect and how it is handled.

Go Client Library for InfluxDB 3.0

In a world driven by data, efficient time series data management is a growing concern. APIs play a significant role in automating tasks, especially in cloud-based environments. Go, with its high performance and concurrency, is quickly becoming one of the standard languages for writing cloud infrastructure and utilities for managing streams of data.

ElasticGPT: Empowering our workforce with generative AI

Like all organizations, Elastic deals with an ever-increasing volume of information and data, making it harder for our teams to keep information up to date and for employees to find answers from relevant resources. As a leading Search AI company, our approach to customer-first starts with customer zero — us. When our employees needed a better way to find the information necessary to do their jobs, we knew we could use our own technology to bring that vision to life.

Cloud-Based Monitoring & Automation | Webinar by NiCE and Kelverion

Join us for an exclusive webinar presented by NiCE and Kelverion, where we’ll dive into how to enhance your IT infrastructure through cloud-based monitoring and automation. This session will focus on leveraging the power of Microsoft Azure and System Center Operations Manager Managed Instance (SCOM MI) for seamless integration, real-time monitoring, and automated workflows.

How to Mitigate DDoS Attacks and the Impact on Availability

Distributed Denial of Service (DDoS) attacks are intended to overwhelm a network or server and cause failure or work stoppage. DDoS attacks first appeared in the mid-1990s and continue to the present day. Far from going away, they have become more prevalent: in the first quarter of 2024, the number of DDoS attacks against web servers increased by 93% compared to the same period a year earlier. One survey found that nearly 70% of organizations experienced 20 to 50 DDoS attacks per month.

Understanding the Differences Between Flow Logs on AWS and Azure

AWS VPC Flow Logs and Azure NSG Flow Logs offer network traffic visibility with different scopes and formats, but both are essential for multi-cloud network management and security. Unified network observability solutions analyze both in one place to provide comprehensive insights across clouds.

Optimize and troubleshoot cloud storage at scale with Storage Monitoring

Organizations today rely on cloud object storage to power diverse workloads, from data analytics and machine learning pipelines to content delivery platforms. But as data volumes explode and storage patterns become more complex, teams often struggle to understand and proactively optimize their storage utilization. When issues arise—such as unexpected costs or performance bottlenecks—these teams frequently lack the visibility needed to quickly identify and resolve root causes.

The Art of the Possible: Drawing Up Your DEX Strategy

Digital Employee Experience (DEX) is more than just technology; it's a transformative journey. In this first episode of our series, Emily Schlick (Vizient) and Zakir Mohammed (Toyota) share their strategies and lessons from their first year of DEX implementation. Learn how they built a strong DEX strategy, empowered people and processes, and partnered with Nexthink to drive success. Whether you're starting your DEX journey or looking to level up, this episode is packed with actionable insights to help your organization thrive.

OpenTelemetry - Complete Guide to the Open-Source Observability Framework

In cloud-native environments, observability is key to ensuring the health, performance, and stability of distributed systems. Observability helps developers and operations teams understand how their systems behave in real time, helping diagnose issues, optimize performance, and meet service-level agreements.

5 Cybersecurity Tips for Managing Blockchain in Cloud Environments

Blockchain is reshaping industries by offering transparent and secure transaction processes. When paired with cloud environments, it unlocks even greater scalability. But this combination introduces risks. Without strong cybersecurity practices, sensitive data becomes vulnerable. Attacks on blockchain-based systems are rising, targeting loopholes in poorly managed setups. How can you protect your blockchain infrastructure in the cloud? Here are a few lynchpin strategies to implement for this purpose.

Split your projects in Azure DevOps

This enhancement is part of Raygun’s 12 Days of Christmas 2024. Over the next few weeks, we’ll share daily updates on bug fixes and feature improvements inspired by feedback from you, our customers. These are the small but impactful changes you’ve asked for, designed to make Raygun faster and easier to use. Check back tomorrow for the next update and see how we’re leveling up your experience one day at a time! Our special thanks to David from Atlanta who suggested this great idea!

Mastering cloud tag management: A key to smarter cloud cost management

The cloud has revolutionized the way businesses operate, providing unparalleled scalability and flexibility. However, effective cloud cost management can be challenging, with a significant part of this due to the way cloud resources are tagged. In this blog, we’ll explore how cloud tag management plays a crucial role in cost management and reveal how tools like ManageEngine CloudSpend simplify budgeting and forecasting for organizations.

Implementing OpenTelemetry in Angular - A Practical Guide

Angular applications often grow in complexity, making it challenging to monitor performance and troubleshoot issues effectively. Enter OpenTelemetry: a powerful, vendor-neutral framework for telemetry collection. This guide will walk you through implementing OpenTelemetry in your Angular projects, enhancing your ability to observe and optimize your applications.

Gain comprehensive visibility into your ECS applications with the ECS Explorer

Amazon Elastic Container Service (ECS) is a container orchestration service that enables you to efficiently deploy new applications or modernize existing ones by migrating them to a containerized environment. Building on ECS gives you the flexibility, scalability, and security that containers offer, but also presents challenges in monitoring and troubleshooting your applications and infrastructure.

Introducing Datadog's Next-Generation Rust-based Lambda Extension

In 2021, we announced the release of the Datadog Lambda extension, a simplified, cost-effective way for customers to collect monitoring data from their AWS Lambda functions. This extension was a specialized build of our main Datadog Agent designed to monitor Lambda executions.

Grafana Alerting: Save time and effort with Grafana-managed recording rules

Grafana Alerting has seen steady growth and adoption since it was revamped in Grafana 9. Since then, we’ve been busy making your alerts more robust, more reliable, and easier to manage. As part of that process, Grafana Alerting has adopted several concepts from Prometheus. The Prometheus alerting model is well understood and flexible, and with Grafana Alerting we want to bring that same flexibility to all Grafana data sources.

What is Network Discovery? Everything You Need to Know

Network discovery is the crucial first step for any IT team looking to manage a modern, dynamic network. As companies embrace flexible work options and adopt complex hybrid environments, taking stock of all connected devices is essential to maintain performance, ensure security, and enable users to stay productive from anywhere. This article will cover everything you need to know about network discovery, from its core purpose to how it works to the tools that make it happen.

Simplify operations across hybrid cloud with OpsRamp

According to IDC, 80% of organizations are running hybrid and multicloud environments, bringing new complexities and risks for IT leaders. When it comes to operations, IT teams find it challenging to maintain visibility across cloud and on-prem systems, optimize more and more tools, and automate operations, all while ensuring cost efficiency and staying agile. Traditional approaches complicate things further, often leading to silos and inefficient resource use.

MTTR guide: how to improve system reliability & response time

Your system just went down. Your team scrambles around frantically while customers flood your inbox with complaints. Each passing minute feels like an eternity. Sound familiar? DevOps and SRE teams know this scenario all too well. Mean time to repair (MTTR) directly impacts your customer trust and company reputation. MTTR might seem simple on the surface: measure how long it takes to fix problems. But nailing this metric takes more than just tracking numbers.
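
As a rough illustration of the "measure how long it takes to fix problems" idea, the basic arithmetic could look like the sketch below. It assumes MTTR is taken as the plain average of (resolved minus detected) across incidents; the `mean_time_to_repair` helper and the incident timestamps are hypothetical and not taken from the article.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average repair duration over (detected_at, resolved_at) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so the sum stays a timedelta
    )
    return total / len(incidents)


# Hypothetical incident log: (detected_at, resolved_at)
incidents = [
    (datetime(2024, 12, 1, 9, 0), datetime(2024, 12, 1, 9, 30)),    # 30 minutes
    (datetime(2024, 12, 5, 14, 0), datetime(2024, 12, 5, 15, 30)),  # 90 minutes
    (datetime(2024, 12, 9, 22, 0), datetime(2024, 12, 9, 23, 0)),   # 60 minutes
]

mttr = mean_time_to_repair(incidents)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")  # MTTR: 60 minutes
```

Teams differ on when the clock starts (detection, acknowledgement, or first customer impact), so the choice of timestamps matters at least as much as the averaging itself.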

How to create the perfect internal status page

Picture this: Your team is scrambling during a system hiccup. Messages fly back and forth, everyone's checking different dashboards, and no one has the full picture. Sounds familiar? That's why more companies use internal status pages as their single source of truth. These private dashboards show you everything that matters.

Actian & Grafana Cloud: The Search for a Customizable Observability Tool | ObservabilityCON 2023

Over the past few years, Actian has shifted from offering a solely on-premises data integration, management, and analytics product to supporting hybrid and multi-cloud environments as well. To keep up, the team needed a customisable observability tool, and found it in Grafana Cloud. Lead Cloud Operations Engineer Suleyman Kutlu will share his team’s journey, starting with metrics and logs, and venturing into load testing, frontend observability, IRM, and more.

Grafana Labs Customers: What We've Learned Building Observability at Massive Scale

An in-depth conversation with a panel of observability leaders from Sky, Just Eat Takeaway.com, and BlackRock. The panelists share stories about their organizations’ observability journeys, their perspectives on scaling observability across an enterprise, and their opinions on the current trends in the space.

How to elevate your IT strategy starting today: SolarWinds Observability Self-Hosted

Discover the power of SolarWinds Observability Self-Hosted, the ultimate solution for full-stack visibility across your hybrid IT environment. From network to infrastructure, apps, databases, and security, gain a centralized view to detect and resolve issues faster than ever before.

Sending Alerts Using Prometheus and Alertmanager

Continuing our series on setting up Prometheus in a container, this article provides a step-by-step guide for how to configure alerts in Prometheus. We will add alerting rules and deploy Prometheus Alertmanager with Slack integration. If you follow the steps in this article, you will end up with a containerized setup. Let's get started.

Latest Product Updates and Features in Logz.io | December 2024

We’re rolling out new visualization capabilities in the Explore log management interface that are available now in some accounts and will be added to all in the coming weeks and months. Warm Tier: There is now a new option for log storage and access that bridges the gap between high-performance Hot storage and the low-cost Cold Tier. Reach out to your customer success team for more information.

AI Agent RCA on Alerts: Get the Info You Need, Fast

A critical component of any monitoring and observability system is alerting. But alerts in and of themselves aren’t enough—when something goes wrong, time is of the essence, and your team needs to figure out not just what’s going on but how to fix it, and fast. Additionally, constantly chasing down alerts can be the bane of any observability practitioner’s existence.

The Why and What of AWS Lambda Monitoring

Serverless architectures are the rental tux of computing. If you’re using AWS to manage and scale your underlying infrastructure, you’re renting compute time or storage space. Your Lambda functions are the tie or cummerbund you purchase to customize your rental. Using the AWS event-driven architecture improves business agility, allowing you to move quickly. Lambda is the on-demand compute service that runs custom code driving an event’s response.

Kentik Bytes: Enhancing Azure Observability with Kentik

Kentik offers exceptional visibility into Azure public cloud environments, allowing users to easily filter and explore cloud telemetry. The platform provides detailed insights into network resources, including traffic metrics and peering information. Users can focus on specific applications and visualize data in a wide variety of formats, including Sankey diagrams. Additionally, you can adjust time frames, create alerts, and share reports for better traffic management.

Troubleshooting Cloud Traffic Inefficiencies with Kentik AI

Balancing cost efficiency and high performance in cloud networks is a constant challenge, especially when misconfigurations or inefficient routing lead to inflated costs or degraded performance. Learn how Kentik Journeys simplifies traffic analysis, helping cloud engineers identify inefficiencies like unnecessary Transit Gateway routing.

The New Way of React Native Debugging

This is a guest post from Simon Grimm, creator of Galaxies.dev, where Simon helps developers learn React Native through fast-paced courses and personal support. Debugging React Native apps has traditionally been a bit of a pain. Developers usually rank debugging as their biggest React Native pain point, and, as we all know, debugging makes up quite a lot of development time. But the good news is that things are getting better.

Catch frustration before it costs you: New tools for a better user experience

Imagine you're on a website trying to purchase a product, but every time you click the "Add to Cart" button, nothing happens. Frustrating, isn’t it? Such moments can deter consumers from completing their online purchases. And while users find this annoying, it poses an even bigger challenge for businesses.

State of Cloud Costs

Cloud spending continues to grow, but managing costs effectively remains a challenge for many organizations. In this video, Datadog Senior Product Manager Kayla Taylor dives into our recent State of Cloud Costs report—which analyzed AWS cloud cost data from hundreds of organizations—to understand the key factors driving cloud expenses. We explore the impact of adopting emerging compute technologies like Arm-based processors, GPUs, and AI capabilities, how usage patterns and previous-generation technologies affect cloud costs, and the role of AWS discount programs in cost management.

Mobile crash reporting and debugging best practices

Maintaining a crash-free, stable mobile app should be top priority for all mobile developers. App stores penalize mobile apps that have high crash rates, and more importantly, buggy apps create poor user experiences, resulting in bad reviews and lost customers. Watch this session to learn key tips for identifying, resolving, and preventing crashes, fast, so you can spend less time troubleshooting and more time building.

Year-end recap: What's new in IT infrastructure monitoring: 2024

Effective IT monitoring is critical to maintaining seamless operations, and 2024 has been a year of addressing challenges and delivering solutions with Site24x7. From upgrading server health and performance to streamlining Kubernetes and VM administration, let's plunge into how Site24x7’s updates have helped IT teams tackle their monitoring challenges and enhance infrastructure reliability.

The Journey to Autonomic IT: Why Enterprises Must Let Go to Learn

Several of our recent blog posts have introduced the characteristics of each phase of the Autonomic IT maturity model, from Siloed IT to Coordinated IT (an essential foundation for Autonomic IT) and the transition to Machine-Assisted IT and AI-Advised IT. We explored how you can identify where your organization stands on this transformative journey, why you might not be as far along as you believe, and what is needed to advance your journey. Now we arrive at IT nirvana: Phase 5, Autonomic IT.

Monitor AWS Trainium and AWS Inferentia with Datadog for holistic visibility into ML infrastructure

AWS Inferentia and AWS Trainium are purpose-built AI chips that—with the AWS Neuron SDK—are used to build and deploy generative AI models. As models increasingly require a larger number of accelerated compute instances, observability plays a critical role in ML operations, empowering users to improve performance, diagnose and fix failures, and optimize resource utilization.

Common Microsoft Teams Issues & How to Troubleshoot

Microsoft Teams is one of the most popular tools for work communication today. Whether you're chatting with your team, jumping on a video call, or sharing files, it helps keep everyone connected. But let's face it – MS Teams isn’t perfect. You’ve probably run into issues like calls dropping, bad audio, or slow Teams performance. These problems can be frustrating, especially when you’re in the middle of an important meeting or deadline.

Top 10 Synthetic Monitoring Tools for 2024

When it comes to ensuring your website’s performance and uptime, synthetic monitoring tools have become indispensable. These tools help businesses proactively detect and resolve issues before they affect real users, offering peace of mind and optimal website performance. In this article, we’ll explore what synthetic monitoring is, the best tools for 2024, and why Dotcom-Monitor is our top choice.

IT Asset Tracking: Complete Control Guide

Managing your IT assets shouldn’t feel like juggling countless hardware, software, licenses, and online resources. Without comprehensive management software, your team may struggle with visibility, accuracy, and compliance, leading to inefficiencies and risks. Motadata’s IT Asset Management Software simplifies the entire process, from discovery to monitoring, inventory management, and reporting.

A Guide to Streamlined Troubleshooting with Intuitive Log Management Solutions

Efficient troubleshooting is a cornerstone of maintaining smooth operations in modern IT environments. Systems generate immense volumes of data, and sifting through logs without a structured approach can be challenging. Intuitive log management solutions simplify the process, helping IT teams quickly pinpoint issues and enhance system performance. This guide explores the key aspects of leveraging log management tools for seamless troubleshooting.

How Managed IT Services in McKinney, TX Boost Security

For businesses in McKinney, TX, IT managed support has become a practical solution for achieving a secure and reliable technology environment. Managed IT services provide more than just technical support: they offer protection from data breaches, system disruptions, and unauthorized access. By adopting managed IT services, companies in McKinney can minimize security risks while focusing on their core business.

How Log Analytics Powers Four Essential CloudOps Use Cases

Cloud computing shapes the ability of enterprises to transform themselves and effectively compete. By renting elastic cloud resources, enterprises can support new customer platforms, distributed workforces, and back-office operations. The cross-functional discipline of CloudOps helps enterprises manage cloud resources by optimizing applications and infrastructure.

New API endpoint to delete all source maps

This enhancement is part of Raygun’s 12 Days of Christmas 2024. Over the next few weeks, we’ll share daily updates on bug fixes and feature improvements inspired by feedback from you, our customers. These are the small but impactful changes you’ve asked for, designed to make Raygun faster and easier to use. Check back tomorrow for the next update and see how we’re leveling up your experience one day at a time! Our special thanks to Kelvin from the U.K.

Four Practical Ways to Grow Your Microsoft DEM Business

Our previous blog explained why enterprises need help managing Microsoft 365 and Microsoft Teams. If you can step in with a digital experience management (DEM) offer, you can create opportunities to expand into new markets, establish recurring revenue streams, cement customer loyalty and improve your margins. The challenge is that every enterprise is going to have its own specific pain points.

Managing the Microsoft Experience is an Open Opportunity for MSPs

Few solutions are more essential to enterprise productivity and collaboration than Microsoft 365 and Microsoft Teams. Microsoft 365 is the second most-used office suite in the world, with a 46% share of the market. Microsoft Teams had more than 320 million monthly active users by the start of 2024 and continues to grow, especially thanks to the integration of Copilot AI and value-adds like Teams Rooms and Teams Phone.

Duolingo: Speaking the Language of Observability with Honeycomb

In the world of digital language learning, Duolingo stands out as a beacon of innovation and user engagement. With millions of users worldwide, their platform is designed not only to teach languages, but also to create a fun and engaging learning experience. Running on the robust AWS cloud infrastructure, Duolingo manages vast amounts of data and user interactions daily. As the company experienced rapid growth, Duolingo remained steadfast in their commitment to delivering a high-quality user experience.

Lightrun Unveils Game-Changing Visual Studio Extension and Dynamic Traces at AWS re:Invent 2024

As we kick off the AWS re:Invent 2024 conference, we’re thrilled to introduce two major developer observability and live debugging advancements that bring even greater power and flexibility to developers and engineering teams everywhere. These new product capabilities — the Lightrun Visual Studio Extension and Lightrun Dynamic Traces — are designed to elevate customers’ observability workflows and streamline their development processes directly within their IDE.

Unlocking Insights with Heroku Logs: Complete Guide

Heroku is a popular platform for deploying and scaling applications, and one of its standout features is its centralized logging system. Heroku logs give you visibility into your application’s behaviour, infrastructure events, and platform activities. When paired with a robust monitoring solution like Atatus, you can transform raw log data into actionable insights that keep your applications running smoothly.

How Datadog migrated its Kubernetes fleet on AWS to Arm at scale

Over the past few years, Arm has surged to the forefront of computing. For decades, Arm processors were mainly associated with a handful of specific use cases, such as smartphones, IoT devices, and the Raspberry Pi. But the introduction of AWS Graviton2 in 2019 and the adoption of Arm-based hardware platforms by Apple and others helped bring about a dramatic shift, and Arm is now the most widely used processor architecture in the world.

Achieve total app visibility in minutes with Single Step Instrumentation

Datadog APM and distributed tracing provide teams with an end-to-end view of requests across services, uncovering dependencies and performance bottlenecks to enable real-time troubleshooting and optimization. However, traditional manual instrumentation, while customizable, is often time consuming, error prone, and resource intensive, requiring developers to configure each service individually and closely collaborate with SRE teams.

Monitor your OpenAI LLM spend with cost insights from Datadog

Managing LLM provider costs has become a chief concern for organizations building and deploying custom applications that consume services like OpenAI. These applications often rely on multiple backend LLM calls to handle a single initial prompt, leading to rapid token consumption—and consequently, rising costs. But shortening prompts or chunking documents to reduce token consumption can be difficult and introduce performance trade-offs, including an increased risk of hallucinations.

Smarter Operations: How Rollbar + GrowthBook Minimize Downtime and Boost Reliability

Software development and operations teams are the guardians of system stability, ensuring uptime, reliability, and performance across complex software ecosystems. The stakes are high—every second of downtime impacts your brand’s reputation and bottom line. That’s why integrating Rollbar’s error monitoring with GrowthBook’s feature flagging is a game-changer for ops teams.

Reflecting on Site24x7's digital experience monitoring for the year 2024!

Last year, we made significant progress at Site24x7. We focused on delivering new features and updates to improve your monitoring experience. Our releases and enhancements this year focused on adding more metrics, widening your visibility into the performance of your resources, and ensuring that you don't miss even the smallest detail. We hope you've found these improvements valuable.

Simplify OpenTelemetry Metrics with Cribl Edge OTLP Conversion

Cribl Edge can send data to OpenTelemetry in several different ways. In this blog post, we’ll focus on OpenTelemetry Metrics. We’ll talk about Cribl Edge throughout, but what we say applies to Cribl Stream, too! We will cover how to use Cribl Edge to collect Linux System Metrics, transform them into the OTLP Metrics format, and deliver them to an OTLP Destination.

The Leading SNMP Monitoring Tools

SNMP, which stands for Simple Network Management Protocol, is often viewed as a legacy protocol. It is no longer actively developed, which led both Microsoft and Google to pronounce SNMP dead. Yet SNMP is still commonly used across numerous industries, as its advantages, especially for network monitoring, are profound. Practically all network components across all vendors possess built-in SNMP capability.

Easiest Way to Monitor Your API Endpoints Using Telegraf

Monitoring the health of your API endpoints is crucial to keeping your applications running smoothly and ensuring users have a reliable experience. Keeping an eye on 4XX and 5XX status codes can help you spot issues like client errors, misconfigurations, or server problems before they get out of hand. Plus, setting up alerts for when these errors spike allows you to react quickly, fix problems, and maintain a high-quality service that your users can count on.
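To make the 4XX/5XX idea concrete, here is a minimal sketch in plain Python (not Telegraf, which the article itself uses). It buckets a window of HTTP status codes into client and server errors and flags a class when its share of requests exceeds a threshold. The function names, the sample data, and the 5% threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def classify(status):
    """Bucket an HTTP status code into '4xx', '5xx', or 'ok'."""
    if 400 <= status <= 499:
        return "4xx"
    if 500 <= status <= 599:
        return "5xx"
    return "ok"


def error_rates(statuses):
    """Return the fraction of requests that were 4xx and 5xx."""
    total = len(statuses)
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0}
    counts = Counter(classify(s) for s in statuses)
    return {kind: counts[kind] / total for kind in ("4xx", "5xx")}


def classes_to_alert(statuses, threshold=0.05):
    """List the error classes whose rate exceeds the (assumed) threshold."""
    rates = error_rates(statuses)
    return [kind for kind, rate in rates.items() if rate > threshold]


# Hypothetical window of 100 responses: 90 OK, 4 client errors, 6 server errors
window = [200] * 90 + [404] * 4 + [500] * 6
print(error_rates(window))       # {'4xx': 0.04, '5xx': 0.06}
print(classes_to_alert(window))  # ['5xx']
```

In practice the window would come from whatever collects the responses, and the threshold would be tuned to normal traffic so that a spike, rather than the usual background of errors, triggers the alert.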

Uptime vs. Availability: What's the Difference and Why It Matters

In June 2019, a curious thing happened. Students were forced to go fully analog, putting pencil to paper when they couldn’t log in to their Google Classroom accounts. Avid media consumers sat staring blankly at buffering YouTube videos. Gmail notifications came to a screeching halt as inboxes sat eerily quiet. It wasn’t that the Google Cloud Platform had crashed — far from it.

The evolution of Grafana Cloud Synthetic Monitoring: new features, pricing updates, and more

With 2024 coming to a close, it’s a good time to reflect on how Grafana Cloud has evolved this year — and synthetic monitoring, in particular, is one area where we’ve really focused our efforts. In May, we rolled out a revamped version of Grafana Cloud Synthetic Monitoring with the overall goal of making your monitoring processes not just more efficient, but more impactful.

How to query private network data without an agent using AWS and Grafana Cloud

Connecting to data sources in a private network or an Amazon Virtual Private Cloud (Amazon VPC) can require extra attention to the network security configuration to prevent unintended network exposure. For example, if you wanted to query a network-secured data source, like a MySQL database or an Elasticsearch cluster, that is hosted in an on-premises private network, you would need to open your network to inbound queries from a range of IP addresses.

Prometheus Blackbox Exporter vs Kuberhealthy for K8s monitoring

We all implement tools to monitor our nodes and keep our entire cluster up and running. But how often do updates, failures, or errors mean that users suffer outages, even though our status boards look green? As Kubernetes has enabled more complex microservice architecture, the gap between the state of the dashboard, and the health of services for the user, has grown wider.

How to Fix "Upstream Connect Error" in 7 Different Contexts

The error "upstream connect error or disconnect/reset before headers. reset reason: connection failure" has become a challenge for DevOps teams. This critical error, occurring when services fail to establish or maintain connections with their upstream dependencies, can significantly impact system reliability and user experience.

SolarWinds Observability SaaS: Visibility across cloud-native, on-prem, and hybrid IT stacks.

Ready to transform your IT operations? SolarWinds Observability unifies your entire tech stack—network, infrastructure, apps, databases, and user experience—into one seamless platform. Gain business-level insights, analytics, and automation to optimize performance and ensure availability. Monitor everything: from cloud infrastructure to network devices, all on a single dashboard. With health scores, dynamic dependency maps, and detailed log analysis, pinpointing issues has never been easier.

Secure your cloud environment from end to end with Datadog Infrastructure-as-Code Security

Infrastructure-as-code (IaC) tools like Terraform and CloudFormation allow teams to define, manage, and provision their cloud infrastructure using code, as opposed to clicking through consoles or executing commands via a CLI. IaC adoption is now widespread and helps teams increase productivity and efficiency, but it also introduces new surface area for mistakes, defects, and other risks.

Why Do Organisations Choose Splunk's Observability Solution to Improve Digital Resilience?

Listen to Patrick Peeters, Observability Advisor at Splunk, to learn more about how Splunk's modern observability tools are rapidly evolving to meet organisations' demands for scalability, ease of use, real-time insights, and AI to improve their digital resilience.

The future is now, introducing Dynamic Observability from AI innovations built on logs

A year ago, I shared my thoughts at re:Invent, explaining why I joined Sumo Logic as CEO and laying out the importance of logs as a key differentiator. A year later, the atomic level of logs is even more paramount. It’s not just because Sumo Logic is years ahead in technology when it comes to ingesting and analyzing structured and unstructured logs.

Real-time Monitoring - Guide to Real-time Network Monitoring

Maintaining a reliable and secure network is essential for businesses of all sizes. Real-time network monitoring has become crucial, allowing organizations to monitor their network performance and security at every moment. This guide will explore what real-time monitoring entails, how it works, and why it matters for your organization. We will also look into some popular tools like SolarWinds Observability, Datadog, ManageEngine, and Paessler.

EnterpriseDB vs. PostgreSQL

When it comes to choosing the right database management system (DBMS) for your enterprise needs, PostgreSQL—also known as Postgres—and EnterpriseDB (EDB) Postgres Advanced Server often top the list of contenders. While Postgres is a widely adopted open-source database known for its stability and feature-rich capabilities, EDB builds on this with additional tools and enterprise-focused features.

From ELK Stack to easy - Elastic Observability on Elastic Cloud Serverless

Announcing the general availability of Elastic Observability on Elastic Cloud Serverless, a fully managed observability solution. As organizations scale, finding an observability solution that can handle the complexity of distributed cloud environments and provide real-time insights often feels like an insurmountable challenge, largely due to data- and cost-related compromises.

Centrally manage Agent upgrades and configurations with Datadog Fleet Automation

Teams can gain deep visibility into their applications and infrastructure by installing Datadog’s client-side agent software—the Datadog Agent—throughout their environment. And to help ensure the Agent is deployed correctly and consistently, Datadog’s Fleet Automation feature already helps teams centrally view Agent installations and configurations. But teams also need an easier way to manage the deployment and configuration of the Agent at scale.

API update: Manage source maps

We’re thrilled to announce the latest endpoints for the Raygun API: source maps. This new release allows developers to efficiently add or remove their source maps, with increased flexibility and control over their Raygun platform. The Raygun API now gives you multiple endpoints to manage your JavaScript source maps, making handling error tracking for your web apps easier than ever.

New host filter for Real User Monitoring

This enhancement is part of Raygun’s 12 Days of Christmas 2024. Over the next few weeks, we’ll share daily updates on bug fixes and feature improvements inspired by feedback from you, our customers. These are the small but impactful changes you’ve asked for, designed to make Raygun faster and easier to use. Check back tomorrow for the next update and see how we’re leveling up your experience one day at a time! A special thanks to Dan from Michigan who suggested this great idea!