
Infrastructure dashboards: Declutter your monitoring data and ensure you're not overspending

The task of monitoring and managing an entire network, including all the servers and applications that run on it, is by no means easy. With so many components of varying complexity, the volume of performance data coming at you can be overwhelming. This information overload increases the chances of missing data that could help discover performance inefficiencies.

Deploying a Containerized App in Google GKE

Because of its popularity and widespread adoption, Kubernetes has become the industry’s de facto standard for deploying containerized apps. Google Kubernetes Engine (GKE) is Google Cloud Platform’s (GCP) managed Kubernetes service. It provides out-of-the-box features such as node auto-scaling, high-availability clusters, and automatic upgrades of masters and nodes. In addition, it offers a convenient cluster setup workflow and a strong overall developer experience.

How to Use Monitoring Tools to Improve Root Cause Analysis

As an IT manager, you have often heard your line manager or a user say, “Let’s drill down to find the root cause.” As dreaded as that request may seem, answering it is the most important step in understanding IT outages. IT infrastructure availability depends heavily on isolating problems, so that the component at fault can be fixed without bringing the entire system to a halt. This is where root cause analysis (RCA) can be of tremendous help.

Better Python Decorators with Wrapt

Our instrumentation uses built-in extension mechanisms where possible, such as Django’s database instrumentation. But libraries often have no such mechanisms, so we resort to wrapping third-party libraries’ functions with our own decorators. For example, we instrument jinja2’s Template.render() function with a decorator to measure template rendering time. We place a high value on the correctness of our instrumentation, so that it does not affect our users’ applications.
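The wrapping approach described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual instrumentation: it uses the standard library's functools.wraps rather than wrapt, and the Template class, the timed() helper, and the timings list are hypothetical stand-ins invented for the example (the real target would be jinja2's Template.render()).

```python
import functools
import time


def timed(record):
    """Build a decorator that reports each call's duration to `record`.

    `record` is a hypothetical callback taking (function name, seconds).
    """
    def decorator(func):
        @functools.wraps(func)  # copy __name__, __doc__, etc. onto the wrapper
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs even if func raises, so failed calls are timed too.
                record(func.__qualname__, time.perf_counter() - start)
        return wrapper
    return decorator


class Template:
    """Hypothetical stand-in for a third-party template class."""

    def __init__(self, source):
        self.source = source

    def render(self, **context):
        """Render the template with the given context."""
        return self.source.format(**context)


# Collected (name, seconds) pairs; a real agent would send these elsewhere.
timings = []

# Instrument the method by replacing it on the class with the wrapped version.
Template.render = timed(lambda name, seconds: timings.append((name, seconds)))(
    Template.render
)

output = Template("Hello, {name}!").render(name="world")
```

After the call, timings holds one entry for Template.render, and the rendered output is unchanged. A plain functools.wraps decorator copies the name and docstring but does not preserve every property of the wrapped callable, which is presumably the kind of correctness gap that motivates a library such as wrapt.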

Introduction to Site Reliability Automation, Enabled by AIOps from Broadcom

To support digital transformation, organizations are increasingly looking to site reliability engineering (SRE) approaches to manage complex infrastructures and handle the pace of change induced by DevOps. When issues occur, it’s important to have an integrated toolset that can determine the root cause and apply remediation. In addition, the solution should understand the results of the remediation action, which increases confidence in applying the right actions to similar issues. The result is site reliability automation: infrastructure monitoring enhanced with intelligent recommendations and auto-remediation capabilities that help SREs create more resilient production environments. To learn more, go to broadcom.com/aiops