Optimize Kubernetes Performance Part 1: Cluster Configurations

Kubernetes is a powerful platform that comes with many features to help engineers run their applications more efficiently. However, as you gain more experience and deploy more workloads, you’ll inevitably start looking for ways to optimize Kubernetes performance. There are many ways to approach optimization. On one hand, you could work exclusively with the tools and configurations provided by Kubernetes itself; on the other, you could reap the benefits of third-party tools.

Cloud Visibility - First step to your Cloud cost journey

Visibility is often cited as a critical, or even the first, step in any cloud cost journey. But what exactly does everyone mean by visibility, and why is it so important? As organizations move to the cloud, they turn to cloud providers like Amazon Web Services (AWS), Microsoft Azure, VMware, and Google Cloud Platform to keep track of the sheer volume of resources involved in their cloud infrastructure. Here’s the rub: what you see is what you control!

Secure open source MLOps for AI/ML applications in financial services

The adoption of AI/ML in financial services is increasing as companies seek to drive more robust, data-driven decision processes as part of their digital transformation journey. For global banking, McKinsey estimates that AI technologies could potentially deliver up to $1 trillion of additional value each year. But productionising machine learning at scale is challenging.

Profiling 101: What is profiling?

The performance of your app matters. From ensuring a good user experience to retaining users, performance makes a difference in your app’s success. Using the right tools can make it easier to ensure your code is meeting your performance goals, before you have to switch to a bigger EC2 instance or users start complaining. One of the best tools in a developer’s toolbox for ensuring good performance is profiling.

Is Kubernetes Monitoring Flawed?

Kubernetes has come a long way, but the current state of open source Kubernetes monitoring needs improvement. This is partly due to the unnecessary volume of data that monitoring generates. For example, a 3-node Kubernetes cluster with Prometheus will ship around 40,000 active series by default. Do we really need all that data?

Connecting OpenTelemetry to AWS Fargate

OpenTelemetry is an open-source observability framework that provides a vendor-neutral and language-agnostic way to collect and analyze telemetry data. This tutorial will show you how to integrate OpenTelemetry with AWS Fargate, a serverless compute engine for containers that allows you to run and scale containerized applications without managing the underlying infrastructure.

Announcing our improved Schedules & On-Call Rotations

Hey folks! We are super excited to announce that our schedules feature has gone through a bit of an update. Well, more than a bit 🙂. We’ve gone through the feature with a fine-toothed comb and introduced a bunch of UI and functional improvements which we hope will help you achieve one thing: set up, edit and manage your on-call schedules at scale in a matter of minutes. (Yes, that was three things, but it was tough to condense it to ONE thing.)

Root cause log analysis with Elastic Observability and machine learning

With more and more applications moving to the cloud, an increasing amount of telemetry data (logs, metrics, traces) is being collected, which can help improve application performance, operational efficiencies, and business KPIs. However, analyzing this data is extremely tedious and time-consuming given the tremendous amounts of data being generated. Traditional methods of alerting and simple pattern matching (visual inspection, simple searching, etc.) are not sufficient for IT Operations teams and SREs.

The business value of frequent deployments: Recouped time

The first post in this series introduced the idea of the different layers of value that your business can gain from frequent deployments and focused on the hard costs you can save. We’re looking at the role the database plays here because it’s the most complicated part of the process and it’s difficult to hit aggressive KPIs and goals when your teams are burdened with process bloat due to mistake-prone, manual work.

SRE Report 2023: Findings From the Field - Toil

Toil. Few other words have the same visceral impact for SREs as their four-letter nemesis: toil. Although pretty much everyone recognizes and agrees that toil is bad, it is a term that is frequently misused in everyday conversation. In common English usage, toil is defined as “long strenuous fatiguing labor”. As a term of art in the SRE profession, “toil” has several very specific characteristics that distinguish it from other sorts of work people spend time on.