Operations | Monitoring | ITSM | DevOps | Cloud

Managing External-DNS & cert-manager with Komodor

Recently we’ve explored the evolving role of Kubernetes as a full ecosystem, rather than just a platform, diving into the power and complexity of add-ons. These tools, as highlighted previously, are key to augmenting Kubernetes’ core capabilities, adding (as their name implies) essential functionality that Kubernetes itself does not support directly.

Finding Your Way: Using Metrics to Explore Organizational Architecture

Imagine being the new developer in a bustling tech company. Everyone is rushing to meet deadlines, and no one has time to explain the tangled web of services, databases, and messaging systems that make up the organization’s architecture. You search high and low for documentation, but the few diagrams you find are outdated or incomplete. Feeling lost? This is where metrics can come to the rescue.

The importance of error budgets for SREs and how to monitor them

Digital-first customers who are always on the go expect a seamless experience. But let’s face it—100% uptime is a myth. Trying to achieve it can drain resources and stifle innovation. This is where error budgets come in. They help site reliability engineers (SREs) find the sweet spot between delivering reliability and development velocity. With error budgets, teams can focus on building a robust system without burning out over perfection.

DeepSeek vs Llama vs GPT-4 - Open-Source AI models compared

Artificial intelligence is no longer a futuristic concept: it is shaping how businesses operate, how researchers innovate, and how people interact with technology. Models like DeepSeek-R1, a promising new entrant, alongside established players such as Llama 3 and GPT-4o, are at the forefront of this transformation. These tools are not just about technological advancement; they are about solving real-world problems and driving meaningful progress.

Why Monitoring as Code Is the Future of Application Reliability for Modern Teams... and how it can save you $1 million!

I recently talked to a customer of Checkly and he shared some thoughts about Monitoring as Code. Let’s call him Karl in this article. Karl and I talked about why Monitoring as Code (MaC) is becoming essential for teams operating at scale. As the Head of Platform Engineering at a major e-commerce company processing millions of transactions daily, his experience shows how MaC solves a lot of the messy challenges that come with traditional synthetic monitoring setups.

Realizing the business value of OpenTelemetry-native observability

Transform your organization's observability strategy with open standards and simplified data collection. Modern organizations face an unprecedented observability challenge. As systems grow more complex and distributed, traditional monitoring approaches are struggling to keep pace. With data volumes doubling every two years and systems spanning multiple clouds and technologies, organizations need a new approach to maintain visibility into their operations.

How To Monitor Status Pages of Popular Apps With Cloud Status

Remember the last time you noticed your app was acting weird, only to discover, after 30 minutes of debugging, that a critical service was down? We’ve all been there, frantically clicking through various status pages trying to figure out what’s broken and wishing we knew how to monitor the status pages of our third-party dependencies.

Get One Step Closer to the Dark NOC with Incident Response Automation

Imagine a world where your Network Operations Center (NOC) runs so smoothly that it practically disappears into the background—no manual ticket triaging, no frantic war rooms, no all-nighters spent chasing false alarms. That’s the dream of a Dark NOC—a fully autonomous operations center where automation takes the wheel, reducing human intervention to a bare minimum.

Simplify DevOps tasks with this go-to cheat sheet: From Go programming to automation

DevOps is a dynamic field that bridges development and operations, ensuring seamless collaboration and faster software delivery. Whether you're just starting or looking to sharpen your skills, having quick access to essential concepts is invaluable. That’s why we’ve created a DevOps cheat sheet that covers everything from programming fundamentals to scripting and website building. This cheat sheet is your go-to resource for mastering DevOps tools, languages, and workflows.

How the Gremlin agent fails safely

Testing shouldn’t feel risky. While it might sound counterintuitive, certain types of testing can actually increase risks to your systems. Load testing, for example, is a great way to see how your systems behave under pressure, but it can also cause those same systems to fail if they aren’t equipped to handle the load. For some types of testing, this is necessary, as is the case with reliability testing and Chaos Engineering.