Blog

ObservabilityCON Day 2 recap: The latest Grafana Cloud tools for Prometheus to improve alerting, debugging, and scaling. Plus why continuous monitoring matters now

ObservabilityCON 2020 is live! This week Grafana Labs is bringing together the Grafana community for talks dedicated to observability. We hope you’re able to catch the great sessions we have planned. You can find the full schedule on the event page, and for additional information on viewing, participating in Q&As, and more, check out our quick guide to getting the most out of ObservabilityCON. Day 2 was dedicated to all things Prometheus — featuring new solutions and in-depth case studies.

Top 10 Log Monitoring Reports You Must Have

Log monitoring can be a tedious process. Any system that logs generates numerous log files in the log database that you need to track. Though a log file parser can help you search through multiple or large logs easily, log monitoring is typically one of those processes that we only look at once something stops working. Windows system logs contain operating system logs as well as logs from applications such as Internet Information Services (IIS) and SQL Server.

Kubeflow operators: lifecycle management for the ML stack

Canonical, the publisher of Ubuntu, releases Charmed Kubeflow, a set of charm operators to deliver the 20+ applications that make up the latest version of Kubeflow, for easy consumption anywhere, from workstations to on-prem, public cloud, and edge. Visit Charmed-kubeflow.io to learn more. Kubeflow provides the cloud-native interface between Kubernetes, the industry standard for software delivery and operations at scale, and data science tools: libraries, frameworks, pipelines, and notebooks.

Scaling Puppeteer & Playwright on Checkly with Terraform

Managing large numbers of checks by hand quickly becomes cumbersome. Luckily, Checkly's REST API allows us to automate most of the repetitive steps. Building on that API, the Checkly Terraform Provider takes automation one step further, enabling users to specify their active monitoring setup as code. In this article, we will be building on top of John Arundel's great intro from a few months back and showing how to manage multiple checks using groups and shared code snippets.

Netdata's dashboard: open by default and secure by design

Let’s talk through a scenario: You have a Linux-based VM running on DigitalOcean (aka a Droplet), and you install Netdata on it using our recommended kickstart script. As the installation process winds down, the Droplet starts up the Netdata Agent’s web server and serves the local Agent web dashboard on port 19999.

Build Organizational Trust With PagerDuty Business Response

Imagine the following scenario: A large retailer experiences a major IT incident that impacts their point-of-sale systems. Their on-call engineers are alerted to the issue and begin their work to resolve it immediately. Behind the scenes, teams are collaborating on a fix, but in the storefront, frustration and tension are growing. Customers are complaining about not being able to check out, and in-store personnel have no good answers as to why the outage happened—or when it will be resolved.

Samurais Do NOT Use UIs: Using CLI To Configure Codefresh And Create And Manage Kubernetes Pipelines

Are you a ninja? It’s a silly question. I know that you are most likely not a real ninja. But you might consider yourself a ninja of software engineering. “What does Viktor mean by that?” I’m glad you asked. Ninjas appear, perform the mission, and disappear without leaving a trace behind. “Why is Viktor talking nonsense? What does that have to do with software engineering?” Again, I’m glad you asked.

Performance Improvements, Reliability, and Feature Flag Mishaps

Last October, I published a blog post describing the efforts we've committed to on the Bitbucket Cloud engineering team to achieve world-class reliability. A lot has happened in the past year (understatement of the year)! What the team has accomplished is tremendous, but we've also learned a thing or two that we can work further to improve.