Operations | Monitoring | ITSM | DevOps | Cloud

A hands-on guide to work with MindSpore on Kubeflow

According to Gartner’s 2022 report on top technology trends, AI engineering will be an important pillar in the near future. It is composed of three core technologies: DataOps, MLOps and DevOps. The discipline’s main purpose is to develop AI models that can quickly and continuously provide business value, for instance models that enable cross-functional collaboration, automation, data analysis, and machine learning.

Charmed Kubeflow now integrates with MindSpore

On 8 November 2022, at Open Source Experience Paris, Canonical announced that Charmed Kubeflow, Canonical’s enterprise-ready Kubeflow distribution, now integrates with MindSpore, a deep learning framework open-sourced by Huawei. Charmed Kubeflow is an end-to-end MLOps platform with optimised complex model training capabilities designed for use with Kubernetes.

Observability is Still Broken. Here are 6 Reasons Why.

In an era where there’s no shortage of established best practices and tools, engineering teams are consistently finding that preventing, detecting and resolving production issues is only getting harder. Why is this the case? Our most recent DevOps Pulse Survey highlighted alarming trends on this front.

A Magic Quadrant Leader in ITSM Platforms for the ninth year in a row

I’m proud and humbled to announce ServiceNow is a Leader in the 2022 Gartner® Magic Quadrant™ for IT Service Management (ITSM) Platforms for the ninth year in a row. We also ranked first in all three use cases evaluated in the 2022 Critical Capabilities for ITSM Platforms: Service Desk (3.76/5), Service Operations (3.73/5), and Business Workflow Automation (3.81/5).

Empower the SREs - Conclusions from The SRE Report 2023

Let's be honest, nobody loves surveys. Ok, well I sure don't. But surveys satisfy a huge need in our demand for insights into complex human-computer, sociotechnical systems. It turns out that we've been measuring the computer part pretty well, but the humans – not as easy to keep track of. When Google SRE first defined toil as a metric we wanted to reduce, we spent far too long trying to quantify it numerically based on tooling and insights from computer systems.

I've Made a Huge Mistake: Implementing Agile on Infrastructure Teams

Bad planning methods can damage team morale and prevent teams from improving the systems they maintain. In this talk, Sam Handler from Shopify explains how his attempts to fix poor infrastructure planning processes through Agile methods failed. Drawing from this experience, he offers several principles that can help infrastructure teams improve the way they work.

Scaling Up, One Network Bottleneck at a Time

Processing data at scale involves moving packets through a network—but what happens when that network isn't cooperative? Anatole Beuzon, a Software Engineer at Datadog, discusses how he investigated and resolved network issues in Datadog’s larger data-processing apps and how you can apply these same methods to your own production workloads.

Ask a Site Reliability Engineer (SRE)

Site reliability engineering (SRE) can be complicated, and at Datadog, we’ve spent a lot of time thinking about SRE and refining how we implement it. Join Datadog’s Brandon West and Rick Mangi as they provide a brief overview of SRE and its core concepts. This video also contains a Q&A session from the live taping of this panel.

FinOps and Cloud Cost Optimization

As companies scale, it’s become increasingly important to keep cloud cost management and optimization top of mind. In this talk, Yuval Yogev from Sygnia walks you through Sygnia’s optimization journey of cutting their total cloud costs in half. Yogev also shares insights into how you can optimize your own organization’s cloud usage and spend.