Operations | Monitoring | ITSM | DevOps | Cloud

Building a Distributed Security Team

In this live stream, Cjapi’s James Curtis joins me to discuss the challenges of building a distributed global security team. Watch the full video or read on to learn about some hard-won examples of how to be successful with remote team building and management. Talent is hard to find, and companies are hiring from all over the world to build the best teams possible, but this trend has a price.

Machine Learning for Fast and Accurate Root Cause Analysis

Machine Learning (ML) for Root Cause Analysis (RCA) is the state-of-the-art application of algorithms and statistical models to identify the underlying reasons for issues within a system or process. Rather than relying solely on human intervention or time-consuming manual investigations, ML automates and enhances the process of identifying the root cause.

Find Trending Problems Faster with Escalating Issues

Knowing what issues to hit the snooze button on, or drop everything and push a hotfix for is a common developer dilemma. Similarly to what was discussed in Sleep More; Triage Faster with Sentry, we’ve been collecting and iterating on customer feedback for ways to reduce issue noise and surface high-priority issues faster.

How to Build an Internal Developer Platform: Everything You Need to Know

Enter into the Internal Developer Platform (IDP) world, a game-changing solution that empowers development teams to streamline workflows, optimize resource management, and enhance productivity. This article is your comprehensive guide to building an Internal Developer Platform, demystifying its components, benefits, and best practices.

Gateways and BindPlane

The BindPlane Agent is a flexible tool that can be run as an agent, an aggregator, or both. As an agent the collector will be running on the same host it's collecting telemetry from, while an aggregator will collect telemetry from other agents and forward the data on to their final destination. Here are a few of the reasons you might want to consider inserting Aggregators into your pipelines: Today we will examine these reasons, and some possible architectures for implementing aggregators.

Automate Agent installation with the Datadog Ansible collection

Ansible is a configuration management tool that helps you automatically deploy, manage, and configure software on your hosts. By turning manual workflows into automated processes, you can quicken your deployment lifecycle and ensure that all hosts are equipped with the proper configurations and tools. The Datadog collection is now available in both Ansible Galaxy and Ansible Automation Hub.

How a simple metric drives reliability culture at Slack

How do you track reliability in an organization with hundreds of engineers, dozens of daily production changes, and over 32 million monthly users? Even more, how do you do this in a way that's simple, presentable to executives, and doesn't dump a ton of extra work on to engineers' plates? Slack recently wrote about how they created the Service Delivery Index for Reliability (SDI-R), a simple yet comprehensive metric that became the basis for many of their reliability and performance indicators.

How To Create an Incident Communication Plan

Every business faces incidents, no matter how tight-knit or high-tech. Downtime, glitches, system failures, and security breaches are all part of online operations. So all companies must prepare to face such issues, including communicating them to key stakeholders. Take widespread data breaches, for example. If a breach occurs, a business might need to communicate with hundreds or thousands of stakeholders, including DevOps teams, affected accounts, investors, corporate leaders, and media outlets.