How a simple metric drives reliability culture at Slack

How do you track reliability in an organization with hundreds of engineers, dozens of daily production changes, and over 32 million monthly users? What's more, how do you do it in a way that's simple, presentable to executives, and doesn't dump a ton of extra work onto engineers' plates? Slack recently wrote about how they created the Service Delivery Index for Reliability (SDI-R), a simple yet comprehensive metric that became the basis for many of their reliability and performance indicators.

Automate Agent installation with the Datadog Ansible collection

Ansible is a configuration management tool that helps you automatically deploy, manage, and configure software on your hosts. By turning manual workflows into automated processes, you can speed up your deployment lifecycle and ensure that all hosts are equipped with the proper configurations and tools. The Datadog collection is now available in both Ansible Galaxy and Ansible Automation Hub.

Gateways and BindPlane

The BindPlane Agent is a flexible tool that can be run as an agent, an aggregator, or both. As an agent, the collector runs on the same host it collects telemetry from, while an aggregator collects telemetry from other agents and forwards the data to its final destination. There are several reasons you might want to consider inserting aggregators into your pipelines. Today we will examine these reasons and some possible architectures for implementing aggregators.

How to Build an Internal Developer Platform: Everything You Need to Know

Enter the world of the Internal Developer Platform (IDP), a game-changing solution that empowers development teams to streamline workflows, optimize resource management, and enhance productivity. This article is your comprehensive guide to building an Internal Developer Platform, demystifying its components, benefits, and best practices.

Find Trending Problems Faster with Escalating Issues

Knowing which issues to hit the snooze button on, and which to drop everything and push a hotfix for, is a common developer dilemma. As with the work discussed in Sleep More; Triage Faster with Sentry, we’ve been collecting and iterating on customer feedback for ways to reduce issue noise and surface high-priority issues faster.

Machine Learning for Fast and Accurate Root Cause Analysis

Machine Learning (ML) for Root Cause Analysis (RCA) is the state-of-the-art application of algorithms and statistical models to identify the underlying reasons for issues within a system or process. Rather than relying solely on human intervention or time-consuming manual investigations, ML automates and enhances the process of identifying the root cause.

Building a Distributed Security Team

In this live stream, Cjapi’s James Curtis joins me to discuss the challenges of building a distributed global security team. Watch the full video or read on to learn about some hard-won examples of how to be successful with remote team building and management. Talent is hard to find, and companies are hiring from all over the world to build the best teams possible, but this trend has a price.

How Technology Revolutionized the Quality Control and Inspection Industry

New technology has greatly changed how quality inspection is carried out. The field has been transformed by the transition from traditional means of inspection to cutting-edge systems, which are much better at ensuring that products and processes are top-notch.