This is a guest post by Noah Zoschke, Engineering Manager at Segment. Segment is the customer data infrastructure that makes it easy for companies to clean, collect, and control their first-party customer data. At Segment, our ultimate goal is to collect data from Sources (e.g., a website or mobile app) and route it to one or more Destinations (e.g., Google Analytics and AWS Redshift) as quickly and reliably as possible.
Apache Hive is an open source interface that allows users to query and analyze distributed datasets using SQL commands. Hive compiles SQL commands into an execution plan, which it then runs against your Hadoop deployment. You can customize Hive by using a number of pluggable components (e.g., HDFS and HBase for storage, Spark and MapReduce for execution). With our new integration, you can monitor Hive metrics and logs in context with the rest of your big data infrastructure.
Dashboards provide critical visibility into the performance and health of your environment. But if your organization uses hundreds or thousands of dashboards, or if you’ve recently transitioned to a new company or different team, it’s not always easy to understand the full significance of the data shown on every single dashboard.
Ansible is an automation tool for provisioning, managing, and deploying infrastructure and applications. When you are building large-scale applications, Ansible lets you manage and configure your infrastructure across platforms like AWS. Whether you rely on temporary or dedicated hosts, you can use Ansible to create a repeatable process for configuring them with the Datadog Agent.
Apache Ambari is an open source management tool that helps organizations operate Hadoop clusters at scale. Ambari provides a web UI and REST API to help users configure, spin up, and monitor Hadoop clusters with one centralized platform. As your Hadoop deployment grows in size and complexity, you need deep visibility into your clusters as well as the Ambari servers that manage them. If issues arise in Ambari, they can lead to problems in your data pipelines and cripple your ability to manage clusters.