Operations | Monitoring | ITSM | DevOps | Cloud

July 2023

Sponsored Post

Kubernetes Monitoring Best Practices

Kubernetes can be installed using different tools, whether open-source, third-party vendor, or in a public cloud. In most cases, default installations have limited monitoring capabilities. Therefore, once a Kubernetes cluster is running, administrators must implement monitoring solutions to meet their requirements. Typical use cases for Kubernetes monitoring include: Effective Kubernetes monitoring requires a mix of tools, strategy, and technical expertise. To help you get it right, this article will explore seven essential Kubernetes monitoring best practices in detail.

The Medium is the Message: How to Master the Most Essential Incident Communication Channels

We’ve all seen it: a company experiencing a major incident and going radio silent, leaving their customers to wonder “Are they doing something about this?!”. If you’ve ever been on the inside of something like this, you know the answer is most likely yes, there are people working hard to put out the fire as quickly as possible. But when it comes to incidents, perception is reality for customers.

Looking Beyond Atlassian StatusPage: The 5 Best Alternatives

Status Pages are crucial cogs in your Incident Communication process. They serve as vital channels to keep your stakeholders informed during periods of downtime. Although there are many proficient tools in the market, such as Atlassian Status Page and Status.io, these standalone Status Pages can come with a hefty price tag, with various pricing plans and tiers for both Public and Private Status Pages. Moreover, with Atlassian Cloud’s recent issues, its dependability is in question.

Breaking Down the Pillars of Observability from Data to Outcomes

The world of cloud-native and distributed microservices has revolutionized software development and deployment. However, the sheer volume of data these systems generate can often lead to confusion and uncertainty. You're not alone if you've ever felt lost in the sea of observability data.

Webinar: Embracing Declarative Provisioning and Observability in cloud environments

Organizations face increasingly complex challenges in deploying and managing their systems in today's rapidly evolving technological landscape. Declarative provisioning and observability have emerged as a powerful approach to address these challenges. This talk delves into declarative provisioning and observability, exploring its benefits, principles, and practical implementation strategies.

Introduction to ELK Tech Stack

ELK Stack, also known as the Elastic Stack, is a powerful and versatile open-source toolset that has revolutionized the way businesses manage and analyze their data. ELK Stack seamlessly integrates three robust components (Elasticsearch, Logstash, and Kibana) to offer a comprehensive solution for searching, analyzing, and visualizing large volumes of data in real time. So buckle up for a comprehensive overview of the ELK stack and its components, which will be a great starting point for beginners.

Pinpoint performance issues in downstream services with the Dependency Map Navigator

Visibility into the upstream and downstream dependencies of your services is key to maintaining a performant microservices environment. Application developers and SREs rely on this visibility to quickly trace issues back to the source, which is essential during incidents—when time is of the essence—throughout day-to-day operations, and as systems evolve and scale.

Enhanced Incident Response: Maximizing Microsoft Teams with Squadcast

Of late, more and more businesses are relying on ChatOps tools like Microsoft Teams for a range of functions beyond simple communication. Incident management is no exception to this growing trend. However, Microsoft Teams alone may not possess all the necessary capabilities to efficiently perform these functions. To bridge this gap, integration with core applications becomes necessary.

Mastering Zero Trust - Pillars for Security

Zero Trust is a heightened security measure that blocks people and devices from accessing company data by default, only allowing access to those who prove they require it. Zero Trust assumes restricted access to company resources by all: Anyone or anything accessing company resources requires verification each time the system is accessed. There are no options to “trust this device next time” or “save password for next time”.

Templates for Automating Incident Response

A security incident is the last thing any DevOps lead wants to see. Along with the vast number of protocols required to overcome an incident, there’s a hefty amount of paperwork to complete. Security incidents can even lead to legal repercussions if personal data is leaked. Incident response templates offer insight into: An incident response plan template drastically reduces the time and effort spent dealing with incident reports.

Unveiling Multibot, the "glue" for enterprise workflows

How are you delivering Slack incident management workflows that serve the many teams across your enterprise? How are you addressing the differences in their use cases, access needs, isolation needs, and tech stacks, all while enabling everyone to collaborate? These are challenging questions to answer. To effectively do so, you have a host of conditions to support at the team and company-wide levels.

Video: How to Apply the Golden Signals to Your Monitoring Strategy

The Four Golden Signals, developed by Google SREs, are key metrics used to monitor the health of your systems. In today’s complex IT environments, these key metrics can help engineers and IT operations prioritize the most significant issues to address. The Four Golden Signals are latency, traffic, errors, and saturation. In the following 9-minute video, I focus on two of these signals in particular, latency and errors, because they often result in customer-facing symptoms.
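As a rough illustration of the two signals highlighted above, the following Python sketch derives an error rate and a latency percentile from a handful of request records. The `Request` type, the sample values, and the choice to count only 5xx responses as errors are assumptions made for this example; they are not taken from the video.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record: latency in milliseconds plus HTTP status."""
    latency_ms: float
    status: int


def error_rate(requests):
    """Fraction of requests that ended in a server error (assumed here: 5xx)."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return errors / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of request latency, in milliseconds."""
    if not requests:
        raise ValueError("no requests to summarize")
    ordered = sorted(r.latency_ms for r in requests)
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up sample data for illustration only.
sample = [
    Request(120, 200),
    Request(95, 200),
    Request(480, 500),
    Request(150, 200),
    Request(2300, 503),
]

print("error rate:", error_rate(sample))
print("p95 latency (ms):", latency_percentile(sample, 95))
```

In practice these numbers would usually come from a metrics system rather than an in-memory list. The sketch only shows what each signal measures.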

8 Tips to incorporate the voice of the customer in your story grooming/sprint planning

Creating successful products and projects goes beyond just great ideas and flexible processes. It's about truly understanding and listening to your customers. Attentively listening to their wants and needs unlocks invaluable insights that can revolutionize your story planning and project execution. In this blog, we'll look at easy but powerful tips to use the customer's input during story planning.

Take back control of your Monitoring

The challenges in the monitoring world are known widely. We all know about these problems, what they are, and why they are important. While each one of the problems has its own solution, it all boils down to one thing – COST. How do we balance the tradeoffs without worrying about the huge costs of solving these challenges? For high-precision monitoring and observability, you need efficient and high-precision control levers. Take back control of your Monitoring with Levitate - a managed time series data warehouse.

What Is Site Reliability Engineering? Understanding the complexities of this crucial function

Site reliability engineers manage a lot, and often in incredibly high-stakes environments. Remember that scene from "The Matrix" where Neo dodges bullets in slow motion? Of course you do. As an SRE, it can feel like you're the person getting hit by those bullets, frantically trying to investigate performance issues, automate away toil, and support the engineers around you, all before the next wave of attacks.

Share highly customizable Blameless Retrospectives as ServiceNow Problems

For many organizations, ServiceNow is a crucial platform to run and scale your organization across all departments. Many organizations’ engineering teams have been relying on ServiceNow Incident and Problem Management. Despite that, many have been experiencing a growing volume of incidents hindering their ability to scale not only their incident response but also their retrospective operations, potentially compromising their data governance and compliance requirements.

Understanding Chaos Engineering and its Benefits

In today's fast-paced technological landscape, ensuring the resilience and dependability of systems is crucial. This is where Chaos Engineering comes in, transforming how organizations approach system testing and fortification. Chaos Engineering helps find vulnerabilities that could go undetected under normal circumstances by purposefully introducing controlled interruptions and failures.

26 DevOps Automation Tools that SaaS Loves in 2023 | Blameless

DevOps is a term combining “development” and “operations”. It involves the use of tools and processes to minimize the time and effort spent on software creation and maintenance. Many DevOps technologies use automation to reduce manual tasks. Some of these DevOps automation tools use AI-based technology to replace manual operations, while others rely on simpler scripting and processing. This speeds up feedback and improves performance between development and operations departments.

Improve Visibility and Capture More Data with Triage Incidents

As new incidents emerge, there are often many unknowns about the size, severity, and cause of the problem. Sometimes it’s not clear if the problem is an incident at all. That’s where introducing a triage stage to your incident management process can help. In this post, we’ll look at the benefits of adding a triage layer to your incident management, and how Rootly’s Triage feature allows you to seamlessly transition from triage to real incident (or false alarm).

The Incident Response Lifecycle: Strategies for Effective Incident Management

The nature of security and incident management is cyclical rather than linear. Resolving an issue doesn't mark the end of the team's responsibilities. Instead, it signals the opportunity to enhance reliability, strategize, prepare, and prevent similar problems. This is where incident response comes into the picture. But what is incident response, and what steps are included in the incident response lifecycle? Let's understand them in detail.

Docker Compose Logs: Guide & Best Practices

Docker Compose is a tool for defining and running multi-container Docker applications. It allows developers to streamline the process of configuring, building, and running multiple containers as a single unit with a docker-compose.yml. This configuration file specifies the services, networks, and volumes required for an application, and their relationships and dependencies. The docker-compose logs command displays the logs of all services defined in the docker-compose.yml file.
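A minimal sketch of driving the `docker-compose logs` command from Python is shown below. It assumes `docker-compose` is on the PATH and that the script runs from the directory containing docker-compose.yml. The helper names and the `web` service are hypothetical, chosen only for illustration. The `--follow`, `--tail`, and `--timestamps` options are standard flags of the logs command.

```python
import subprocess


def compose_logs_command(services=(), follow=False, tail=None, timestamps=False):
    """Build the argument list for `docker-compose logs` (helper name is illustrative)."""
    cmd = ["docker-compose", "logs"]
    if follow:
        cmd.append("--follow")        # keep streaming new log lines
    if tail is not None:
        cmd.append(f"--tail={tail}")  # show only the last N lines per container
    if timestamps:
        cmd.append("--timestamps")    # prefix each line with its timestamp
    cmd.extend(services)              # no services listed means all services
    return cmd


def show_logs(**options):
    """Run the command; requires Docker Compose and a docker-compose.yml in the cwd."""
    return subprocess.run(compose_logs_command(**options), check=True)


# Print the command that would be run for a hypothetical service named "web".
print(" ".join(compose_logs_command(services=["web"], tail=100)))
```

Calling `show_logs()` with no arguments corresponds to plain `docker-compose logs`, which displays the logs of every service defined in the file.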