The latest News and Information on DevOps, CI/CD, Automation and related technologies.

incident.io: A scalable incident management solution built for enterprises

For enterprise businesses, a lot is riding on the efficiency of their incident response. These organizations have large customer bases, complex products, and many incidents. They also have loads of incident responders across various roles, making it difficult to coordinate internally.

Unveiling Squadcast's Enhanced Status Pages

Meet Kevin and Mai (again): Navigating the Troublesome Waters of Platform Downtime. Kevin is a Site Reliability Engineer (SRE), constantly on the lookout for potential downtime that could impact their platform, kryptobro.com. Mai is his adept partner, ever-ready to troubleshoot. In their journey, the previous version of Squadcast Status Pages served as a helpful tool, but they soon found room for improvement.

Anything But Tech Debt

Tech debt is usually one of the most fraught topics on engineering teams. Engineers often feel they aren’t allowed enough time to address tech debt. Product partners wonder why engineers spend so much time working on it—or at least talking about it. “The business” always seems to insinuate that engineers should do less of it, instead focusing on shipping value to customers.

Using UX and Observability to Track Application Health

UX (user experience) is a core factor that determines the success of an application or platform in a distributed system. Specifically, developers need to understand the infrastructure within an entire application stack to improve and refine the user experience to meet customer expectations without guesswork. System downtime remains a significant source of revenue and reputational losses for enterprises, employees, and customers.

Here's what it feels like to deploy every day

With Sleuth, Gigpro's software engineering team went from one deploy every two weeks to once a day. That made releases less stressful and helped improve team culture. Give Sleuth a try and see how we empower software teams to build faster by making engineering efficiency easy to improve and measurable, in a way that both managers and developers love.

cert-manager can do SPIFFE? - Civo Navigate NA 2023

Ashley Davis, Senior Software Engineer and Maintainer of cert-manager, discusses the capabilities of cert-manager, an easy way to manage certificates in Kubernetes clusters. Ashley highlights the importance of trust-manager for managing trust bundles, enabling clients to verify certificate legitimacy. Additionally, he explores the potential of using X.509 certificates as a universal identity control plane in distributed systems through the concept of "SPIFFE" (Secure Production Identity Framework For Everyone).

Enhanced Ubuntu Experience on Azure: Introducing Ubuntu Pro Updates Awareness

In collaboration with Microsoft, Canonical introduces Ubuntu Pro update notifications into the Azure Update Management Center. This feature enables users to identify Ubuntu instances that aren't receiving all available security updates, including those delivered via Ubuntu Pro. Ubuntu Pro, a subscription by Canonical, provides enhanced security, maintenance, and compliance tools for organizations using Ubuntu on Azure.

Kubernetes Incident Management Best Practices

Creating just any infrastructure on Kubernetes is not enough. There are many basic configurations you could apply to stand up the infrastructure for your application, and for the time being it might work just fine. But incident response won’t always remain 100% reliable. You will run into newer potholes, and that’s okay.

Getting started with AWS CloudWatch

Amazon CloudWatch is one of the earliest of the more than 100 services that Amazon Web Services (AWS) provides. CloudWatch was announced on May 17th, 2009, and it was the 7th service released, after S3, SQS, SimpleDB, EBS, EC2, and EMR. AWS CloudWatch is a suite of tools that covers a wide range of cloud resources and includes log and metric collection; monitoring; visualization and alerting; and automated action in response to operational health changes.