
The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Deploying to production in <5m with our hosted container builder

Fast build times are great, which is why we aim for less than 5 minutes between merging a PR and getting it into production. Waiting on builds is a waste of developer time and an annoying concentration breaker, and the speed at which you can deploy new changes also affects your shipping velocity. Put simply, you can ship faster and with more confidence when deploying a follow-up fix is a simple, quick change.

Training Intelligent Alert Grouping

Complex incidents are both exhausting and commonplace. By “complex,” I mean incidents that involve multiple, disparate notifications in your alert management platform. Perhaps these incidents are logically separated because the underlying systems or services were seen as less coupled than they turned out to be in reality.

Fail-Safe Digital Scheduler for On-Call Management

In this video, we discuss how OnPage's advanced, fail-proof digital schedules enable organizations to distribute workload evenly among scheduled On-Call team members. The OnPage scheduler starts out "FULL" and schedules are created on top of it. This guarantees that a notification is delivered reliably, even when a slot is left empty on the scheduler. In that case, the scheduler reverts to the default group order and the entire group is notified, ensuring continuous coverage across your organization.
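
The fallback behavior described above can be illustrated with a minimal Python sketch. This is not OnPage's implementation or API; the class `OnCallGroup`, its fields, the slot labels, and the member names are all hypothetical, chosen only to show the idea of a schedule layered on top of a "full" default group.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OnCallGroup:
    """Hypothetical model of a group with a schedule layered on top of it."""

    # Default group order: everyone is a recipient unless a schedule says otherwise.
    members: List[str]
    # Explicit schedule: slot label -> assigned members. Slots may be empty or missing.
    schedule: Dict[str, List[str]] = field(default_factory=dict)

    def recipients(self, slot: str) -> List[str]:
        """Return who should be notified for the given slot."""
        assigned = self.schedule.get(slot)
        if assigned:
            # The slot is filled: notify only the scheduled members.
            return list(assigned)
        # The slot is empty or undefined: fall back to the whole group
        # in its default order, so no notification goes undelivered.
        return list(self.members)


# Usage example with made-up names and slots.
group = OnCallGroup(
    members=["alice", "bob", "carol"],
    schedule={"monday": ["bob"], "tuesday": []},
)
print(group.recipients("monday"))   # scheduled member only
print(group.recipients("tuesday"))  # empty slot, so the entire group
```

The design choice worth noting is that the default is "notify everyone": a schedule can only narrow the recipient list, so a gap in the schedule widens coverage rather than silencing the alert.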

Tis The Season: Protect Your Availability During The Holidays

Deck the halls! It's time for the annual holiday Code Freeze, that festive time of year when businesses impose a precautionary halt to code changes and Operations should be quiet. But before you kick up your feet, make sure that demand doesn’t lead to availability embarrassments. After all, retail experts suggest that we’re in for another online-heavy holiday shopping season, so businesses need to brace for increased digital traffic...with little tolerance for failure.

Partner Integration on Twitch: Lacework

Lacework delivers complete #security and #compliance for the cloud. While the cloud enables enterprises to automatically scale workloads, deploy faster, and build freely, it also makes it increasingly difficult to maintain visibility, remain compliant, stay free from known vulnerabilities, and track activity in both host workloads and ephemeral infrastructure within their environments. Integrate Lacework with PagerDuty to route Lacework Events to responders on your team. Manage and resolve configuration issues, behavioral anomalies, and compliance requirements in a timely manner across your cloud infrastructure.

5 ways incidents made me a better engineer

Incidents are a great opportunity to gather both context and skill. They take people out of their day-to-day roles and force ephemeral teams to solve unexpected and challenging problems. In my career, I've found incidents can be a great accelerator, both for me and for others around me. It was after leading my first incident at GoCardless that I started to feel really comfortable in the codebase and the team.

Fall 2021 Launch: Automate Incident Response to Accelerate Critical Work

Modern businesses are digital businesses—so managing your business means mastering your critical services and operations for your employees and customers. Today, you need to be able to understand every aspect of your company—as it unfolds—because in this world, seconds matter to your productivity, your revenue, and most importantly, your customers.

IT Failures are Inevitable

As infrastructure stacks grow increasingly complex and involve an ever-growing number of services, system failures are becoming more and more common. Systems fail for a variety of reasons: software bugs, misconfiguration, interactions between services that cause unexpected behavior, network outages, and, on rare occasions, natural events that render data centers inoperative.