
Latest Posts

The Role of the SRE in the Incident Management Process

In modern businesses, where IT systems underpin nearly every operation, the Site Reliability Engineer (SRE) has become central to keeping the whole business effective and reliable. SREs bridge the gap between the rapid deployment of software and systems and their stable operation in production. They ensure that reliability and performance criteria are defined and met.

NIST Incident Response Steps & Template

The National Institute of Standards and Technology (NIST) provides a framework to help businesses mitigate cybersecurity risks. The framework also helps protect networks and data, outlining best practices to inform decisions that save time and money. Creating a cybersecurity strategy that helps you identify, protect against, detect, respond to, and recover from cybersecurity incidents is critical in the evolving threat landscape.

How to Comply With the SEC's New Cybersecurity Rule

On July 26, 2023, the Securities and Exchange Commission (SEC) introduced new rules regarding cybersecurity risk management, strategy, governance, and incidents. Public companies subject to reporting requirements must comply with the changes to avoid rescission and other monetary penalties, not to mention the risk of legal action and reputation damage. Here, we look at the two new cybersecurity rules and how your company can comply.

The Power of Building a Blameless Culture in IT Operations

In the world of high-scale, high-availability, high-performance web applications, mistakes in IT operations are inevitable. Systems fail, bugs slip through, and outages occur. How your team responds to these incidents significantly affects its productivity, morale, and effectiveness. A blameless culture is crucial to driving the behaviors that make your business a success.

Getting Buy-in from Management on Reliability Investments

If you’re reading the Blameless blog, you probably have a good idea of how important reliability is to your customers’ happiness, your business’s bottom line, and your overall sanity. Unfortunately, this perspective is frequently downplayed by management. Even if they understand the importance of reliability, they often see it as something that should emerge automatically from having the right mindset, and not something that requires investment.

Blameless CommsAssist - 3 Tips on Making Incident Communication Easy

When you’re in the thick of an incident, communication is both essential and challenging. A wide variety of stakeholders need timely updates on the situation in order to respond effectively. At the same time, breaking away from the actual diagnosis and resolution work to send these updates can massively slow progress.

The Catchpoint 2024 SRE Report - Five Key Takeaways

Having emerged into the mainstream only in the 2010s, SRE is a relatively new discipline in tech. It has been rapidly adopted by a widening variety of organizations, whose practices are constantly evolving. For the last six years, Catchpoint has been running a survey to take the temperature of the latest developments and trends. Check out the full report, and read on for our analysis of five key takeaways.

A Little Resilience Goes A Long Way

Let’s call this the mother of all understatements. If you’re reading this blog, there’s a good chance that you: a) Agree wholeheartedly with this sentiment and think it should go without saying, AND… b) Are surrounded by folks who pay lip service to this idea while not taking it as seriously as they should.