Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Service Reliability Engineering and related technologies.

With SRE, failing to plan is planning to fail

People sometimes think that implementing Site Reliability Engineering (or DevOps for that matter) will magically make everything better. Just sprinkle a little bit of SRE fairy dust on your organization and your services will be more reliable, more profitable, and your IT, product and engineering teams will be happy. It’s easy to see why people think this way. Some of the world’s most reliable and scalable services run with the help of an SRE team, Google being the prime example.

Overview of Incident Lifecycle in SRE

Incidents that disrupt services are unavoidable. But every breakdown is an opportunity to learn & improve. Our latest blog is a deep dive into best practices to follow across the lifecycle of an incident, helping teams build a sustainable and reliable product - the SRE way As the saying goes, “Every problem we face is a blessing in disguise”.

On Not Being a Cog in the Machine

This is my first week here as the first dedicated SRE for Honeycomb, and in a welcoming gesture, I was asked if I wanted to write a blog post about my first impressions and what made me decide to join the team. I’ve got a ton of personal reasons for joining Honeycomb that may not be worth being all public about, but after thinking for a while, I realized that many of the things I personally found interesting could point towards attitudes that result in better software elsewhere.

7 Tips On Building And Maintaining An SRE Team In Your Company

In today's "always on" world, Reliability is a primary business KPI. Plant the culture of Reliability by implementing these 7 simple tips to build a solid SRE team in your organization. Many of today’s hottest jobs didn’t exist at the turn of the millennium. Social media managers, data scientists, and growth hackers were never heard of before. Another relatively new job role in demand is that of a Site Reliability Engineer or SRE. The profession is quite new.

Take the first step toward SRE with Cloud Operations Sandbox

At Google Cloud, we strive to bring Site Reliability Engineering (SRE) culture to our customers not only through training on organizational best practices, but also with the tools you need to run successful cloud services. Part and parcel of that is comprehensive observability tooling—logging, monitoring, tracing, profiling and debugging—which can help you troubleshoot production issues faster, increase release velocity and improve service reliability.

The Key Differences between SLI, SLO, and SLA in SRE

To incentivize reliability in your platform, there should be shared goals across your team to measure & quantify the capabilities of your product/service along with customer experience. Define the path of "Always-On" services by understanding few key SRE fundamentals and their implications - SLIs, SLOs & SLA. Framing SRE metrics for building or scaling a product is quite a daunting task.

2021 is the Year of Reliability

There’s no better time than now to dedicate effort to reliable software. If it wasn’t apparent before, this past year has made it more evident than ever: People expect their software tools to work every time, all the time. The shift in the way end-users think about software was as inevitable as our daily applications entered our lives, almost like water and electricity entered our homes.

Building and Scaling Your SRE Team

Building Site Reliability Engineering (SRE) teams is hard! There are so many articles and explanations of what SRE means, it’s easy to get lost. Going beyond understanding what the individual SRE role is into building and scaling a team of SREs is more of a challenge. It’s important to find the right information that will help you take your SRE team to the next level.

Top Observability tools for DevOps Engineers and SREs

Better visibility is the first step to improved system stability. Our latest blog outlines Top Observability tools for DevOps Engineers & SREs to help you get started on your journey to gain valuable insights into your infrastructure. “We can't fix something which we can't observe” - whether it's a steam engine or a complex microservice based cloud deployment, great observability makes troubleshooting things easier.