Operations | Monitoring | ITSM | DevOps | Cloud

September 2021

Reliability is not an engineering metric

If you're an engineer reading this, you might be wondering what I mean by the title. You might be a Site Reliability Engineer whose primary responsibility is to maintain the reliability of your company’s product/solution. You might be a software builder, a programmer responsible for building new capabilities and shipping them to production. All of these are important for any business to remain competitive.

SRE Back-to-School Checklist

Whether it's in classrooms or on Zoom calls, the kids have headed back to school! Bright-eyed students are gearing up to study new subjects and test their brains. Hopefully on their report cards, failure isn’t inevitable. Before the first day, parents load up their kids’ backpacks with everything they’ll need. Being well equipped with good supplies is the best way to stay focused and educate “reliably”. Likewise, SREs need the right tools and practices for the job.

Modern SRE Practices for Incident Management

At VMware, we make use of modern development and site reliability engineering (SRE) practices on a regular basis. And those of us who work on the VMware Tanzu Observability product marketing team regularly get exposure to various SRE teams that implement modern practices with the observability technology we create.

What is expected in the SRE role? We analyzed 30 job postings to find out.

In 2016, Google released the definitive book on Site Reliability Engineering (SRE) - a practice that had originated in the company to take care of a monumental problem - how to keep the Google services running with high reliability. Over the years, SRE has been widely adopted by dev teams across the globe and is a popular role at startups and enterprises alike. Here is a look at how search for SRE has trended over the years.

A Migration That Paid Tech Dividends

TL;DR: Old, deprecated code/infrastructure is a challenge that every engineer will come across. Remedy what you can and remember that some extra effort can go a long way. It can uncover issues that, when addressed, will save you in the future. Part of the challenge of software development is maintaining legacy code and infrastructure. When you ignore or neglect these, issues start to pop up and your reliability suffers, causing pain for your customers. The trick here is to actively steward each project.

DevOps vs. Agile

Curious about the differences between DevOps vs. Agile development methodologies? We'll explore and compare both approaches. What are the key differences between DevOps vs. Agile? Agile and DevOps are methodologies that share the goal of producing software quickly. In DevOps, Development and Operations work together closely throughout the software development lifecycle process. Agile is an iterative approach that focuses on deploying releases rapidly with small teams.

How Lowe's SRE reduced its mean time to recovery (MTTR) by over 80 percent

The stakes of managing Lowes.com have never been higher, and that means spotting, troubleshooting and recovering from incidents as quickly as possible, so that customers can continue to do business on our site. To do that, it’s crucial to have solid incident engineering practices in place. Resolving an incident means mitigating the impact and/or restoring the service to its previous condition.

Essential Tools for Site Reliability Engineers

Site reliability engineers (SREs) are involved in scaling systems and making them reliable and efficient for organizations. But SREs often fail to build system resiliency when they do not have the right tools at their disposal. In this post, we’ll uncover five leading tools that SREs can use to drive the reliability and stability of computing systems. It also examines how SREs can use the tools to improve operations tasks and infrastructure processes.