"The secret to success is to be in harmony with and adapt to the ever-changing trends and patterns of the world."
- Bruce Lee
I’ve spent most of my career trying to solve big problems for people. In the early days at New Relic, we were trying to help people scale their systems without compromising on performance, cost, or the customer experience. Not an easy feat, but we gave them a solution that allowed them to accomplish their goals. The key was religiously listening to our customers talk about their wants, needs, hopes, and fears. While I am rarely the smartest person in the room, as my partner rarely misses a chance to lovingly remind me, I always do my best to listen to what the brilliant folks in my sphere are talking about.
That’s critical to how we work at Blameless. It's both thrilling and humbling to witness the evolving landscape of Site Reliability Engineering. SRE practitioners and leaders are shaping the future of software development as we speak. Given my admittedly privileged position, I’d like to share some of what I’ve garnered about the year ahead with you. So, let's dive into what I think are the top five trends being discussed by SRE leaders in 2023.
1. SRE is rising to meet a reliability crisis
This sounds dramatic, but once you dive in, you’ll find that it’s maybe an understatement. Think about when you see tech companies on the front page of news sites. Is it when they release new features? Report profitable quarters? Maybe sometimes. But when there’s a big outage? Every single time. Services like Netflix, platforms like Facebook, and providers like Rogers have become so huge and integral to our lives that any disruption to them is newsworthy.
When you think holistically about how the dominoes fall on these outages – from the revenue lost while the service was unavailable, to the frustrated customers who quit the service, to the long-term damage to the brand reputation and stock value – the stakes become clear. Unreliability is easily a 100 billion dollar per year problem for enterprise organizations.
There’s no overnight fix to make everything reliable. The complexity of these services, and the demand for them, have outpaced investments in their reliability for too long. To close this gap, orgs need to build an SRE practice from the ground up.
2. AI will accelerate the reliability crisis
As in every other field in tech, AI will have a big impact on SRE. By letting teams move faster, in more directions, with less manual intervention and oversight, AI can compound the complexity of your services and, with it, the chance of incidents. The faster you move, the more often you break things.
We’re not AI doomsayers, though, and we don’t think these risks should be enough to dissuade you from experimenting with AI in your services. Instead, you just need to offset your investments in AI with investments in reliability. Good SRE practices can mitigate the unreliability risks of AI, including:
- Having safeguards based on SLOs that slam the brakes on AI experiments when users might become unhappy
- Building robust and consistent incident management processes that effectively handle incidents, even those from novel sources
- Operating with easy-to-rollback releases that can minimize the damage a bad commit could cause
Make your AI investments more beneficial by getting ahead of the risk factors.
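To make the first safeguard concrete, here is a minimal sketch of an error-budget gate in Python. Everything in it is an illustrative assumption rather than a Blameless feature or a standard API: the `SLO` class, the function names, and the 25% `min_budget` threshold are hypothetical, and a real setup would pull request counts from your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """A hypothetical availability SLO, e.g. target=0.999 for 99.9%."""
    target: float


def error_budget_remaining(slo: SLO, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo.target) * total
    if allowed_failures <= 0:
        # A 100% target leaves no budget: any failure exhausts it.
        return 1.0 if failed == 0 else 0.0
    return 1.0 - failed / allowed_failures


def experiment_may_continue(slo: SLO, total: int, failed: int,
                            min_budget: float = 0.25) -> bool:
    """Pause the experiment once less than `min_budget` of the budget is left."""
    return error_budget_remaining(slo, total, failed) >= min_budget
```

With a 99.9% target and 100,000 requests, the budget is roughly 100 failed requests. Fifty failures leave about half the budget, so the experiment keeps running; ninety failures drop it below the threshold and the gate slams the brakes.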
3. Embracing Observability: Unleashing the Power of Data
Observability has long been a hot topic in the SRE world, but in 2023, it has taken center stage. SRE leaders are increasingly recognizing the immense value of data in understanding system behavior and identifying bottlenecks. With the rise of complex, distributed systems, traditional monitoring approaches no longer suffice. Observability shifts the focus from merely monitoring to gaining deep insights through logs, metrics, and traces.
Gone are the days of relying on simplistic metrics to gauge system health. Modern SREs understand that a comprehensive observability strategy empowers them to detect anomalies, optimize performance, and troubleshoot issues proactively. Leveraging tools like distributed tracing and advanced logging systems, SRE leaders are able to generate rich telemetry data that provides a holistic view of their systems' behaviors, enabling them to make informed decisions and deliver reliable services to end-users.
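As a rough illustration of what richer telemetry can look like, the sketch below wraps a call and emits one structured JSON event carrying a trace ID, a duration, and a status, so that logs, latency numbers, and traces can be correlated afterwards. It uses only the Python standard library; `make_event` and `timed_call` are hypothetical helpers invented for this example, not part of any tracing product, and most teams would reach for a dedicated tracing library instead.

```python
import json
import time
import uuid


def make_event(name, trace_id, duration_ms, **attrs):
    """Build one structured event that can be joined to others by trace_id."""
    return {"event": name, "trace_id": trace_id,
            "duration_ms": round(duration_ms, 3), **attrs}


def timed_call(name, fn, trace_id=None, log=print):
    """Run fn(), then emit a JSON event with its duration and outcome."""
    trace_id = trace_id or uuid.uuid4().hex  # new trace if the caller has none
    start = time.perf_counter()
    status = "ok"
    try:
        return fn()
    except Exception:
        status = "error"
        raise  # the caller still sees the failure; we only record it
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        log(json.dumps(make_event(name, trace_id, duration_ms, status=status)))
```

Passing the same `trace_id` through every service that handles a request is what lets you stitch the individual events back into a single story when something goes wrong.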
4. Adopting Infrastructure as Code: Automating for Resilience
In 2023, SRE leaders are embracing the power of Infrastructure as Code (IaC) to drive operational efficiency and enhance resilience. The days of manually provisioning and managing infrastructure are dwindling, as organizations recognize the need for infrastructure to be treated as software. By defining infrastructure in code, SRE teams can automate the provisioning, configuration, and deployment processes, reducing human error and increasing scalability.
IaC enables teams to spin up and tear down infrastructure rapidly, facilitating experimentation and faster iterations. It also promotes version control and collaboration, ensuring that infrastructure changes are auditable and reproducible. SRE leaders understand that embracing IaC not only accelerates development cycles but also allows them to build and maintain reliable systems in a consistent, predictable manner.
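The core idea behind most IaC tools is to compare the state you declared with the state that actually exists, then work out the changes needed to reconcile the two. Here is a toy sketch of that comparison in Python. The `plan` function and the resource names are made up for illustration; this does not reflect how any particular tool is implemented.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compare declared resources with what exists and list the changes needed."""
    create = {k: v for k, v in desired.items() if k not in actual}
    update = {k: v for k, v in desired.items()
              if k in actual and actual[k] != v}
    delete = sorted(k for k in actual if k not in desired)
    return {"create": create, "update": update, "delete": delete}


# Hypothetical declared state, the kind of thing you would keep in version control.
desired = {
    "web": {"size": "medium", "count": 3},
    "db": {"size": "large", "count": 1},
}
# Hypothetical state currently running.
actual = {
    "web": {"size": "small", "count": 3},
    "cache": {"size": "small", "count": 1},
}

changes = plan(desired, actual)
```

Because the desired state is just data in a file, every change to it can be reviewed, versioned, and replayed, which is where the auditability and reproducibility come from.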
5. Prioritizing Human Factors: Culture, Collaboration, and Well-being
As technology advances, it's crucial not to lose sight of the human element. In 2023, SRE leaders are recognizing the significance of cultivating a strong engineering culture, fostering collaboration, and prioritizing the well-being of their teams. Building robust systems involves more than just technical expertise—it requires effective communication, shared ownership, and a culture of learning and continuous improvement.
SRE leaders are investing in creating a supportive environment that encourages knowledge sharing, cross-functional collaboration, and psychological safety. They understand the value of learning from failures, promoting blameless post-mortems, and embracing a growth mindset. Additionally, leaders are prioritizing the well-being of their teams, recognizing the importance of work-life balance, mental health support, and professional development opportunities.
In 2023, SRE leaders are navigating the ever-changing landscape of software reliability by confronting the reliability crisis, preparing for the risks AI introduces, embracing observability, adopting Infrastructure as Code, and prioritizing human factors. These leaders are paving the way for resilient, scalable, and reliable systems. We here at Blameless are very excited to be supporting those efforts. If you want to learn more about how effective incident management intersects with these trends, please don’t hesitate to reach out.