
January 2024

How Organizations Hire SREs: Laterals or Internal?

Securing reliable system operation necessitates building a formidable Site Reliability Engineering (SRE) team. However, a critical strategic decision confronts every organization: do we cultivate SRE talent internally or venture into the external talent pool? Both approaches possess distinct advantages and disadvantages, each impacting the composition, skillset, and overall effectiveness of the SRE team.

8 Strategies for Reducing Alert Fatigue

Site Reliability Engineers (SREs) and DevOps teams often deal with alert fatigue: so many alerts arrive that it becomes hard to keep up, which slows response times and adds stress on top of existing responsibilities. According to a study, 62% of participants noted that alert fatigue played a role in employee turnover, while 60% reported that it resulted in internal conflicts within their organization.

Tech is Easy, People are Hard - Incidentally Reliable with Suresh Kumar Khemka (Head of Infra @apna)

Settle in and listen to Suresh Kumar Khemka (Head of Platform & Infra at apna) talk about platform engineering, balancing bureaucracy and velocity at startups and tech giants, and the ripple effects of downtime at an e-commerce company. Exclusively on The Incidentally Reliable podcast, made by SREs for SREs and hosted by Zenduty.

Non-Abstract Large System Design (NALSD): The Ultimate Guide

Non-Abstract Large System Design (NALSD) is an approach where intricate systems are crafted with precision and purpose. It holds particular importance for Site Reliability Engineers (SREs) due to its inherent alignment with the core principles and goals of SRE practices. It improves the reliability of systems, allows for scalable architectures, optimizes performance, encourages fault tolerance, streamlines the processes of monitoring and debugging, and enables efficient incident response.

How to Calculate and Minimize Downtime Costs

Downtime is an unwelcome reality. Beyond the immediate disruption, outages carry a significant financial burden, affecting revenue, customer satisfaction, and brand reputation. For SREs and IT professionals, understanding the cost of downtime is crucial to mitigating its impact and building a more resilient infrastructure.