San Francisco, CA, USA
May 13, 2021   |  By Quentin Rousseau
Service Level Objectives (SLOs) are a key component of any successful Site Reliability Engineering initiative. The question is: what are SLOs, and how do you determine what your SLOs should be? Once you've done that, how should you use them?
May 6, 2021   |  By JJ Tang
Let's face it: on-call work isn't fun. But it can be better. Even if you have to work on call, it would be nice to have at least some of the work done for you before you drag yourself out of bed at 3 a.m. to respond to an incident.
Apr 29, 2021   |  By Quentin Rousseau
Kubernetes makes it easier in certain ways to manage reliability. But incident response teams and SREs must also be prepared to handle the unique reliability challenges that K8s creates.
Apr 22, 2021   |  By JJ Tang
How can creating chaos achieve better reliability? Chaos and reliability might seem mutually exclusive, but through the use of Chaos Engineering, SREs can bring about meaningful changes to system resiliency.
Apr 15, 2021   |  By Quentin Rousseau
SREs may have better long-term job prospects, but DevOps might be an easier career to pursue.
Apr 7, 2021   |  By JJ Tang
The Suez Canal has been big news over the last couple of weeks. We wondered how a Site Reliability Engineer (SRE) might conduct a postmortem on what happened with the Ever Given, and what that might mean if a comparable incident occurred at a modern tech company.
Apr 2, 2021   |  By Quentin Rousseau
By adding new complexity to reliability engineering and making physical war rooms a thing of the past, COVID-19 has imposed permanent changes on incident management. Here’s how SREs can respond.
Feb 7, 2021   |  By Camille Hodoul
Successful and blameless postmortems can turn incidents into a gift of learning and prevent repeat mistakes.

Rootly is a turnkey incident response command centre that brings the best reliability practices from Google, Netflix, and Amazon to those without a million-dollar budget.

Rootly is an all-in-one platform that streamlines collaboration, communication, and learning. It automates away the manual toil engineers suffer through today and captures data-driven insights. With Rootly, companies accelerate their incident resolution and learn how to prevent incidents in the future.

Teams depend on Rootly to improve their reliability:

  • Collaborate: Seamlessly hand off alerts from PagerDuty and quickly declare incidents from your tool of choice, such as Slack. Automatically involve all the right teams in seconds, not minutes. Go beyond engineering and loop in legal, support, and sales. With intelligent workflows, there is no more wondering which team owns which service or who should be responsible for what. Rootly does the heavy lifting for you.
  • Communicate: Build your incident timeline through Web or Slack. Autolink war rooms with our Zoom and Google Meet integrations. Rich, customizable private and public status pages keep everyone updated while you focus on what you do best: fighting fires.
  • Remediate: Enrich your timeline with automated Genius workflows. Fetch relevant information, such as recent Git commits to your impacted services. Customize your workflows based on any incident condition.
  • Retrospective: Learn from incidents with beautiful postmortems that engineers want to write, without the manual toil of copying and pasting. Accurately replay past incidents to simulate real-world disaster scenarios, train engineers faster, and keep their tools sharp. Postmortems stay organized and easily shared, not buried in a Google Doc that can’t be found.

All-in-one incident response platform for humans.