Podcast: Break Things on Purpose | Natalie Conklin: Learning to Embrace Change

May 3, 2022 By Julie Gunderson In Gremlin

Natalie Conklin, tamer of chaos and Head of Engineering here at Gremlin, joins us to talk about embracing change, working alongside each other, and building more reliable systems. Natalie has an upcoming talk at DevOpsDays Boise titled “Embracing Change Fearlessly.” It focuses on enabling teams to take calculated risks and on having the guts to take them. Natalie spent time working in India, which helped solidify her “fearlessly” philosophy.

Site Reliability Chats (April 27, 2022)

Apr 27, 2022 By Gremlin In Gremlin

Site Reliability Chats (Apr 20, 2022)

Apr 20, 2022 By Gremlin In Gremlin

In this episode Julie and Jason share updates on the Atlassian outage, a new incident at Cerner, and problems at the IRS. They also cover post-incident investigations from Cloudflare and Datadog.

Podcast: Break Things on Purpose | JJ Tang: People, Process, Culture, Tools

Apr 19, 2022 By Jason Yee In Gremlin

For this episode we’re continuing to “Build Things on Purpose” with JJ Tang, co-founder of Rootly, who joins us to talk about incident response, the tool he’s built, and his many lessons learned from incidents. Rootly aims to automate some of the more tedious work around incidents and to keep that work consistent. JJ chats about why he and his co-founder built Rootly and the problems they’re trying to fix and eliminate when it comes to reliability.

Chaos Engineering & Autonomous Optimization combined to maximize resilience to failure

Apr 14, 2022 By Kyle McMeekin In Gremlin

Today’s enterprises are struggling to cope with the complexities of their environments, technologies, and applications. On top of these challenges, they face faster release rates, and the need to always deliver the highest level of performance and availability to end-users, at the lowest possible cost.

Site Reliability Chats (Apr 13, 2022)

Apr 13, 2022 By Gremlin In Gremlin

In this episode, Julie and Jason cover recent outages of the Dutch NS trains, American Express, and the ongoing, long-running incident at Atlassian. In positive news, they cover the acquisitions of Puppet by Perforce and Chaos Native by Harness, and Grafana Labs' Series D funding.

Podcast: Break Things on Purpose | Elizabeth Lawler: Creating Maps for Code

Apr 5, 2022 By Jason Yee In Gremlin

For this episode of “Build Things on Purpose” we are joined by Elizabeth Lawler, founder of AppLand, the creators of AppMap. Elizabeth is here to chat about the challenges of building modern, complex software and the tool she has built, which serves as a “Google Maps for code” for developers. AppMap is designed to present code in a more visually driven way, helping to clarify it in real time as developers write it.

Getting started with DNS attacks

Mar 31, 2022 By Andre Newman In Gremlin

Whenever an online service goes down, you're likely to hear three words: "it was DNS!" Blaming DNS might be a running joke among network admins and engineers, but it's one rooted in experience. DNS problems are known for causing massive, Internet-wide outages such as the 2021 Akamai outage that temporarily made the websites for Delta Air Lines, American Express, Airbnb, and others unreachable.

Site Reliability Chats (Mar 30, 2022)

Mar 30, 2022 By Gremlin In Gremlin

In this episode, Jason is joined by special guest Mandi Walls, DevOps Advocate at PagerDuty. They discuss certificate-related reliability issues, updates to GitHub's ongoing MySQL incidents, Log4j problems, and Pokémon.

Site Reliability Chats (Mar 23, 2022)

Mar 23, 2022 By Gremlin In Gremlin

In this episode, we chat about GitHub's recent outage and dive into the incident report from their previous outage in February. We also discuss the latest npm controversy regarding open source, politics, and protests. Our final segment covers an update to a news piece that we featured in our very first episode.
