Portland, OR, USA
Jan 7, 2021   |  By Scott Fitzpatrick
When organizations adopt Agile development practices, they also adopt shorter release cycles. These shorter release cycles mean more frequent application deployments – and more frequent deployments increase the risk of an unstable release.
Jan 6, 2021   |  By Cordny Nederkoorn
In the current IT market, one of the hottest job roles is the Site Reliability Engineer (SRE). In January 2019, according to LinkedIn, SRE was the second most promising job in the USA. In this post, we will look at what an SRE does in their daily work, a little of the history of Site Reliability Engineering, its foundations, and how you can become an SRE.
Dec 30, 2020   |  By Chris Tozzi
DevOps has transformed the way we think about the roles of the IT team. Now, IT engineers don’t just maintain software post-deployment. They are expected to collaborate closely with developers (and vice versa) to fix bugs, design features, and so on. But how far, exactly, should IT engineers go in this role? In other words, how much does the IT team really need to know about coding and the “left” part of the continuous deployment pipeline?
Dec 28, 2020   |  By Chris Tozzi
Being a developer once meant writing code, possibly testing and building it, and then calling it a day. In the era of DevOps, however, the work performed by developers has changed tremendously. Today, developers are expected to collaborate extensively with IT teams, and vice versa, to help manage everything that happens to code after it passes out of developers’ hands – that is, after it is built and deployed into production.
Dec 17, 2020   |  By Steve Tidwell
Modern IT has changed enormously over the last few decades and created a plethora of new opportunities for individuals and businesses. In the process, what were once considered traditional “IT” roles have evolved significantly.
Dec 15, 2020   |  By Steve Tidwell
Poorly implemented postmortems can be painful for everyone involved; they cost money, and worse yet, they can fail to address the root cause of the problem. In this post, we will discuss some of the pitfalls of postmortems and introduce several best practices that can help smooth the postmortem process — including choosing the right personnel, creating a culture of accountability, and conducting “blameless” postmortems.
Dec 10, 2020   |  By Joshua Thorngren
Site Reliability Engineering (SRE) is a rapidly growing discipline. While SRE originated at tech giants like Google and Facebook, SRE principles are now being practiced at companies of all sizes. Still, SRE is an emerging discipline, and there are many different philosophies on how to implement it for success. In this blog, I share a list of the top 10 SRE books that helped me understand SRE core tenets, practical implementation, and the relationship between SRE and other methodologies like DevOps.
Dec 1, 2020   |  By Chris Tozzi
If you work in IT Ops, SRE, or DevOps, you don’t need to be told that every second counts in incident response. You already know that. The challenge for most incident response teams, however, lies in figuring out how you actually improve incident response speed. Beyond obvious, basic steps – such as taking advantage of automation tools for alerting and monitoring, and doing effective post-mortems – strategies for making incident response faster can be elusive.
Nov 18, 2020   |  By Steve Tidwell
DevOps and SRE are sometimes referred to as competing or separate disciplines. This post looks at DevOps vs. SRE, showing that they are not really competitors but rather complementary to one another. We will also explain how to implement DevOps as a culture by using SRE principles and best practices – with a bit of Agile thrown in for good measure.
Nov 17, 2020   |  By Mike Mackrory
When something goes wrong in your production environment, you want your best and brightest minds to start working on the problem as soon as possible. The best time for me to work on a production problem is when I’ve got my day planned, I’ve just finished my morning coffee, and my mind is primed and ready for action. Unfortunately, production incidents seldom occur at this time; usually, it’s more like 3:24 AM.
Dec 1, 2020   |  By Jeannie Christensen
This two-minute video demo shows how StackPulse works with GitHub Actions.
Oct 12, 2020   |  By Nuaware
Leonid, co-founder and CTO of StackPulse, joins Luke to discuss SRE platform automation and why incident response as code is the future for enterprises.

StackPulse empowers SREs and developers to reduce toil, remediate incidents faster, and build more reliable services.

StackPulse gives engineers and SREs everything needed to build and run more reliable services - during incidents, at deployment, or when writing code. By centralizing and automating reliability across an entire environment, StackPulse makes it easy to manage services in production at any scale.

Let’s build a more reliable world:

  • Automated Alert Enrichment: When an alert is triggered, StackPulse automatically triages and enriches the alert with impact, environment details, and root cause analysis. This context is delivered in real time to on-call teams, simplifying remediation and helping drive down both MTTD and MTTR.
  • Powerful Playbooks to Reduce Toil: StackPulse playbooks are powerful code-based workflows that investigate, remediate, and maintain your software services, reducing toil for your teams. Playbooks can be imported from StackPulse’s playbook library, built with a drag-and-drop step builder, or deployed as part of a GitOps workflow.
  • Centralized Knowledge and Insight: StackPulse automatically documents and analyzes incident details and remediation patterns - centralizing tribal knowledge and delivering recommendations to proactively improve your services.
  • Easy to Integrate; Simple to Scale: StackPulse deploys in minutes, with out-of-the-box integrations to your existing alerting, on-call, and compute stacks. With a consistent framework for triggering and executing playbooks, StackPulse lets you modify or scale your underlying environment with no impact on your incident response practice.
  • Best Practice Playbooks: The StackPulse playbook library contains tested and verified steps and playbooks - making it easy to quickly improve your MTTR and MTTD in common scenarios. Playbooks can be easily exported or shared for portability.

Reliability at your fingertips.