What’s one of the fundamental principles of DevOps? Automation. There are many ways to leverage automation to facilitate DevOps practices for enabling consistency, reliability, and efficiency within the organization. That’s why we’re taking serious strides to ensure that xMatters can allow full automation and coordination of the many tools we use to make incident management easier and more efficient for front-line responders.
Getting your pricing right is critical to the success of any SaaS company, but finding a model that works can be tough. Price too high, you won’t close enough deals - your business will fail. Price too low, your business model will be unsustainable - your business will fail. To add to the complication, when you’re a new startup your goals are evolving.
AIOps combines machine learning and people to deliver technical outcomes in IT operations. The promise of this capability continues to drive new contenders to the market. AIOps has become a core messaging component for all the major event management players. Many have just rebranded their products to specifically highlight AIOps features. Emerging event management players have arrived and tried to also claim the AIOps space.
Endpoint protection is a security approach that focuses on monitoring and securing endpoints, such as desktops, mobile devices, laptops, and tablets. It involves deploying security solutions on endpoints to monitor and protect these devices against cyber threats. The goal is to establish protection regardless of the endpoint’s location, inside or outside the network.
Dave Mangot, author, speaker, and consultant, sits down with Moogsoft to discuss moving to SaaS from On-Prem in the premiere episode of “Mooving To…”
Severity and priority can be challenging for a company to nail. When an incident is declared, it's essential to have a system to define the impact and how urgently it should be handled. Incident severity and priority are the two knobs teams can leverage to define scope and urgency, and eventually, the appropriate process to take action. But how should we define them, and what are the differences?
Picture yourself trying to resolve a code error when you notice an additional issue outside your realm of expertise that's making matters worse. Your instinct is to get in touch with the right contact as quickly as possible to resolve the issue so that there's no further impact on the system's uptime. But what if you can't get in touch with them immediately, or don't know who to contact? Instead of trying to solve the problem without support, a DevOps toolchain could have mitigated this chain reaction from the start.
It is a well-established fact that companies looking to grow in the digital age can facilitate this mission by adopting the cloud. When pursued with the right intent and implementation strategy, cloud adoption acts as a powerful force multiplier, yielding a cutting-edge IT powerhouse for businesses and helping them grow and innovate at an accelerated pace. Organizations that adopt a cloud-first strategy must safeguard themselves from critical, service-disrupting incidents.
We are excited to announce that PagerDuty is now an approved AWS Financial Services Competency Partner. We’re looking forward to expanding our global reach and helping financial services organizations accelerate their cloud migration and digital acceleration journeys. This will allow us to further streamline and automate financial service companies’ digital operations while helping them reduce risk and manage compliance requirements.
In this episode of Mooving to… Stability: The Role of Catastrophic Failure in Software Design, we had the opportunity to chat with Jeff Atwood, yes, that Jeff Atwood, of Coding Horror, Stack Overflow, and Discourse (Chief Happiness Officer). Jeff started writing 911 software in Boulder, Colorado for a small company, which was a crash course in writing code for software that has real consequences. With this unique and deep perspective, B.J.
This post highlights some of the features and improvements that we have released in the last 6 months.
We’re a small startup (10 people at time of writing) with big ambitions, particularly when it comes to our product. With so many things we want to do, it’s important for us to be structured in the way we approach our work, without being so process-driven that we lose all the benefits of being small and nimble. As we’re still new, and the team is growing all the time, very little is set in stone.
We wrote this article in response to a question asked in our Slack Community. We know a thing or two about incident response. As such, we're often asked to advise when companies are designing their incident response processes. A common question is "How do you design your incident severity levels?". It's a great question given how central they are to incident response!
When you think of who uses feature flags, your mind most likely goes to developers. In general, feature flags are closely associated with software engineering. But Site Reliability Engineers, too, can benefit from feature flags. SREs may not be the ones to create feature flags, but they should work closely with developers to ensure that the applications their teams support include feature flags.
Runbooks have been a game changer for many incident response teams, and we just made it easier for you to get up and running with them. Runbooks reduce toil for responders and ensure consistency in your incident management processes. In the thick of trying to resolve an issue, remembering things like emailing customers is likely the last thing on responders' minds, yet forgetting to do so can be detrimental.
In the past five years, DevOps adoption has almost doubled. In fact, 74 percent of companies now use DevOps in some form. As a growing number of organizations seek to implement DevOps practices, the need for qualified DevOps engineers is soaring. But what exactly does a DevOps engineer do, and what skills are required to succeed in this in-demand role?
Over the last few years, our world has become increasingly digital, from streaming and shopping to work and health care. Customers want these digital experiences to be seamless. This has become a key priority for all businesses as well, as they depend on happy customers to drive sales and brand reputation. To ensure these seamless digital experiences, technology teams have doubled down on reliability, user experience, and building new features.
With our February update, it is now possible to centrally configure how Signls should be notified. And of course, each team can have a different configuration of their notification preferences. This also includes response and escalation settings. In addition, it is now possible to set different notification patterns per day and time of the day, e.g. to notify via different channels at night than during office hours.
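As a rough illustration of notifying through different channels at night than during office hours, the idea can be sketched as below. This is a minimal sketch and not SIGNL4's actual configuration model: the function name `channels_for`, the channel names, and the 09:00 to 17:00 weekday window are all assumptions made for the example.

```python
from datetime import datetime, time

# Hypothetical office-hours window (assumption for illustration only).
OFFICE_START = time(9, 0)
OFFICE_END = time(17, 0)


def channels_for(moment):
    """Pick notification channels based on the day and time of day.

    Weekdays inside the office-hours window use quieter channels;
    nights and weekends escalate to a voice call.
    """
    is_weekday = moment.weekday() < 5  # Monday is 0, Sunday is 6
    in_office_hours = OFFICE_START <= moment.time() < OFFICE_END
    if is_weekday and in_office_hours:
        return ["push", "email"]
    return ["push", "voice_call"]


# A Monday morning versus the same Monday late at night.
daytime = channels_for(datetime(2022, 2, 14, 10, 0))
night = channels_for(datetime(2022, 2, 14, 23, 0))
```

In a real product each team would hold its own rules, so the window and channel lists would come from per-team settings rather than module constants.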
Based on our newfound data feet, we’ve started consistently tracking the adoption rate of our latest features. As it happens, we’ve been impressed with the results! For example, we were delighted to see that our new tutorial flow was completed end-to-end by 35% of our users (against an industry average of less than a quarter for 6-step product tours like ours). I know, I know: being at such an early stage means it is arguably easier to hit customer needs on the head.
At the time of writing this post, I have officially been at Honeycomb for one year as a site reliability engineer (SRE). I had shared my initial experiences and impressions in this post and thought it would make sense to check back in now that I’ve had the opportunity to spend time learning about the team, the culture, and the code base more in depth.
Change management is an organized, structured approach with methods that enable healthcare organizations to transform workflows seamlessly. Organizational change management requires the collective involvement of C-level executives and stakeholders to successfully implement changes within a care facility. Change is required when individuals, processes, teams, and tools cannot keep pace with the ever-changing needs and expectations of the organization.
If you love Jira then you probably love customization, and we’ve made your integration with Jira Cloud and Jira Server even better with multi-project support! You can now route your incident tickets and follow-up work to remediation teams' Jira projects directly from FireHydrant, saving you valuable time and clean-up work. Let’s take a look at what has changed and some additional use cases unlocked with this integration.
At PagerDuty, we invest a significant part of our time listening to our customers. Based on what we have learned from those conversations, we are adding a new set of features to our Slack Integration. These features will make leveraging PagerDuty from Slack even more seamless and allow Incident Responders to conduct their work without switching context, expediting response times, and ultimately maintaining high customer satisfaction.
There’s no one-size-fits-all incident response process. Depending on your organisation’s shape and size, you’ll have different requirements and priorities. But the same three pillars form the core of any good process, whether it’s for the largest e-commerce giant or a scrappy SaaS startup.
Maintaining the reliability of complex services just got easier with Operational Readiness Checklists. Service owners and engineering leaders can now evaluate and maintain the production readiness of the services their users rely on every day: spot risks in your service dependencies before they cause incidents, and respond quickly if they do. Before you put a new service into production, readiness checklists help you dot your i’s and cross your t’s.
System outages are the worst nightmares for IT support teams, but they also provide an opportunity to stand out. During a major service outage, customers are often impacted a lot more because they have much less information about what is happening. Some of the biggest outages that affected users all over the world last year include those of Slack, PlayStation, Airbnb, FedEx, and Amazon.
A list of the top nine SRE skills, from incident management, to cloud computing, to networking and beyond.
SIGNL4 integrates with various backend systems like IT monitoring, service management, IoT systems, sensors, etc. to automatically alert users and teams about certain incidents. A list of selected tools along with integration descriptions is available in our integrations section. How can you integrate SIGNL4 with your own tools? In the following we list some options offering different levels of sophistication.
What makes an engineering team? Communication, collaboration, process, order, and common goals. Otherwise, they would just be a bunch of engineers. The same is true of their tools. Connectivity and process turn a bunch of tools into a DevOps toolchain. If you have a DevOps toolchain, you can use it to easily build an incident response process.
A question we are hearing often is related to manager escalations, a heavily utilized feature in SIGNL4. Users ask us if those managers can be scheduled. The short answer is ‘yes’, but you need to use a different feature in SIGNL4 and do a little re-configuration.
Every second counts when IT teams are called upon to resolve business-impacting issues. In modern enterprises, poor communication, fragmented toolchains, and spiralling IT complexity can conspire to slow down incident response, putting service availability and ultimately customer satisfaction in peril.
The role of an engineer at a startup is a tangled web: as well as writing code, you have to be your own product manager, QA tester, customer support and designer. But there’s another hat that you have to wear which you might not have thought about: copywriter. All products have copy, from welcome messages to text on a submit button. At incident.io, we have to put on our copywriting hats every time we add a new feature.
We have released an update for Enterprise Alert 9 (version 9.3) that revolutionizes our OPC connector and also includes some bug fixes. Read all the details in this article.
From the first alert, through the incident itself, to the post-incident follow-up, great communication is the key to effective incident response.
When downtime strikes any distributed software deployment or platform, it's all hands on deck until the lights are green and service is restored. This process, from the recognition of a problem to a deployed solution, has most commonly been defined as MTTR - mean time to resolution. In just the last few years, DevOps and site reliability engineering (SRE) professionals have developed sophisticated new models for how they work and audit their successes. In 2022, MTTR is one of the most widely used software performance success metrics.
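To make the definition concrete, MTTR can be computed as the average time between recognizing a problem and deploying its fix. The sketch below assumes each incident is recorded as a pair of timestamps (detected, resolved); the function name `mean_time_to_resolution` and the sample data are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved - detected) over a list of timestamp pairs."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    # Sum the per-incident durations, starting from a zero timedelta.
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: one 30-minute and one 90-minute outage.
incidents = [
    (datetime(2022, 3, 1, 9, 0), datetime(2022, 3, 1, 9, 30)),
    (datetime(2022, 3, 4, 14, 0), datetime(2022, 3, 4, 15, 30)),
]
mttr = mean_time_to_resolution(incidents)
```

Because it is a plain mean, one very long outage can dominate the figure, so teams often look at the distribution of resolution times alongside it.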
If you’re working at an early stage startup and looking to get some good incident management foundations in place without investing excessive time and effort, this guide is quite literally for you. There’s an enormous amount of content available for organisations looking to import ‘gold standard’ incident management best practices – things like the PagerDuty Response site, the Atlassian incident management best practices, and the Google SRE book.
Last year, we released PagerDuty Rundeck Actions, a PagerDuty add-on product that connects responders to automated diagnostics and remediation for common problems directly in the PagerDuty incident response workflow. After working with our customers and listening to the community, we are excited to announce that PagerDuty Rundeck Actions now integrates with PagerDuty’s Slack integration.
A huge challenge when dealing with incidents is the coordination and communication needed to put things right. What’s happened so far? Who has tried what query? Did we remember to keep stakeholders informed? What is the severity of the incident? Does this affect customers? Figuring this out requires a lot of back and forth as new team members join the incident.
Today we’re announcing the general availability of Grafana OnCall on Grafana Cloud for all paid and free plans. A big part of delivering great software is ensuring the right people get the right information when the inevitable incidents occur. We want to help you do that with Grafana OnCall, an easy-to-use, developer-first on-call management tool that’s built on top of the Grafana stack you know and love.
You may have heard of Round Robin Scheduling before and thought to yourself, is this right for my team? Understanding how Round Robin Scheduling can be used and what teams it works best for is important when considering this method of on-call. Additionally, it comes with some pitfalls you’ll want to avoid, as well as best practices to adopt. In this blog post, we’ll share everything you need to know about Round Robin Scheduling within PagerDuty and how to get started.