
Alerting

"TRIBAL KNOWLEDGE" (noun): That thing you should have done, if only someone had told you.

As a former NOC engineer, I clearly remember my onboarding, and especially the deep-rooted fear I felt every time I encountered an alert that was new to me – particularly during a night shift. My only consolation was that I was never alone during training, so there was always someone I could ask that very awkward question: “I’m new here, what do we do with this…?”

Grow your blame-free culture with these postmortem best practices

Bugs will happen from time to time. As our systems grow in complexity, new functionalities mean new risks. What makes or breaks a team is not only how it handles incidents, but also how it learns from them. This is where incident postmortems come into the picture.

You can't shoot for 5-9's without having powerful incident resolution capabilities

With the advent of more and more connected systems and devices, organizations are facing ever greater data security challenges. Moreover, with the accelerating cloudification of apps and even whole infrastructures, security professionals are also faced with the challenge of how to protect critical assets including personal and other sensitive data as well as IP.

AIOps for Dummies

Artificial Intelligence for IT Operations, or AIOps, is very much in its infancy, and most IT Operations pros are still trying to figure out what it means. That's not to mention thinking about where to even start. Each chapter in this eBook provides a deeper understanding of AIOps with actionable steps to help you get started on the path to revolutionizing your IT Operations.

Announcing Unified IT Status Notifications from "Big 3" Cloud Providers

StatusCast helps corporations keep their employees happy by providing unified IT status notifications, which let them communicate IT status updates to their employees from a single location. Having to check both a corporate IT status page and a separate one for the organization’s cloud provider to determine the extent of IT issues lowers employee productivity and job satisfaction.

Hrushikesh shares his journey into SRE and his thoughts on the future of this space

Hrushikesh is passionate about tackling complex designs with simple and reliable solutions. He is technology and platform agnostic and doesn’t believe in limiting himself to just a few. He started his career in 2006 with a media company, where he was responsible for introducing new technologies and driving a team to deliver quickly. He does not limit his role to just development and operations and loves exploring everything in the tech space.

Dynamic alerts

Logs reflect the status and behavior of our applications and infrastructure, and that is where their power and value lie. We often want to be alerted when an application or one of its components behaves abnormally. Such behavior can show up as the application sending logs at a higher than usual volume. Figuring out exactly what ‘higher than usual’ means, or in other words, setting the threshold value at which the alert should trigger, can be a daunting task.
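
To make "higher than usual" concrete, here is a minimal sketch of one common way to derive a threshold from recent history instead of hard-coding it: the mean of past log counts plus a multiple of their standard deviation. This is an illustration only, not necessarily the approach the article goes on to describe. The function names, the multiplier `k`, and the sample counts are all hypothetical.

```python
from statistics import mean, stdev


def dynamic_threshold(history, k=3.0):
    """Return mean + k * sample standard deviation of past log counts.

    `history` holds log counts for previous intervals (assumed here to be
    equal-length windows); `k` controls how far above the usual volume a
    count must be before it is treated as abnormal.
    """
    if len(history) < 2:
        raise ValueError("need at least two past intervals")
    return mean(history) + k * stdev(history)


def should_alert(history, current, k=3.0):
    """True when the current interval's count exceeds the dynamic threshold."""
    return current > dynamic_threshold(history, k)


# Hypothetical log counts per interval, for illustration only.
history = [100, 110, 90, 105, 95]

print(round(dynamic_threshold(history), 2))
print(should_alert(history, 120))
print(should_alert(history, 130))
```

With these sample counts the threshold lands a little under 124, so a count of 120 stays quiet and 130 triggers. A smaller `k` makes the alert more sensitive at the cost of more false positives, and real log volumes often have daily or weekly patterns that a plain mean and standard deviation will not capture.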