The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.
Service incidents are unavoidable in today’s complex and dynamic IT environments. They can cause significant disruption to business operations, customer satisfaction, and revenue. However, many organizations still struggle to manage service incidents effectively. Here, we explore some of the common challenges ITOps teams face and how HEAL, an AI-powered tool, can help conquer them.
For this week’s installment of “The concise guide to Loki,” I’d like to focus on an interesting topic in Grafana Loki’s history: ingesting out-of-order logs. Those who’ve been with the project a while may remember a time when Loki would reject any logs that were older than a log line it had already received. It was certainly a nice simplification to Loki’s internals, but it was also a big inconvenience for a lot of real-world use cases.
A C-suiter, VP’er, or anyone else at the top of the pile in an enterprise faces inherent pressure that comes part and parcel with the role they’ve taken on. Many of these pressures can’t be overcome; it’s simply the nature of the beast. But dealing with technical issues in their day-to-day life is one of their biggest gripes, because it always feels like a problem that should be solved – not one that needs to be dealt with again and again.
In Java applications, concurrency issues can be difficult to reproduce and debug. Because work is scheduled nondeterministically across threads, the conditions that have led to an error in one execution of the program may not trigger the same issue the next time around. Exceptions that are silently handled—also known as swallowed exceptions—can also be challenging to debug because they typically do not leave any trace in the logs.
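To make the idea of a swallowed exception concrete, here is a minimal sketch. The class and method names (`SwallowedExceptionDemo`, `parseQuantity`, `parseSwallowing`, `parseOnWorker`) are hypothetical and exist only for illustration. The first helper uses an empty `catch` block, so a failure leaves no trace. The second runs the same work on a worker thread through `ExecutorService.submit`, where a failure is held inside the returned `Future` and only surfaces if the caller asks for the result.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SwallowedExceptionDemo {

    // Hypothetical parser used only for illustration.
    public static int parseQuantity(String raw) {
        return Integer.parseInt(raw.trim());
    }

    // Anti-pattern: the catch block is empty, so nothing is logged and the
    // caller cannot tell a malformed input from a genuine zero.
    public static int parseSwallowing(String raw) {
        try {
            return parseQuantity(raw);
        } catch (NumberFormatException e) {
            // intentionally empty: the exception is swallowed here
        }
        return 0;
    }

    // Runs the parser on a worker thread. submit() stores any exception in
    // the Future; it is reported only because get() is called below.
    public static String parseOnWorker(String raw) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Integer> task = () -> parseQuantity(raw);
            Future<Integer> result = pool.submit(task);
            return "ok: " + result.get();
        } catch (ExecutionException e) {
            // The worker's original exception is the cause.
            return "failed: " + e.getCause().getClass().getSimpleName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return "interrupted";
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseSwallowing("12x")); // 0
        System.out.println(parseOnWorker("12x"));   // failed: NumberFormatException
        System.out.println(parseOnWorker(" 7 "));   // ok: 7
    }
}
```

If `get()` were never called on the `Future`, the `NumberFormatException` thrown on the worker thread would go unreported, which is one way a concurrent program ends up with failures that do not appear in the logs.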
Like everyone else in the world, we are thinking hard about how we can harness the power of AI and machine learning while also staying true to our core values around respecting the security and privacy of our users’ data. If you use Sentry, you might have seen our “Suggested Fix” button which uses GPT-3.5 to try to explain and resolve a problem. We have additional ideas being developed as well that we’re excited to preview.
Orange España, Spain’s second-largest mobile operator, suffered a major outage on January 3, 2024. The outage was unprecedented because RPKI, a mechanism designed to protect internet routing security, was used as a tool for denial of service. In this post, we dig into the outage and the unique manipulation of RPKI.
Imagine a symphony where every musician plays their part flawlessly, but without a conductor to guide the orchestra, the result is just a discordant mess. Now apply that image to the modern IT landscape, where development and operations teams work with remarkable autonomy, each expertly playing their part. Agile methodologies and DevOps practices have empowered teams to build and manage their services independently, resulting in an environment that accelerates innovation and development.