Most companies take their integration infrastructure for granted. I’m talking about middleware such as IBM MQ, Kafka, Solace, ActiveMQ, and RabbitMQ. These products underpin the operations of most enterprise-level businesses.
One of our electronics manufacturing customers was producing $40K worth of product per minute. A failure in one of the factory floor’s automated systems brought manufacturing operations to a complete halt. After hours of searching and troubleshooting, the IT teams determined that the root cause was a server that had crashed at 2:30 a.m. Messages bound for the main production application backed up until its capacity was exceeded and the application shut down. The problem wasn’t discovered for several hours, and it then took 3 hours to find and fix the root cause. In one day, the company lost $7 million in product revenue.
According to Gartner, the average cost of IT downtime is $300K per hour. Last year Facebook lost $79 million due to a prolonged outage. National Australia Bank had to pay $7.4 million in compensation due to an outage in its payment system. One of our retail customers does 60% of its annual business in 10 days. Keeping systems up and running over the holiday period is a matter of life and death for that company.
Most major companies have enterprise system monitoring from SIEM, AIOps, or APM vendors such as Splunk, AppDynamics, Dynatrace, Tivoli, or BMC. These include rudimentary plug-ins for integration infrastructure such as IBM MQ, which typically pick up important events such as queue full or channel down. So the question is: is this enough? Once you have identified that the integration infrastructure is at fault, how much time and money does it take to drill down to the exact root cause within (or via) the middleware and to fix it? Do you have enough diagnostic data? Do you have the skills? Do you have the tooling? Is the vendor able to support you on the phone with middleware expertise throughout the outage?
Nastel’s customers use our Integration Infrastructure Management (i2M) solution (Navigator X) to enhance their existing tooling. We take a proactive SRE approach, based on an in-depth understanding of the middleware, that aims to prevent the outage altogether by monitoring key indicators such as:
- Latency – Time to service a request
- Traffic – Demand placed on the system
- Error Rate – Rate of failed requests
- Saturation – Which resources are most constrained
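To make these four indicators concrete, here is a minimal Python sketch of how they could be computed from a window of request samples plus a queue depth reading. It is an illustration only, not Navigator X code or any vendor API: the `RequestSample` type, the `golden_signals` function, the 95th-percentile choice for latency, and the use of queue depth over queue capacity as the saturation measure are all assumptions made for this example.

```python
from dataclasses import dataclass
from math import ceil
from typing import Dict, List, Optional


@dataclass
class RequestSample:
    """One observed request (hypothetical record shape for this sketch)."""
    latency_ms: float
    succeeded: bool


def golden_signals(
    samples: List[RequestSample],
    window_seconds: float,
    queue_depth: int,
    queue_capacity: int,
) -> Dict[str, Optional[float]]:
    """Summarise latency, traffic, error rate, and saturation for one window."""
    count = len(samples)

    # Latency: 95th percentile by the nearest-rank method (None if no samples).
    p95: Optional[float] = None
    if count:
        ordered = sorted(s.latency_ms for s in samples)
        p95 = ordered[ceil(0.95 * count) - 1]

    # Error rate: share of requests in the window that failed.
    failures = sum(1 for s in samples if not s.succeeded)

    return {
        "latency_p95_ms": p95,
        # Traffic: requests per second over the window.
        "traffic_per_second": count / window_seconds,
        "error_rate": failures / count if count else 0.0,
        # Saturation: assumed here to be how full the queue is.
        "saturation": queue_depth / queue_capacity,
    }


if __name__ == "__main__":
    window = [
        RequestSample(10.0, True),
        RequestSample(20.0, True),
        RequestSample(30.0, False),
        RequestSample(40.0, True),
    ]
    print(golden_signals(window, window_seconds=2.0,
                         queue_depth=800, queue_capacity=1000))
```

In a setup like this, an alert on saturation approaching 1.0 would fire while messages are still backing up, rather than after the queue is full and the consuming application has already stopped.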
A recent new customer said that they bought our solution to enhance their ITSM strategy for the following reasons:
- A combined solution to drill straight from the IT monitoring to the configuration tooling for remediation
- The domain expertise (integration infrastructure) in Nastel’s people and embedded in the solution
- A single tool for managing the whole integration infrastructure from a single screen, e.g., IBM MQ, Apache/Confluent Kafka, TIBCO EMS, IBM ACE/IIB, and Solace
- Removing ‘Key Person Risk’ due to the extensions and homegrown software that had been written to enhance the ITSM solution with middleware scenarios
- Privileged access management, e.g., tracking a payment transaction as it travels through the entire infrastructure, ensuring that no one changes it in any way (amount, payee, etc.)
Debenhams estimated that Nastel’s specialized monitoring saves them 3 hours of staff time every day and has saved thousands of dollars that would otherwise have been wasted just treating the symptoms.
So, is integration infrastructure critical to your business? I’m very keen to hear your stories. Please leave a comment below or contact me directly and let’s discuss it.