Finding Five Seconds

Problem solving has always been a hobby to me. A hobby I very much enjoy. Maybe not so much when in the middle of a problem that seems unsolvable - but after cracking it, the thrill is so much greater.

This particular problem had been nagging my customers for some time. Some speculated that it had been around for several years. It's certain that a special task force had been formed 5 months prior specifically to solve this (and similar) problems. Let's start with the symptoms.

A J2EE application that used another J2EE application as session component had terrible performance. Using Selenium the task force could show that a simple scenario involving some 15-20 requests took more than 4 minutes to complete. The applications were deployet on the same WAS cluster consisting of two Windows 2003 WAS nodes, but each application has a load balancer and an array of WebSEALs in front of it. The request therefore travelled through the following components: LB - WS - WAS - LB - WS - WAS. The task force spent a lot of time searching through log files in order to find the guilty component. As it usually happens, a lot of misunderstandings muddled the picture, but eventually the first WAS in the chain was marked as the culprit.

Several test applications were written and the problem seemed clear: When the session application was requested, the response was delayed almost 5 seconds! This behaviour could not be replicated on our local machines, nor was the delay experienced when hitting it with a browser from the servers. Why was this happening? We were clueless.

I took over the task one wednesday and after some inital log reading I confirmed the findings of the task force. I rewrote the test applications to understand how they worked and to test several new things - trying to pinpoint the exact statement eating the 5 seconds. It became clear to me that it was not 5 seconds but exactly 4500 milliseconds. The smell of timeout was heavy in the air, but what timeout? The tracking ended where the Java code opened a new connection - but why would this take so up so much time?

In the evening I could not let the problem go. I had altered the test application to test any DNS or IP address I threw at it, and in a freak attempt to get somewhere, I tried testing with all kinds of requests: www.google.com, internal web services, external web sites in our DMZ, random machine names, whatever could be hit with a HTTP request. Some were slow, some were fast!

Suddenly a pattern started emerging. It struck me that all DNS addresses with reverse DNS record were fast, and those without were slow. Now we were getting somewhere! The next morning I made a TCP trace of the WAS server. BAM - the core error was found. Three requests to the target server on some NBNS port with an interval of 1500 ms each. That was the timeout I was looking for. So I hit the Internet to learn some new stuff about DNS and NetBIOS Name Service. It seems that Java 1.2 introduced a reverse DNS lookup as part of the process of creating a new connection. But this check is not really used. The DNS lookup is cached in Windows, but that's appearently not enough. Microsofts implementation of NetBIOS, called WINS, takes over when DNS fails and tries to get a host name from the target server. I checked the settings for service: Retry count 3, Timeout 1500 ms. That was it!

The question remained: Why did this happen and how should it be solved. I could not look into the native Java code, but I'm dead sure that the problem is to be found here. One solution would certainly be to rewrite the Java implementation of making HTTP connections. But that seemed like a lot of work. The straightforward solution was to set up reverse DNS on all and every DNS. Now the WINS is never asked, because the reverse DNS lookup finds an answer. Simple, but not very satisfying. But then again, we saved 4500 ms - on every request! The simple Selenium scenario mentioned above now only took 1 minute 20 seconds to complete!

The victory was celebrated with champagne (seriously!) and the good news spread like fire through the entire IT department... People from the business part of the company called us up, asking what we had done - the application suddenly showed excellent performance... sweet!