I realize it's been a while since I've written about logging, but my experiences this morning compelled me to share.  The muse for this post was a very simple bug located in a rat's nest of complexity.

The code in question drives the I/O layer in a device that is filled with proprietary hardware.   There are about three dozen channels of internal communication when this thing is running, which doesn't include any of its networking.  So lots of asynchronous I/O and threading.  All supporting an awesome graphics and physical interface layer that is equally complex on its own.

Except that today, it was a paperweight.  In fact, you couldn't even use the operating system effectively.  Eventually Task Manager would pop up and show that something in the application was suffocating the CPU.  I was not looking forward to diagnosing this problem given the level of complexity involved.  I expected to spend the morning on it, but in all it took about 10 minutes to find and fix the problem.

I had worked with the application enough to know what pattern of messages to expect in the log files, but when I opened them I found some of the messages missing: specifically, the messages pertaining to just one of those three dozen communication channels I mentioned earlier.  After some initial testing, I isolated the problem to a case where that channel was misconfigured.  I re-ran the application and grabbed the fresh logs.

Since I mirror my class structure in logger objects, isolating the relevant log entries was easy using logparser.  The list was surprisingly short, ending with this message:

initiating PSU polling thread 

I quickly found this message in my code:

    void PollForReports()
    {
        Log.Info( "initiating PSU polling thread" );
        while( 1 == Interlocked.Read( ref suspendInterlock ) );
        Log.Debug( "requested PSU serial number ..." );
        // ...
    }

As you can see, a "requested PSU serial number ..." log message should have been written almost immediately after the last message I found in the log.  It never was, so something between the "initiating PSU polling thread" call and the "requested PSU serial number ..." call was pegging the CPU.

Do you see it - the improperly implemented spin lock?  It's pretty obvious when you know where to look, and logging can make it a lot easier to know where to look.
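
The fix itself isn't shown above, so what follows is only a minimal sketch of one way to avoid this kind of busy-wait, not necessarily what went into the product.  It assumes the loop is meant to hold the polling thread while polling is suspended, and it swaps the suspendInterlock flag for a ManualResetEventSlim so the thread blocks instead of spinning.  The PsuPoller class, the Resume/Suspend/Stop methods, and the Console.WriteLine calls standing in for the logger are all hypothetical names introduced for illustration.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the polling class; only the wait logic matters here.
class PsuPoller
{
    // Signaled when polling may proceed. This replaces the suspendInterlock
    // flag from the original snippet (an assumption, not the author's code).
    private readonly ManualResetEventSlim resumeSignal = new ManualResetEventSlim(false);

    // Lets a shutdown path release the thread even if polling never resumes.
    private readonly CancellationTokenSource shutdown = new CancellationTokenSource();

    public void Resume()  { resumeSignal.Set(); }
    public void Suspend() { resumeSignal.Reset(); }
    public void Stop()    { shutdown.Cancel(); }

    public void PollForReports()
    {
        Console.WriteLine("initiating PSU polling thread");
        try
        {
            // Blocks until Resume() is called or Stop() cancels the wait,
            // rather than looping on a flag at full CPU.
            resumeSignal.Wait(shutdown.Token);
        }
        catch (OperationCanceledException)
        {
            // Logging the exit keeps the trail in the log unbroken.
            Console.WriteLine("PSU polling cancelled while suspended");
            return;
        }
        Console.WriteLine("requested PSU serial number ...");
        // ...
    }
}
```

With this shape, a misconfigured channel that never calls Resume() leaves the thread parked and idle rather than burning a core, and the missing "requested PSU serial number ..." entry still points straight at the wait.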