Effective Retrospectives

§ November 17, 2008 09:29 by beefarino |

Matt Grommes wrote another thought-provoker today on sprint retrospectives; my comments started wandering and found their way into this post...

The few retrospectives I've facilitated have taught me some tricks.  Matt is correct when he says that a retrospective can turn into a monster kvetching session if allowed to do so.  In my opinion there are two keys to avoiding this:

  1. provide a structured way for the team to express their input;
  2. timebox everything. 

I've found the following retrospective structure to be effective in producing actionable feedback....

Prerequisites

I show up to facilitate a retrospective armed with the following items:

  • a white board with working markers and an eraser;
  • green, pink, and yellow post-its;
  • pens;
  • a kitchen timer.

Sometimes I bring a yummy treat, but I've had that backfire when the sugar rush ebbs and the team starts to wander mentally.  I also try to schedule the retrospective as the last event in the workday for everybody, but early in the week.  This seems to keep the team at ease, but focused.

First: Establish Trust in the Process

Like confessions extracted under coercion, feedback is worthless when obtained from someone who doesn't feel secure enough to be honest.  Agile processes in general rely heavily on open communication, and I found that my non-agile team displayed a lot of mistrust in the retrospective's goals.  

I begin each retrospective with a technique described in Agile Retrospectives: Making Good Teams Great that is aimed at measuring the willingness of the group to communicate.  Each person writes on a post-it a number between one and five inclusive, indicating their willingness to participate in the discussion to follow:

  1. I will not speak;
  2. I will be quiet and let others talk;
  3. Some things I will discuss, some things I will not;
  4. I'll talk about almost anything;
  5. I'll talk about anything.

The anonymous post-its are collected, tallied, and discussed briefly.  Obviously the goal is to have all fives, maybe one or two fours in the mix.  The times I've seen anything else, it's either because a communication problem exists on the team, or the goal of the retrospective is in question (e.g., QA feels like they will be scapegoated for a missed deadline).  

I keep discussion to five minutes, reiterating the retrospective Prime Directive, making sure that everyone knows they are expected to keep the conversation productive, and that the goal is to make the team work together better.

Second: Discuss Good Things

Next I pass out the green post-its and pens, and ask everyone to write down as many things as they can think of that worked well during the sprint.  I ask them to include team and personal efforts, new or existing processes - anything they can think of that made a positive contribution to the effort.  I use the timer to box the task to 5 minutes.

I collect the anonymous post-its and start going through them one at a time, collating as necessary.  I read each Good Thing aloud and open the floor for discussion.  I try my best to categorize them on the whiteboard as we go, so the team can see common trends.  For example, several post-its may comment on positive effects realized in documentation and QA from adding a bit of structure to the developers' subversion commit notes.

This part of the retrospective is generally pretty easy.  There isn't a lot of gray area in terms of choosing to maintain a practice that has a positive effect on the team. In addition, I find that discussing the positives first helps the team open up more during the next task...

Third: Collect and Categorize Bad Things

I pass out the pink post-its, set the timer for 5 minutes, and have the team jot down anything they felt impeded their work.  I emphasize "anything" - I would rather someone write down 10 also-rans and 2 gems than spend the entire time deliberating on what to write at all.

Against the advice of most retrospective experts, I ask the team *not* to exclude personal remarks about other team members; however, I do remind them that the remarks should be constructive, and that the retrospective is not a performance review.  For example, I would consider this appropriate:

Jim would benefit from writing more unit tests, having regular peer reviews, and mentoring.

but this inappropriate:

Jim's code sucks and never works.

As usual, I collect the post-its anonymously (I like to use a coffee can with a slit in the lid, BTW), during which time I draw two columns on the whiteboard: "Under Our Control" and "Out of Our Control".  I read each Bad Thing aloud and ask the team a single question:

Do we have control over this?

Their answer is usually unanimous, and it dictates which column the post-it falls under on the whiteboard.  There is no discussion of the Bad Thing at this point - the purpose of this task is only to isolate whether the Bad Thing is something the team can fix themselves.

Finally: Discuss Solutions and Take Action

Once the team can see what they can control and what they can't, the discussion can begin.  I spend a few minutes on the "Out of Our Control" items, but only to alleviate fears and to keep the discussion solution-focused and positive; I then remove those items from the board.

Moving on to the remaining post-its in the "Under Our Control" column, I align them down the left side of the board in no particular order and draw racing lanes across the board for each.  I then ask the team which Bad Thing they believe had the most negative impact on their work, and we start the discussion there.

This is the part where I find the role of facilitator to be most important.  It is very easy for the team to drift off-topic, or to get bogged down in complicated solutions to simple problems.  I find it helps to focus discussion by reiterating one simple question:

What can we do today to prevent the Bad Thing from happening tomorrow?

Most of the time the team will produce a viable solution.  If the team can't gel on a fix, I pass out the yellow post-its, we brainstorm fixes for a few minutes, collate them in the racing lane, and then discuss.  The point is to keep the conversation on-target and constantly moving forward.  Once the team settles on a solution, I jot down the actionable items in the racing lane.

I repeat this process for each Bad Thing, allowing the team to choose the ordering, spending at most 10 minutes on each one.  If we get through them all in the retrospective's timebox, I'll open the floor to general discussion; however, my experience is that there is little else to discuss at that point if I've done my job.

Offline: Summarize and Reiterate the Solutions

Once the retrospective is over, I write up a short summary.  I list all the Good Things, and all the Bad Things and their proposed solutions.  I send this out to everyone involved, both pigs and chickens.  I also keep a copy handy to remind myself and others of the specific commitments we made as a team to better ourselves.

So there you have it, Matt.  In my limited experience, what makes an effective retrospective is lots of structure to focus the team on one thing at a time and curb the natural kvetching tendencies.



Correlating and Grepping Logs with PowerShell

§ November 6, 2008 16:44 by beefarino |

I've spent the last few days doing a post-mortem on a colossal system failure at a client site.  We have plenty of data on the failure, in the form of log4net application logs and Windoze event logs, but the amount of data at our disposal is making interpreting and correlating those logs a ginormous task.  Some of the logs are XML, others are CSV, so making correlations between them has involved manually searching multiple spreadsheets and text files for information.  Correlating across the logs is made even more difficult as the event log export contains datetime instants that are local to the machine, whereas the log4net logs contain timestamps qualified with a UTC offset.

So what do I do when presented with such a Gordian Knot?  I select a katana sharp enough to shred the knot into pieces.  Lately my weapon of choice for this type of problem has been powershell.  My goal is to be able to correlate events by time and find errors or warnings across all of these logs.  To that end, I'll be using some powershell Shikai to create object hierarchies out of XML and CSV data.  In addition, I'll release powershell Bankai and extend those objects with custom properties that allow both types of logs to be combined and searched as if they were a single log.

Importing Log4Net XML Log Files

My application logs use the XML formatter built into the standard log4net distribution, which produces a partial XML document (as described here) similar to the following:

<log4net:event logger="MyApp" timestamp="2008-11-04T04:42:42.2612828-07:00" level="INFO">
    <log4net:message>validating xml digital signature on document</log4net:message>
</log4net:event>
<log4net:event logger="MyApp" timestamp="2008-11-04T04:42:43.1279382-07:00" level="INFO">
    <log4net:message>digital signature has passed validation</log4net:message>
</log4net:event>
...

Note there is no root element containing the <log4net:event /> elements, so the import script does a little fernagling to make the log appear to be a valid XML document:

# import-logxml
#
# imports a log4net application log file
#
# file: path to xml log file
#

param( $file )

# surround the log file contents with a root element
$xml = '<log xmlns:log4net="uri:doesnotmatter">' + ( gc $file ) + '</log>';

# load the log as XML and slurp out every log event
( [xml] $xml ).log.event |

    # add a note property indicating this log event is from an application log file
    add-member noteproperty -name logsource -value "applicationlog" -passthru |        
        
    # add a datetimeoffset property representing the instant of the log event
    add-member scriptproperty -name datetime -value {
            [DateTimeOffset]::Parse( $this.timestamp );
        } -passthru;

First, the contents of the log file are surrounded with a root element to make the log file valid XML.  Powershell's built-in [xml] datatype then translates the XML string into an object hierarchy, from which the script retrieves every log event.  Each <log4net:event /> element from the XML log file is represented by an object containing properties corresponding to the node's attribute values and child elements.

The rest of the pipeline uses powershell's add-member cmdlet to extend the log entry objects with new properties.  The first add-member call adds a simple note property named "logsource" with a static value indicating that this object was created from an application log.  The second adds a calculated property named "datetime" containing the instant of the log message as a System.DateTimeOffset structure.  Having this DateTimeOffset property will give me an appropriately-typed value against which I can compare, filter, and sort log entry timestamps.

Using the import-logxml script is straightforward:

> $log = import-logxml 'myapp.log.xml'

After running the script, the $log variable contains an array of objects representing the individual log events in the XML log file:

> $log[ 0 ]
datetime          : 11/04/2008 4:42:42 AM -07:00
logger            : MyApp
timestamp         : 2008-11-04T04:42:42.2612828-07:00
level             : INFO
thread            : 5076
message           : validating xml digital signature on document

Now powershell's filtering capabilities can be used to isolate log events; e.g., by severity level:

> $log | ? { $_.level -eq 'error' }
datetime          : 11/04/2008 4:44:16 AM -07:00
logger            : MyApp
timestamp         : 2008-11-04T04:44:16.8272804-07:00
level             : ERROR
thread            : 5076
message           : an exception was raised while processing the application configuration
...

Importing Event Log CSV Files

Importing the CSV event log data is a little more involved:

# import-eventlogcsv.ps1
#
# translates an event log export from
# CSV format to an object collection
#
# file: path to CSV export of event log
# offset: a timespan indicating the UTC offset
#     of the event log source
#

param( $file, [TimeSpan] $offset = [DateTimeOffset]::Now.Offset )

# add CSV column headers, if necessary
convert-eventlogcsv $file;

# import the CSV file
import-csv $file |

    # add a note property indicating this entry is from an event log
    add-member noteproperty -name logsource -value "eventlog" -passthru |
    
    # add a note property to store the value of the UTC offset
    add-member noteproperty -name offset -value $offset  -passthru |
    
    # add a datetimeoffset property representing the instant of the log entry
    add-member scriptproperty -name datetime -value {
            new-object DateTimeOffset
                ( [DateTime]::Parse( $this.date + " " + $this.instant ),
                  $this.offset
                );
        } -passthru |
        
    # add a property that translates the event log message type
    #    into a log4net log-level value
    add-member scriptproperty -name level -value {
            switch -re ( $this.eventtype  )
            {                
                "info" { "INFO" }
                "succ" { "INFO" }
                "fail" { "INFO" }
                "warn" { "WARN" }
                "error" { "ERROR" }
            }
        } -passthru

Event log timestamps are expressed in the time zone of the source computer, and in my particular case that source computer exists in a different time zone than my laptop.  So the import script accepts an offset parameter indicating the UTC offset of the event log source computer.

Importing CSV data into powershell is as simple as invoking the import-csv cmdlet; however, import-csv assumes the first row in the file contains column headers.  The event log export contains no such header row, so one must be added.  I've wrapped up the process of adding this header row into a script (convert-eventlogcsv.ps1) that is included in the download for this post; the import script invokes it before calling import-csv.
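
The real convert-eventlogcsv.ps1 is in the download; as a rough sketch of the idea, assuming the export's columns arrive in the order shown below (date, instant, source, eventtype, category, eventid, user, computer, message), it only needs to prepend a header row:

# convert-eventlogcsv.ps1 (sketch - the real script is in the download)
#
# prepends a CSV header row to an event log
# export so import-csv can digest it
#
# file: path to CSV export of event log
#

param( $file )

$header = 'date,instant,source,eventtype,category,eventid,user,computer,message';
$lines = @( get-content $file );

# add the header row only if it isn't already present
if( $lines[ 0 ] -ne $header )
{
    set-content $file ( @( $header ) + $lines );
}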

The result of the import-csv cmdlet is an object array.  Each object in the array represents a single row of the CSV, with properties corresponding to each column in the row.  As with the application logs, several properties are added to each object using powershell's add-member cmdlet:

  • a "logsource" property indicating that the log event was read from an event log export;
  • an "offset" property containing the UTC offset passed into the script;
  • a "datetime" property that creates a DateTimeOffset structure representing the instant the log message was produced;
  • a "level" property that translates the event log entry type (Information, Warning, Error, etc. ) into strings matching the standard log4net log severity levels (DEBUG, INFO, WARN, ERROR, and FATAL).

Script use is straightforward:

> $evt = import-eventlogcsv 'eventlog.system.csv' ( [TimeSpan]::FromHours( -7 ) )

With the result similar to the log4net import:

> $evt[ 0 ]
date      : 11/04/2008
instant   : 4:42:42 AM
source    : MyApp
eventtype : Information
category  : None
eventid   : 2
user      : N/A
computer  : PPROS1
message   : application has started
logsource : eventlog
offset    : -07:00:00
datetime  : 11/04/2008 4:42:42 AM -07:00
level     : INFO
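
You can verify the extensions using the get-member cmdlet; the properties added by add-member appear alongside the original CSV columns:

> $evt[ 0 ] | gm -membertype noteproperty,scriptproperty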

Correlating Events Across Logs

Now that the import scripts are available, correlating the logs is pretty simple.  First, we need to import our event and application logs:

> $log = import-logxml 'myapp.log.xml'
> $evt = import-eventlogcsv 'eventlog.system.csv' $log[ 0 ].datetime.offset

Note that the log4net log is imported first; this allows the time zone offset information it contains to be used to adjust the event log timestamps.  Once the two logs are imported, we can combine them using powershell's array concatenation operator:

> $data = $log + $evt

At this point, the $data variable contains the merged contents of the event log and application log.  The add-member cmdlet was used to extend these objects with custom properties during the import; now every merged log entry contains a logsource, datetime, and level property with identical types and semantics.  This means that we can now search, sort, and filter the combined logs using a single set of criteria.  For example, consider the following:

> $start = [DateTimeOffset]::Parse( "11/04/2008 4:40 AM -7:00" )
> $end = [DateTimeOffset]::Parse( "11/04/2008 4:45 AM -7:00" )
> $data | ? { $_.datetime -gt $start -and $_.datetime -lt $end }

which finds all events that occurred between 4:40 and 4:45 AM.  Or this:

> $data | ? { $_.level -eq 'error' -and $_.datetime -gt $start -and $_.datetime -lt $end }

which isolates all error-level events during the same time period.  Or this:

> $data | sort datetime

which outputs all log entries sorted by the time they were written.
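
And since both log sources now share the same message property, the merged log can be grepped for arbitrary text:

> $data | ? { $_.message -match 'signature' } | sort datetime

which returns every entry mentioning a signature, in chronological order, regardless of which log it came from.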

Exporting the Correlated Logs

While powershell's interactive filtering capabilities are useful, you may find you want to export the newly merged logs for further analysis in a spreadsheet or other tool.  This can be easily accomplished using the export-csv cmdlet:

> $data | export-csv 'mergedlog.csv' -notypeinformation
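
The same filtering applies before an export; for example, to ship only the errors and warnings off to a spreadsheet:

> $data | ? { $_.level -match 'error|warn' } | sort datetime | export-csv 'problems.csv' -notypeinformation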

Download the Code

The import scripts are available here: merginglogswithpowershell.zip (1.49 kb).

Enjoy!




Overlapped I/O Aborted by Terminating Thread

§ October 17, 2008 07:10 by beefarino |

A while back I posted about using Overlapped I/O from the .NET Framework.  I've started integrating the hardware with the rest of the project and hit a snag.  It seems that if a thread makes an overlapped I/O request and later terminates, the I/O request is aborted and your IOCompletionCallback routine receives error code 995 (system error code ERROR_OPERATION_ABORTED, surfaced in .NET as a System.IO.IOException): "The I/O operation has been aborted because of either a thread exit or an application request".  I haven't looked into why this happens, but functionally it seems that the Windoze kernel assumes that the I/O request is valid only as long as the requesting thread is alive and kicking, which seems both perfectly reasonable and unreasonable depending on your perspective.  If you do happen to know the specifics of the kernel's behavior here, please comment; you'll save me some digging.

Example

Here is a unit test that illustrates the problem; a brief walkthrough follows the code:

[Test]
public void OIOTerminatesOnThreadExit()
{
    TcpListener listener = new TcpListener( 8888 );
    TcpClient client = new TcpClient();

    Exception exception = null;

    // signaled once the read callback has run
    ManualResetEvent readCompleted = new ManualResetEvent( false );

    Thread listeningThread = new Thread(
        delegate()
        {
            listener.Start();

            // block until we receive a client
            TcpClient myClient = listener.AcceptTcpClient();

            // initiate an overlapped I/O operation on
            //  the underlying socket
            myClient.GetStream().BeginRead(
                new byte[ 16 ], 0, 15,
                r => {
                    try
                    {
                        // calling EndRead should
                        //    yield an exception
                        myClient.GetStream().EndRead( r );
                    }
                    catch( Exception e )
                    {
                        // save the exception for later
                        //  assertion and validation
                        exception = e;
                    }

                    readCompleted.Set();
                },
                null
            );
        }
    );

    // start the listening thread
    listeningThread.Start();

    // connect to the TcpListener, so it can initiate an
    //  overlapped I/O operation
    client.Connect( Dns.GetHostName(), 8888 );

    // wait for the listening thread to finish, then for
    //  the aborted read's callback to run
    listeningThread.Join();
    readCompleted.WaitOne();

    // verify
    Assert.IsNotNull( exception );
    Assert.IsInstanceOfType( typeof( IOException ), exception );
    StringAssert.Contains(
        "The I/O operation has been aborted because of either a thread exit or an application request",
        exception.Message
    );
}

Note that for brevity this test contains no error handling or Tcp timeouts, which it really should.

The test creates a thread that starts a TcpListener and waits for a connection.  Once a connection is established the thread issues an overlapped I/O read request on the network stream.  The AsyncCallback handler for the BeginRead operation just calls EndRead, saving any exception that occurs for further scrutiny.  Immediately following the overlapped I/O request, the listeningThread terminates normally.

Once the listeningThread is started, the unit test uses a TcpClient to connect to the TcpListener.  This will allow the listeningThread to make the overlapped I/O request and terminate.  After the TcpClient is connected, the test waits for the listeningThread to terminate and for the read callback to run.

At this point, the test verifies that an exception was received from the BeginRead callback and validates its type and content.

Workaround

My current workaround is pretty simple: kick off the I/O operation from the thread pool instead of an application thread.  Thread pool threads don't really terminate like application threads; they just go back into the pool when their work unit is complete, and the kernel seems content to oblige I/O requests from the thread pool even after the thread is returned to the pool (e.g., when you call an asynchronous BeginRead operation from your AsyncCallback).

Here's the example with the workaround applied:

[Test]
public void OIODoesNotTerminateOnThreadPoolThreadExit()
{
    TcpListener listener = new TcpListener( 8888 );
    TcpClient client = new TcpClient();

    Exception exception = null;

    // signaled once the overlapped read has been issued
    ManualResetEvent readIssued = new ManualResetEvent( false );

    // signaled once the read callback has run
    ManualResetEvent readCompleted = new ManualResetEvent( false );

    // start the listening logic on the thread pool
    ThreadPool.QueueUserWorkItem(
        delegate( object unused )
        {
            listener.Start();

            // block until we receive a client
            TcpClient myClient = listener.AcceptTcpClient();

            // initiate an overlapped I/O operation on
            //  the underlying socket
            myClient.GetStream().BeginRead(
                new byte[ 16 ], 0, 15,
                r => {
                    try
                    {
                        myClient.GetStream().EndRead( r );
                    }
                    catch( Exception e )
                    {
                        // save the exception for later
                        //  assertion and validation
                        exception = e;
                    }

                    readCompleted.Set();
                },
                null
            );

            readIssued.Set();
        }
    );

    // connect to the TcpListener, so it can initiate an
    //  overlapped I/O operation
    client.Connect( Dns.GetHostName(), 8888 );

    // wait for the overlapped read to be issued, then send
    //  some data so the outstanding read can complete after
    //  the pool thread's work unit has finished
    readIssued.WaitOne();
    client.GetStream().Write( new byte[ 8 ], 0, 8 );
    readCompleted.WaitOne();

    // verify
    Assert.IsNull( exception );
}

The changes here are that the listening logic is queued on the thread pool, the client pushes a few bytes through the connection so the outstanding read can complete, and the test verifies that the exception remains null.  Using the thread pool seems to counter the kernel behavior and keep the outstanding overlapped I/O requests active even after the thread work unit is complete.



Log4Net Recommended Practices pt 1: Your Code

§ October 14, 2008 15:49 by beefarino |

So far my log4net writings have focused on specific aspects of the log4net framework.  These next posts are about putting the pieces together.

To that end, I would like to start with one recent example of how logging has saved my hide.

The Call

My company phone started cock-a-doodle-dooing about 3pm one afternoon.  It was the Director of Support calling; I'll refer to him as Bob.  He was on-site trying to get a kiosk up and running for a major client, but a security configuration application was reporting a failure.  This application is basically a permissions engine that modifies DACLs on cryptographic keys, files, and registry keys based on a database of access rules.  The client was going live in an hour or so, so time was a critical factor.

"I've tried everything I know - the database is correct, the AD users are there, the security service is running, and I'm running as an administrator," he said nervously, "but it keeps failing at the same point on this one machine.  It worked on all the other machines, but this one kiosk keeps failing."

"Ok, let's open the log file.  It should be at <file path>.  Find it?" I asked.

"Um.... yep.  Ok, it's open."

"Alright.  Jump to the end, find the last line.  Read it to me."  

This is the layout of the log file for the application:

DATETIME LOGGERNAME [LEVEL] (CONTEXT) MESSAGE_STACKTRACE

The last two lines in the log appeared like so:

12:34:56 TaskExecutionContext [DEBUG] (ACLRuleBuilding) Attempting to locate computer domain name using WMI class [Win32_ComputerSystem] property [Domain].

12:34:56 SetCryptoKeyDACLTask [ERROR] (ACLRuleBuilding) An exception was raised while translating the user name to the domain: System.Runtime.InteropServices.COMException at ...

I had never encountered this error before in development, testing, or the two years the application's been used in the field.  However, the data in that one log message was enough for me to help Bob resolve the issue over the phone.  There were four bits of information I gleaned from Bob's reading of those last two log messages that allowed me to isolate the problem:

  1. The logger name told me that the SetCryptoKeyDACLTask class issued the error.
  2. The logging context told me that the error was occurring while the application was building ACL rules (as opposed to applying them).
  3. The log message told me the failure was occurring when the application was attempting to translate a username to the computer's domain.
  4. The preceding DEBUG log message told me that the application was trying to find the computer's domain name immediately before the exception was raised.

"Hmmm, haven't seen that one before either," I said, "but it looks like the app is failing when it's trying to determine the computer's domain.  Any idea why this might be happening on this machine as opposed to the others?"

"Well, I had to add the machine to the domain and then ...."  I hear a loud *smack* as Bob hits himself in the head.  "Okay, I think I figured it out."

It turns out that Bob had neglected to reboot the kiosk after adding it to the domain, so when my application started spelunking through WMI to determine the computer's domain, an exception was being raised.  After a few minutes, the kiosk was running perfectly, and Bob and the client enjoyed a smooth go-live.

All things considered, Bob would have resolved this issue on his own eventually by blindly rebooting the machine and starting the security configuration process over again.  The point is that the application log contained enough information to isolate the problem to a very specific cause without resorting to looking at the source code.  COMExceptions are notoriously ambiguous; if that exception were all we had to go on, we might never have determined its cause.

Some Recommended Practices

Let's take a look at some of the code that produced the log message:

namespace Log4Net_BestPractices1
{
    public class SetCryptoKeyDACLTask : ITask
    {
        static log4net.ILog Log = log4net.LogManager.GetLogger(
            System.Reflection.MethodBase.GetCurrentMethod().DeclaringType
        );
        
        string userName;
        
        public void Execute( ITaskExecutionContext context )
        {        
            if( Log.IsDebugEnabled )
            {
                Log.DebugFormat( "Executing task under context [{0}]...", ( null == context ? "null" : context.Description ) );
            }
            
            // ...
            
            using( log4net.NDC.Push( "ACLRuleBuilding" ) )
            {
                // ...
                string principalName = null;
                
                try
                {
                    principalName =
                        context.TranslatePrincipalNameToDomain(
                            userName
                        );
                }
                catch( Exception e )                
                {
                    Log.Error(
                        "An exception was raised while translating the user name to the domain",
                        e
                    );
                    
                    throw;
                }
                
                // ...
            }
            
            // ...
            
            Log.Debug( "Task execution has completed." );
        }
    }
}

This code demonstrates some effective logging techniques I recommend you employ in your own applications:

  1. Use a unique logger object for each type in your application (see the static Log field at the top of the class).  This will automagically tag each log message with the class that wrote it.  Using this technique provides a log that reads like a novel of your application's activity.
  2. Whenever you catch an exception, log it (see the catch block in Execute).  Even if you just plan to throw it again, log it.  In addition, log4net knows how to format Exception objects, so don't try to build your own string from the exception data.
  3. Don't be afraid to pepper your application with piddly little log messages (like the "Executing task..." and "Task execution has completed." messages in the example).  Each one becomes a marker of activity and can prove invaluable when trying to work an issue.  While logging does consume resources, that consumption can be controlled and optimized (something I'll talk about in the next post).  In all but the most CPU-intensive applications, the impact of logging won't be noticed when configured properly.
  4. Whenever you use one of the *Format overrides (like the DebugFormat call in the example), be extra-special-careful about validating the argument list.  It's very easy to forget to check for null references before using a property accessor, for example, because "it's just logging code."
  5. Always remember that the argument list is evaluated before the logging method is called.  If an argument to the log method is expensive to obtain, be sure to guard your log statement with a check of the appropriate Is*Enabled property (see the Log.IsDebugEnabled guard in the example).
  6. Before an object calls on a shared dependency, consider pushing a tag onto the log context stack (see the NDC.Push call in the example).  This will provide continuity in the log, allowing you to determine the caller of the shared code that logged a particular message.
  7. Whenever you use a formatted log statement, surround the format argument placeholders (the {}'s) with brackets or parentheses (note the "[{0}]" in the DebugFormat call).  Doing this will mark the areas in each log message that vary, making the log scan a bit easier.  In addition, it makes empty or null formatting arguments more obvious.  For instance, if that message were formatted with an empty context.Description value, it would appear as "Executing task under context []..."  Without those brackets, the same message would be written as "Executing task under context ..."  The former message makes it obvious that the context failed to provide a description value; the latter does not.

Coming Up

In upcoming posts I'll discuss log4net configuration scenarios and log reader clients.

In the meantime, if you have any additional logging practices you'd like to share, please feel free to comment on this entry!