ConvertFrom-PDF PowerShell Cmdlet

§ July 8, 2009 06:14 by beefarino |

I hate PDFs. 

And now I need to search through several hundred of them, ranging from 30 to 300 pages in length, for cross-references and personnel names which ... um ... well, let's just say they no longer apply.  Sure, Reader has a search feature built in, and so does Explorer, but that's so 1980's.  And I sure don't want to grind through each one manually...

I poked around the 'net for a few minutes looking for a way to read PDFs in PowerShell, but no donut.  So I rolled my own cmdlet around the iTextSharp library and Zollor's PDF to Text converter project.

There isn't much to the cmdlet code, given that all of the hard work of extracting the PDF text is done in the PDFParser class of the converter project:

using System;
using System.IO;
using System.Management.Automation;
namespace PowerShell.PDF
{
    [Cmdlet( VerbsData.ConvertFrom, "PDF" )]
    public class ConvertFromPDF : Cmdlet
    {
        [Parameter( ValueFromPipeline = true, Mandatory = true )]
        public string PDFFile { get; set; }
        
        protected override void ProcessRecord()
        {
            var parser = new PDFParser();
            
            // open the input file in a using block so the file handle is
            // released even if the extraction fails
            using( Stream input = File.OpenRead( PDFFile ) )
            using( Stream s = new MemoryStream() )
            {
                if( ! parser.ExtractText( input, s ) )
                {
                    WriteError( 
                        new ErrorRecord(
                            new ApplicationException(),
                            "failed to extract text from pdf",
                            ErrorCategory.ReadError,
                            PDFFile
                        )    
                    );
                    return;
                }
                
                // rewind the memory stream and write its text content
                // to the pipeline as a single string
                s.Position = 0;
                using( StreamReader reader = new StreamReader( s ) )
                {
                    WriteObject( reader.ReadToEnd() );
                }
            }
        }
    }
}

The code accepts a file path as input; it runs the conversion on the PDF data and writes the text content of the file to the pipeline.  Not pretty, but done.

Usage
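
Before anything else, the compiled cmdlet assembly has to be loaded into your session.  How you do that depends on your environment; here's a minimal sketch assuming PowerShell 2.0, which can load a cmdlet DLL directly as a binary module (the path is hypothetical):

> # load the cmdlet assembly as a binary module (PowerShell 2.0+)
> import-module c:\tools\PowerShell.PDF.dll
> get-command convertfrom-pdf

On PowerShell 1.0 you'd instead wrap the cmdlet in a PSSnapIn and register the assembly with installutil.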

Here is the simple case of transforming a single file:

> convertfrom-pdf -pdf my.pdf

or

> "my.pdf" | convertfrom-pdf 

More complex processing can be accomplished using PowerShell's built-in features; e.g., to convert an entire directory of PDFs to text files:

> dir *.pdf | %{ $_ | convertfrom-pdf | out-file "$_.txt" } 

More relevant to my current situation would be something along these lines:

> dir *.pdf | ?{ ( $_ | convertfrom-pdf ) -match "ex-employee name" } 
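
Since the extracted text is just a string on the pipeline, the search is easy to extend.  Here's a rough sketch that checks each PDF against a list of names and reports which files mention which ones; the names are placeholders, of course:

$names = 'john doe', 'jane doe';
foreach( $file in ( dir *.pdf ) )
{
    $text = $file | convertfrom-pdf;
    # emit a line for each name that appears in this document
    $names | ?{ $text -match $_ } | %{ "{0} : {1}" -f $file.Name, $_ };
}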

Download the source: PowerShell.PDF.zip (1.10 mb) 

Enjoy!




Creating a PowerShell Provider pt 3: Getting Items

§ July 7, 2009 04:51 by beefarino |

Now that the drive object of the ASP.NET Membership PowerShell provider is fully functional, it's time to extend the PowerShell provider to fetch users from the membership store.

The PowerShell cmdlet used to retrieve one or more items from a provider is get-item, or gi for you alias monkeys out there.  get-item works against any provider that supports item retrieval; e.g., it works for files and folders:

> get-item c:\temp
    Directory: Microsoft.PowerShell.Core\FileSystem::C:\
Mode           LastWriteTime       Length Name
----           -------------       ------ ----
d----     3/12/2009  2:23 PM        <DIR> temp

It works for environment variables:

> get-item env:PROGRAMFILES
Name                           Value
----                           -----
ProgramFiles                   C:\Program Files

It works for certificates:

> get-item cert:/CurrentUser/My/E0ADE6F1340FDF59B63452D067B91FFFA09A621F
    Directory: Microsoft.PowerShell.Security\Certificate::CurrentUser\My
Thumbprint                                Subject
----------                                -------
E0ADE6F1340FDF59B63452D067B91FFFA09A621F  E=jimbo@null.com, CN=127.0.0.1, OU=sec, O=ptek, L=here, S=nc, C=US

So, how does PowerShell know which provider to invoke when it processes a get-item cmdlet?  

PowerShell Paths

The item's full path contains all of the information necessary to locate the appropriate provider.  In the examples above, the drive portion of the path indicates which provider should be used to process the item request; you can see the mapping of drives to providers using the get-psdrive cmdlet:

> get-psdrive
Name       Provider      Root
----       --------      ----       
Alias      Alias
C          FileSystem    C:\   
cert       Certificate   \
Env        Environment
Function   Function
Gac        AssemblyCache Gac
HKCU       Registry      HKEY_CURRENT_USER
HKLM       Registry      HKEY_LOCAL_MACHINE
Variable   Variable

Each PowerShell drive is directly associated with a provider type: the C: drive maps to the FileSystem provider, the cert: drive to the Certificate provider, and the env: drive to the Environment provider.

PowerShell recognizes several forms of path syntax, all of them designed to allow for provider discovery; providers are expected to support these path formats as appropriate (concrete examples follow the list):

  • Drive-qualified: this is equivalent to a fully-qualified file path.  The drive is explicitly specified at the start of the path, as in "c:/temp/junk.txt" or "cert:/localmachine".  
  • Provider-direct: starts with "\\" (or "//"); the provider for the current location (i.e., the value of $pwd) is assumed to be the provider of the path. This syntax is often used to identify resources on other machines, such as UNC paths or remote registry hives.
  • Provider-qualified: the provider name is prepended to the drive-qualified or provider-direct item path, delimited by '::'.  E.g., "FileSystem::c:/temp/junk.txt", or "FileSystem::\\share\temp\junk.txt".  This format is used when the appropriate provider must be explicitly stated.
  • Provider-internal: this is the portion of the provider-qualified path following the '::' delimiter. 
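
To make the formats concrete, here's how FileSystem items can be addressed with each syntax (the paths are hypothetical):

> # drive-qualified
> get-item c:\temp\junk.txt
> # provider-qualified
> get-item FileSystem::c:\temp\junk.txt
> # provider-direct, resolved against the provider of the current location
> get-item \\server\share\junk.txt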

Of the four supported path formats, the ASP.NET Membership PowerShell provider will support these three:

  • Drive-qualified: users:/username
  • Provider-qualified: ASPNETMembership::users:/username
  • Provider-internal: this is identical to the drive-qualified path syntax

Provider-direct paths and UNC-style provider-internal paths will not be supported by the ASP.NET Membership PowerShell provider.

Knowing the path formats to expect, it's time to implement support for the get-item cmdlet.

Enabling Get-Item

Enabling item cmdlet support for the ASP.NET Membership PowerShell provider begins with deriving the provider from ContainerCmdletProvider:

using System.Management.Automation;
using System.Management.Automation.Provider;
namespace ASPNETMembership
{
    [CmdletProvider( "ASPNETMembership", ProviderCapabilities.None )]
    public class Provider : ContainerCmdletProvider
    {
        // ...
    }
}

Deriving from ContainerCmdletProvider adds many item-related methods to the provider.  To enable the get-item cmdlet, at least two of these methods must be overridden. 

GetItem

The first required override is the GetItem method, which is called to process a get-item invocation at runtime:

protected override void GetItem( string path )
{
    var user = GetUserFromPath(path);
    if( null != user )
    {
        WriteItemObject( user, path, false );
    }
}

The GetItem override delegates almost all of the work to the GetUserFromPath utility method; if GetUserFromPath returns a valid object reference, it is written back to the current pipeline using the WriteItemObject method inherited from the provider's base class.

GetUserFromPath uses the provider's custom drive object to access the ASP.NET Membership provider.  The drive object for the path is available in the PSDriveInfo property; PowerShell conveniently sets this property to the appropriate MembershipDriveInfo object for the item's path before calling GetItem:

MembershipUser GetUserFromPath( string path )
{
    var drive = this.PSDriveInfo as MembershipDriveInfo;
    var username = ExtractUserNameFromPath( path );
    return drive.MembershipProvider.GetUser( username, false );
}
static string ExtractUserNameFromPath( string path )
{
    if( String.IsNullOrEmpty( path ) )
    {
        return path;
    }
    // this regex matches all supported powershell path syntaxes:
    //  drive-qualified - users:/username
    //  provider-qualified - ASPNETMembership::users:/username
    //  provider-internal - users:/username
    // note: the expression is unanchored, so matching the trailing
    // "membership::" portion of the provider-qualified prefix is enough
    var match = Regex.Match( path, @"(?:membership::)?(?:\w+:[\\/])?(?<username>[-a-z0-9_]*)$", RegexOptions.IgnoreCase );
    if( match.Success )
    {
        return match.Groups[ "username" ].Value;
    }
    return String.Empty;
}

The custom drive object exposes the ASP.NET Membership Provider, whose GetUser method returns the MembershipUser object for a valid username.  The username itself is extracted from the path string by ExtractUserNameFromPath, using a simple regular expression that matches the three path formats supported by the PowerShell provider.
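
If you want to sanity-check the expression before wiring it into the provider, you can exercise it straight from PowerShell against the three supported path formats; a quick sketch:

> 'users:/testuser', 'ASPNETMembership::users:/testuser', 'testuser' | %{ [regex]::Match( $_, '(?:membership::)?(?:\w+:[\\/])?(?<username>[-a-z0-9_]*)$', 'IgnoreCase' ).Groups[ 'username' ].Value }
testuser
testuser
testuser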

ItemExists

The second required override is the ItemExists method, which is called by PowerShell to determine if the provider contains an item at a specified path (e.g., by the test-path cmdlet).

PowerShell calls ItemExists before the GetItem method when processing get-item; if ItemExists returns false, GetItem is not called and a "cannot find path" error is reported on the pipeline.  The ASP.NET Membership provider reuses the GetUserFromPath utility method to ascertain whether the path contains a valid username:

protected override bool ItemExists( string path )
{
    return null != GetUserFromPath( path );
} 

With these two overrides and their supporting utility methods, our provider can support the get-item cmdlet.

Testing Get-Item

Build and run; in the PowerShell console, create the users drive as follows:

> new-psdrive -psp aspnetmembership -root "" -name users -server localhost -catalog awesomewebsitedb; 
Name       Provider      Root                                   CurrentLocation
----       --------      ----                                   ---------------
users      ASPNETMemb... 

Once the drive is created, you can use get-item to fetch MembershipUser objects from the ASP.NET Membership user store:

> get-item users:/testuser
PSPath                  : ASPNETMembership::testuser
PSDrive                 : users
PSProvider              : ASPNETMembership
PSIsContainer           : False
UserName                : testuser
ProviderUserKey         : 09a9c356-a400-4cff-825d-231207946c94
Email                   : user@hotmail.com
PasswordQuestion        : what is your favorite color?
Comment                 :
IsApproved              : True
IsLockedOut             : False
LastLockoutDate         : 12/31/1753 7:00:00 PM
CreationDate            : 6/11/2009 12:59:45 PM
LastLoginDate           : 6/11/2009 12:59:45 PM
LastActivityDate        : 6/11/2009 12:59:45 PM
LastPasswordChangedDate : 6/11/2009 12:59:45 PM
IsOnline                : False
ProviderName            : AspNetSqlMembershipProvider

At this point, a whole new world of complexity is available from our provider:

> ( get-item users:/testuser ).ResetPassword( 'asdf1234' )
^PlpmNMON@7A]w

We can also leverage some of the built-in goodies of PowerShell against our ASP.NET Membership store in a natural way:

$u = get-item users:/testuser;
if( $u.IsLockedOut ) 
{ 
    $u.UnlockUser(); 
}  

Pretty cool.

Coming Up

Discovery is a big part of PowerShell, and in the next post I'll extend the ASP.NET Membership PowerShell provider to support the get-childitem (alias dir or ls) cmdlet, to enable listing of all users in the store.  I'll also add support for the set-location (alias cd) cmdlet, which will allow operators to set the shell's current location to our custom users drive.

The code for this post is available here: ASPNETMembership_GetItem.zip (5.55 kb)

As always, thanks for reading, and if you liked this post, please Kick It, Shout It, trackback, tweet it, and comment using the clicky thingies below!



Twitter Appender for Log4Net

§ July 7, 2009 04:13 by beefarino |

Casey Watson just published a post describing how to create a Twitter appender for log4net!  Not only is the idea creative and fun, but it's also great sample code if you want to build your own custom appenders.

Please check it out here: http://caseywatson.com/2009/07/07/log4net-twitter-awesome/

NICE WORK CASEY!



Death by Logging Mantra #1 - Logs Consume Space

§ June 15, 2009 18:03 by beefarino |

As much as I love and lean on logging, it's not immune from problems.  In fact, it can be the source of some serious headaches.  A recent snafu at work prompted me to write about some situations where logging can bring your application performance to a screeching halt, or crash your application altogether. 

Here's what happened...

An incredibly complex system of hardware and software had been running smoothly for months; as part of an instrumentation layer change, we opted to switch the rolling log strategy from fifty 10MB text files:

...
<appender name="TextFile" type="log4net.Appender.RollingFileAppender">
    <file value="logs\log.txt" />
    <appendToFile value="true" />
    <rollingStyle value="Size" />
    <maxSizeRollBackups value="50" />
    <maximumFileSize value="10MB" />
    <layout type="log4net.Layout.XMLLayout">
      <prefix value="" />
    </layout>
</appender>
... 

to five hundred 1MB XML files:

...
<appender name="XmlFile" type="log4net.Appender.RollingFileAppender">
    <file value="logs\log.xml" />
    <appendToFile value="true" />
    <rollingStyle value="Size" />
    <maxSizeRollBackups value="500" />
    <maximumFileSize value="1MB" />
    <staticLogFileName value="false" />
    <countDirection value="1" />
    <layout type="log4net.Layout.XMLLayout">
      <prefix value="" />
    </layout>
</appender>
... 

As an ardent log4net user, I am aware of the performance impact of rolling a large number of files: if the CountDirection setting of the RollingFileAppender is less than 0 (which it is by default), log4net renames every existing log file each time the log rolls over.  This is costly, and in our configuration it would mean up to 500 file renames on each roll; hence the countDirection value of 1 in the new configuration, which numbers new files upward and leaves existing files alone.

"Good thing I know what I'm doing...."

Several hours after firing up the new configuration, a colleague asked me to come look at the device.  It had slowed to a wounded crawl.  I went to dig into the logs: I popped over to the log drive and started looking for the most recent one.

... but there were 2000 log files, not 500.  The 2GB drive dedicated to the log was completely full.  And the application was still trying to write new log entries, which meant a slew of IO Exceptions were being continuously thrown by multiple threads.

"Um, I think I may have found the problem..."

I removed the oldest 1999 log files and the device immediately recovered.

So what happened?

The configuration XML is 100% correct.  The problem was that I accidentally deployed the software to the device with an old beta version of log4net (1.2.9); that particular version contains a bug in the RollingFileAppender code that prevents MaxSizeRollBackups from being honored when CountDirection is >= 0.  With no limit on the number of backup files, the software eventually filled the hard disk with log entries.
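
The moral: verify which log4net build actually ships with your application.  Here's a quick way to check an assembly's version from PowerShell, assuming the DLL sits in the application folder (the path is hypothetical):

> [Reflection.AssemblyName]::GetAssemblyName( 'c:\app\log4net.dll' ).Version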

Which brings me to my first death-by-logging mantra...

Logs Consume Space

It sounds silly, I know, but this is the single most prevalent antipattern I see in logging implementations.  Storage space is finite, and you need to make sure your logging activity doesn't consume more than its share.

I frequently see this when apps use FileAppender - this beast has no chains and, as I've stated elsewhere, you should never ever use it.  Ever.  Even in "little" applications it can cause massive trouble, because the log accumulates over process time and across application runs with no checks.  I've seen a 1KB application with 3GB of logs spanning almost a year of activity.

But don't think the problem is limited to the file-based appenders.  Remember....

  • memory is finite...
  • event logs fill up more often than you think...
  • a database can be configured to boundlessly expand as needed...
  • mail servers generally put caps on the size of an inbox...

Whichever appender strategies you choose, you should carefully consider the following:

... How much persistent storage are you allocating to the log?  Your answer should be a firm number, like "500MB", and not "the rest of the disk".  If you can, base this on the amount of information you need to have access to.  If a typical run of your application results in 10MB of logs, you can base the allocated size on the number of runs you want to persist.  If your application runs continuously - like a web site or windows service - you can plot out the logging activity over time, then gauge the allocation size from the time span of activity you want to persist.
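
To put a number behind that decision, it helps to measure how much log data your application actually produces per day; a rough sketch, assuming file-based logs collected in a single directory (the path is hypothetical):

> dir c:\app\logs | group { $_.LastWriteTime.Date } | select Name, @{ n='MB'; e={ ( $_.Group | measure Length -sum ).Sum / 1MB } }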

... How will you cope when the allocated storage is used up?  Some appenders, like the RollingFileAppender, can handle this for you by freeing up space used by old log entries.  Others, like the EventLogAppender or the AdoNetAppender, blithely log without regard to the amount of space being consumed, and it's up to you to manage the size of the log in other ways.  E.g., I've seen SQL jobs dedicated to removing log records older than N hours, or truncating the log table to the N newest records.

... What happens when you log to a full repository?  Plan for success, but understand the causes of failure.  As I recently learned, our application slows down significantly when the log drive is full, so checking the free space of the log drive is now included in our troubleshooting guide as a checklist item under "Application Performance Issues".  Test your application under limited logging resources to understand how it will behave.
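
That checklist item is easy to script; a minimal sketch using WMI, assuming the log lives on drive E: (substitute your own drive letter):

> get-wmiobject Win32_LogicalDisk -filter "DeviceID='E:'" | select DeviceID, @{ n='FreeGB'; e={ $_.FreeSpace / 1GB } }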

The most important thing to remember is that logging, like any other subsystem of your application, needs to be planned, tested, and verified.