I hate PDFs. 

And now I need to search through several hundred of them, ranging from 30 to 300 pages in length, for cross-references and personnel names which ... um ... well, let's just say they no longer apply.  Sure reader has the search feature built-in, so does explorer, but that's so 1980's.  And I sure don't want to do each one manually...

I poked around the 'net for a few minutes to find a way to read PDFs in powershell, but no donut.  So I rolled my own cmdlet around the iTextSharp library and Zollor's PDF to Text converter project.

There isn't much to the cmdlet code, given that all of the hard work of extracting the PDF text is done in the PDFParser class of the converter project:

using System;
using System.IO;
using System.Management.Automation;
namespace PowerShell.PDF
{
    [Cmdlet( VerbsData.ConvertFrom, "PDF" )]
    public class ConvertFromPDF : Cmdlet
    {
        [Parameter( ValueFromPipeline = true, Mandatory = true )]
        public string PDFFile { get; set; }
        
        protected override void ProcessRecord()
        {
            var parser = new PDFParser();
            using( Stream s = new MemoryStream() )
            {
                if( ! parser.ExtractText(File.OpenRead(PDFFile), s) )
                {
                    WriteError( 
                        new ErrorRecord(
                            new ApplicationException(),
                            "failed to extract text from pdf",
                            ErrorCategory.ReadError,
                            PDFFile
                        )    
                    );
                    return;
                }
                s.Position = 0;
                using( StreamReader reader = new StreamReader( s ) )
                {
                    WriteObject( reader.ReadToEnd() );
                }
            }
        }
    }
}

The code accepts a file path as input; it runs the conversion on the PDF data and writes the text content of the file to the pipeline.  Not pretty, but done.

Usage

Here is the simple case of transforming a single file:

> convertfrom-pdf -pdf my.pdf

or

> my.pdf | convertfrom-pdf 

More complex processing can be accomplished using PowerShell's built-in features; e.g., to convert an entire directory of PDFs to text files:

> dir *.pdf | %{ $_ | convertfrom-pdf | out-file "$_.txt" } 

More relevant to my current situation would be something along these lines:

> dir *.pdf | ?{ ( $_ | convertfrom-pdf ) -match "ex-employee name" } 

Download the source: PowerShell.PDF.zip (1.10 mb) 

Enjoy!