Search PDF contents with PowerShell and output a file list
Here's what I'm trying to do:
I have a huge mess of files (around ten thousand) in various formats. Each file can be assigned a certain type (e.g. product sheet, business plan, offer, presentation, etc.). The files are in no particular order and might as well be looked at as a single list. I'm interested in creating a catalogue by type.
The idea is that, for a certain format and a certain type, I know what keywords to look for in the file's contents. I would like to have a PowerShell script that runs a series of searches, each one finding all the files of a certain format that contain specific keywords and writing that list to a separate csv. The crucial point here is that the keyword will be in the content (body of a pdf, cell of an Excel sheet, etc.) and not in the filename. So far I've tried the following:
get-childitem -Recurse | where {!$_.PSIsContainer} |
select-object FullName, LastWriteTime, Length, Extension | export-csv -notypeinformation -delimiter '|' -path C:\Users\Uzer\Documents\file.csv -encoding default
That is nice and gives me the complete list of files including their size and extension. I'm looking for something similar but filtering by content. Any ideas?
Edit: based on the solution below, here's the new code:
$searchstring = "foo"
$directory = Get-ChildItem -include ('*.pdf') -Path "C:\Users\Uzer\Searchfolder" -Recurse
foreach ($obj in $directory)
{Get-Content $obj.fullname | Where-Object {$_.Contains($searchstring)}| select-object FullName, LastWriteTime, Length, Extension | export-csv -notypeinformation -delimiter '|' -path C:\Users\Uzer\Documents\file2.csv -encoding default}
However I get a bunch of these errors:
An object at the specified path C:[blabla]\filename.pdf does not exist, or has been filtered by the -Include or -Exclude parameter.
Solution 1:
This uses PowerShell with itextsharp.dll. The script below checks the text on each page of each pdf for keywords, then exports any matches to a csv. You can build on this to rename files when matches are found, move them to categorized folders, and the like.
EDIT: The GitHub page for iTextSharp indicates it is end-of-life and links to iText 7: https://github.com/itext/itext7-dotnet (dual licensed as AGPL/commercial software; it seems free for non-commercial use).
Add-Type -Path "C:\path_to_dll\itextsharp.dll"
$pdfs = gci "C:\path_to_pdfs" *.pdf
$export = "C:\path_to_export\export.csv"
$results = @()
$keywords = @('Keyword1','Keyword2','Keyword3')
foreach($pdf in $pdfs) {
    Write-Host "processing -" $pdf.FullName
    # prepare the pdf
    $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $pdf.FullName
    # for each page
    for($page = 1; $page -le $reader.NumberOfPages; $page++) {
        # set the page text
        $pageText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader,$page).Split([char]0x000A)
        # if the page text contains any of the keywords we're evaluating
        foreach($keyword in $keywords) {
            if($pageText -match $keyword) {
                $response = @{
                    keyword = $keyword
                    file = $pdf.FullName
                    page = $page
                }
                $results += New-Object PSObject -Property $response
            }
        }
    }
    $reader.Close()
}
Write-Host ""
Write-Host "done"
$results | epcsv $export -NoTypeInformation
The console output:
processing - C:\path_to_pdfs\1.pdf
processing - C:\path_to_pdfs\2.pdf
processing - C:\path_to_pdfs\3.pdf
processing - C:\path_to_pdfs\4.pdf
processing - C:\path_to_pdfs\5.pdf
done
PS C:\>
The csv output:
keyword   page  file
Keyword2  14    C:\path_to_pdfs\3.pdf
Keyword3  22    C:\path_to_pdfs\3.pdf
Keyword1  6     C:\path_to_pdfs\5.pdf
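Since the question asks for one list per type, the combined results can also be split into a separate csv per keyword. This is only a minimal sketch: it assumes result objects with the keyword, file and page properties built in the script above, and the function name Export-ResultsByKeyword and the output folder are hypothetical.

```powershell
# Hypothetical helper (not part of iTextSharp): writes one csv per distinct keyword.
function Export-ResultsByKeyword {
    param(
        [Parameter(Mandatory)] [object[]] $Results,   # objects with keyword, file, page
        [Parameter(Mandatory)] [string]   $OutputDir  # folder that receives the csv files
    )

    # Create the output folder if it does not exist yet
    if (-not (Test-Path -LiteralPath $OutputDir)) {
        New-Item -ItemType Directory -Path $OutputDir | Out-Null
    }

    # One group per keyword, one csv per group
    $Results | Group-Object -Property keyword | ForEach-Object {
        $csvPath = Join-Path $OutputDir ("{0}.csv" -f $_.Name)
        $_.Group |
            Select-Object keyword, file, page |
            Export-Csv -LiteralPath $csvPath -NoTypeInformation
        $csvPath   # emit the path of each file written
    }
}
```

It could be called at the end of the script as Export-ResultsByKeyword -Results $results -OutputDir "C:\path_to_export\by_keyword" (the folder is an assumption). Keywords containing characters that are not valid in file names would need to be sanitized first.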
Solution 2:
If the contents of the files are indexed by Windows Search, you can query the system index instead. You may need to install an iFilter to ensure that Windows indexes PDFs, but this method then works with pdf, text files, xlsx files, etc.
$searchString = "foo"
$searchPath = "C:\Users\Uzer\Searchfolder"
$sql = "SELECT System.ItemPathDisplay, System.DateModified, " +
"System.Size, System.FileExtension FROM SYSTEMINDEX " +
"WHERE SCOPE = '$searchPath' AND FREETEXT('$searchstring')"
$provider = "provider=search.collatordso;extended properties='application=windows';"
$connector = new-object system.data.oledb.oledbdataadapter -argument $sql, $provider
$dataset = new-object system.data.dataset
if ($connector.fill($dataset)) { $dataset.tables[0] }
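To get the per-keyword csv files the question asks for, the query can be wrapped in a function and run once per keyword. A rough sketch, assuming Windows Search is running and has indexed the folder. The function names New-IndexQuery and Search-Index are hypothetical, and doubling single quotes is an assumption about how to keep a search term from breaking the SQL string.

```powershell
# Hypothetical helper: builds the Windows Search SQL text for one search term.
function New-IndexQuery {
    param([string] $SearchPath, [string] $SearchString)

    # Assumption: doubling single quotes keeps them inside the SQL string literals
    $path = $SearchPath.Replace("'", "''")
    $term = $SearchString.Replace("'", "''")

    $sql = "SELECT System.ItemPathDisplay, System.DateModified, " +
           "System.Size, System.FileExtension FROM SYSTEMINDEX " +
           "WHERE SCOPE = '$path' AND FREETEXT('$term')"
    return $sql
}

# Hypothetical helper: runs the query against the index and returns the rows (Windows only).
function Search-Index {
    param([string] $SearchPath, [string] $SearchString)

    $sql       = New-IndexQuery -SearchPath $SearchPath -SearchString $SearchString
    $provider  = "provider=search.collatordso;extended properties='application=windows';"
    $connector = New-Object System.Data.OleDb.OleDbDataAdapter -ArgumentList $sql, $provider
    $dataset   = New-Object System.Data.DataSet
    if ($connector.Fill($dataset)) { $dataset.Tables[0] }
}
```

A possible way to use it, with a placeholder keyword list and the output folder from the question:

```powershell
foreach ($keyword in 'foo', 'bar') {
    Search-Index -SearchPath "C:\Users\Uzer\Searchfolder" -SearchString $keyword |
        Export-Csv -NoTypeInformation -Path "C:\Users\Uzer\Documents\$keyword.csv"
}
```

The rows come back as DataRow objects, so the csv may contain a few bookkeeping columns next to the four selected ones; piping through Select-Object first would trim them.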