Validate/verify PDF file integrity
It is quite easy to check whether a PDF file is valid by using PDFtk. A free GUI for PDFtk is available from PDF Labs. When you run this tool, you can load as many PDFs as you wish from multiple directories (using the Add files button), and it then starts accessing the pages in these PDF files very quickly.
If any of the selected files is not a valid PDF, the utility shows an error message and automatically removes that file from the selection window.
This procedure can save you many hours. Additionally, if you have a multicore CPU, you can run multiple instances of the utility and load hundreds of PDFs into each instance.
I have been using this software for the past year, and it is the handiest PDF tool I have ever used.
I've used "pdfinfo.exe" from the xpdfbin-win package and cpdf.exe to check PDF files for corruption, but I didn't want to involve a binary if it wasn't necessary.
I read that newer PDF formats have a readable XML data catalog at the end, so I opened a PDF in the regular Windows NOTEPAD.exe, scrolled past the unreadable data to the end, and saw several readable keys. I only needed one key, but chose to use both CreationDate and ModDate.
The following PowerShell (PS) script checks ALL the PDF files in the current directory and writes the status of each to a text file (!RESULTS.log). It took about 2 minutes to run against 35,000 PDF files. I tried to add comments for those who are new to PS. Hope this saves someone some time. There's probably a better way to do this, but it works flawlessly for my purposes and handles errors silently. If you see errors on screen, you might need to define the following at the beginning: $ErrorActionPreference = "SilentlyContinue"
Copy the following into a text file and name it appropriately (e.g. CheckPDF.ps1), or open PS, browse to the directory containing the PDF files to check, and paste it into the console.
#
# PowerShell v4.0
#
# Get all PDF files in current directory
#
$items = Get-ChildItem | Where-Object {$_.Extension -eq ".pdf"}
$logFile = "!RESULTS.log"
$badCounter = 0
$goodCounter = 0
$msg = "`n`nProcessing " + $items.count + " files... "
Write-Host -nonewline -foregroundcolor Yellow $msg
foreach ($item in $items)
{
    #
    # Suppress error messages
    #
    trap { Write-Output "Error trapped"; continue; }
    #
    # Read raw PDF data (-LiteralPath avoids wildcard issues with [ ] in file names)
    #
    $pdfText = Get-Content -LiteralPath $item.FullName -Raw
    #
    # Find strings (near end of PDF file); IndexOf returns -1 if a string is missing (BAD file)
    #
    $ptr1 = $pdfText.IndexOf("CreationDate")
    $ptr2 = $pdfText.IndexOf("ModDate")
    #
    # Grab raw dates from file - SubString throws if a ptr is -1
    #
    try { $cDate = $pdfText.SubString($ptr1, 37); $mDate = $pdfText.SubString($ptr2, 31); }
    #
    # Append filename and bad status to logfile and increment a counter.
    # The catch block is also where you would rename, move, or delete bad files.
    #
    catch { "*** $item is Broken ***" >> $logFile; $badCounter += 1; continue; }
    #
    # Append filename and good status to logfile
    #
    Write-Output "$item - OK" -EA "Stop" >> $logFile
    #
    # Increment a counter
    #
    $goodCounter += 1
}
#
# Calculate total
#
$totalCounter = $badCounter + $goodCounter
#
# Append 3 blank lines to end of logfile
#
1..3 | %{ Write-Output "" >> $logFile }
#
# Append statistics to end of logfile
#
Write-Output "Total: $totalCounter / BAD: $badCounter / GOOD: $goodCounter" >> $logFile
Write-Output "DONE!`n`n"
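For anyone on Linux or macOS, the same heuristic can be sketched in shell. This is not the original script, just a rough equivalent: has_pdf_dates and check_dir are hypothetical helper names, and the check shares the same limitation as the PS version, since a readable PDF that happens to lack these keys in plain text is also reported as broken.

```shell
# Sketch: flag PDFs that do not contain both date keys in plain text.
has_pdf_dates() {
    # -a treats the binary file as text; -q only sets the exit status.
    grep -aq 'CreationDate' "$1" && grep -aq 'ModDate' "$1"
}

check_dir() {
    local dir=${1:-.} file good=0 bad=0
    for file in "$dir"/*.pdf; do
        # Skip the literal pattern when no file matches.
        [ -e "$file" ] || continue
        if has_pdf_dates "$file"; then
            good=$((good + 1))
        else
            echo "*** $file is Broken ***"
            bad=$((bad + 1))
        fi
    done
    echo "Total: $((good + bad)) / BAD: $bad / GOOD: $good"
}

# Example: check_dir . >> '!RESULTS.log'
```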
Following the footsteps of @n0nuf, I wrote a batch script to check all PDFs in a specific folder with pdfinfo and push it through cpdf if broken as an attempt to fix them:
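@ECHO OFF
REM Create the backup folder for broken originals if it does not exist yet
IF NOT EXIST bak MKDIR bak
FOR %%f in (*.PDF) DO (
    echo %%f
    pdfinfo "%%f" 2>&1 | findstr /I "error" >nul 2>&1
    if not errorlevel 1 (
        echo "bad -> try to fix"
        REM Quote the names so files with spaces work
        cpdf -i "%%f" -o "%%f_.pdf" 2>NUL
        move "%%f" "bak\%%f" >NUL
    ) else (
        REM echo good
    )
)
@ECHO ON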
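Or the same as a bash script:
# Collect the file list first, NUL-separated so names with spaces work (needs bash 4.4+)
mapfile -d '' files < <(find . -iname "*.pdf" -print0)
for file in "${files[@]}"
do
    echo "$file"
    # grep -q sets the exit status without printing the matching lines
    if pdfinfo "$file" 2>&1 | grep -qi 'error'; then
        echo "broken -> try to fix"
        cpdf -i "$file" -o "${file}_.pdf"
    fi
done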
In the batch version, broken PDFs are moved to a subfolder \bak, and the recreated PDFs get the suffix _.pdf (which is not perfect, but good enough for me). NOTE: A recreated PDF contains fewer errors and should be viewable with a regular PDF viewer. But this does not mean you get all your content back. Unrecoverable content leads to empty pages.
I also tried the same with JHOVE (open source file format identification, validation & characterization tool) as suggested by @kraftydevil here: Check if PDF files are corrupted using command line on Linux, and can now confirm this is also a valid approach. (At first I had less success, but then I noticed I had not handled the output of JHOVE correctly.)
To test both approaches I deleted and altered random parts of a PDF with a text editor (removed streams, so pages failed to render in my PDF viewer, altered PDF tags, and shifted some bits). The result: both pdfinfo and JHOVE are able to spot damaged files correctly (JHOVE was even more sensitive in some cases).
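If you want a quick, repeatable way to produce a damaged test file, truncating a copy is one option. This is a simpler kind of damage than the manual edits described above, so treat it as a rough sketch only; make_truncated_copy and its arguments are hypothetical names chosen for illustration.

```shell
# Sketch: write the first N percent of a PDF to a new file.
# Cutting off the end normally removes the trailer data a reader needs.
make_truncated_copy() {
    local src=$1 dst=$2 keep_percent=${3:-50}
    local size keep
    size=$(wc -c < "$src")
    keep=$((size * keep_percent / 100))
    head -c "$keep" "$src" > "$dst"
}

# Example: make_truncated_copy sample.pdf sample_truncated.pdf 80
```

Run the checker of your choice against the truncated copy and the untouched original to see whether it tells them apart.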
And here is the equivalent script for JHOVE:
@ECHO OFF
FOR %%f in (*.PDF) DO (
    echo %%f
    REM Quote the file name so names with spaces work
    "C:\Program Files (x86)\JHOVE\jhove.bat" -m pdf-hul "%%f" | findstr /C:"Well-Formed and valid" >nul 2>&1
    if not errorlevel 1 (
        echo good
    ) else (
        echo "bad -> try to fix"
        cpdf -i "%%f" -o "%%f_.pdf" 2>NUL
        REM move "%%f" "bak\%%f"
    )
)
@ECHO ON
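A bash sketch of the same JHOVE check might look like the following. It assumes a jhove command on the PATH (on Windows the script above calls jhove.bat instead) and relies on the same "Well-Formed and valid" phrase the batch script searches for; the function names are made up for this example. Matching the full phrase matters, because searching only for "Well-Formed" would also accept files that JHOVE reports as not valid.

```shell
# Sketch: decide validity from JHOVE's text output read on stdin.
jhove_output_is_valid() {
    # -F matches the phrase literally; -q only sets the exit status.
    grep -qF 'Well-Formed and valid'
}

check_with_jhove() {
    local file
    for file in "$@"; do
        # Assumption: a "jhove" command is available on the PATH.
        if jhove -m pdf-hul "$file" 2>/dev/null | jhove_output_is_valid; then
            echo "good  $file"
        else
            echo "bad   $file"
        fi
    done
}

# Example: check_with_jhove ./*.pdf
```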
There's also the (relatively new) pdfcpu library/tool, which has validation functionality:
pdfcpu validate whatever.pdf
Note that at the time of writing (August 2020) pdfcpu is still in Alpha stage.
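To run it over several files, a small loop is enough. The sketch below assumes that pdfcpu exits with a non-zero status when validation fails, so check that against your version; validate_with_pdfcpu is a hypothetical helper name.

```shell
# Sketch: run "pdfcpu validate" on each file and count the failures.
validate_with_pdfcpu() {
    local file failed=0
    for file in "$@"; do
        # Assumption: a non-zero exit status means validation failed.
        if ! pdfcpu validate "$file" >/dev/null 2>&1; then
            echo "invalid: $file"
            failed=$((failed + 1))
        fi
    done
    echo "$failed of $# file(s) failed validation"
    # Succeed only if every file validated.
    [ "$failed" -eq 0 ]
}

# Example: validate_with_pdfcpu ./*.pdf
```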