OCR on PDFs in OS X with free, open source tools

After reading these blog posts:

  • Linux, OCR and PDF - Problem solved
  • Creating a searchable PDF with opensource tools ghostscript, hocr2pdf and tesseract-ocr
  • Using Tesseract OCR with PDF scans

and going through the snippet below (from this gist) for Linux, I think I found a method to OCR a multi-page PDF and get a PDF in the output that could also work in OS X. Most of the dependencies are available in homebrew (brew install tesseract and brew install imagemagick), except one, hocr2pdf.

I haven't been able to find a port of it for OS X. Is there one available? If not, how can one OCR a multi-page PDF and get the results back again in a multi-page PDF in OS X, using free, open source tools?

#!/bin/bash

# This is a script to transform a PDF containing a scanned book into a searchable PDF.
# Based on previous script and many good tips by Konrad Voelkel:
# http://blog.konradvoelkel.de/2010/01/linux-ocr-and-pdf-problem-solved/
# http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/
# Depends on convert (ImageMagick), pdftk and hocr2pdf (ExactImage).
# $ sudo apt-get install imagemagick pdftk exactimage
# You also need at least one OCR software which can be either tesseract or cuneiform.
# $ sudo apt-get install tesseract-ocr
# $ sudo apt-get install cuneiform
# To install languages into tesseract do (e.g. for Portuguese):
# $ sudo apt-get install tesseract-ocr-por

echo "usage: ./pdfocr.sh document.pdf ocr-sfw split lang author title"
# where ocr-sfw is either tesseract or cuneiform
# split is either 0 (already single-paged) or 1 (2 book-pages per pdf-page)
# lang is a language as in "tesseract --list-langs" or "cuneiform -l".
# and author, title are used for the PDF metadata.
#
# usage example:
# ./pdfocr.sh SomeFile.pdf tesseract 1 por "Some Author" "Some Title"
pdftk "$1" burst dont_ask
for f in pg_*.pdf
do
if [ "1" == "$3" ]; then
convert -normalize -density 300 -depth 8 -crop 50%x100% +repage $f "$f.png"
else
convert -normalize -density 300 -depth 8 $f "$f.png"
fi
done
rm pg_*.pdf

for f in pg_*.png
do
if [ "tesseract" == "$2" ]; then
tesseract -l $4 -psm 1 $f $f hocr
elif [ "cuneiform" == "$2" ]; then
cuneiform -l $4 -f hocr -o "$f.html" $f
else
echo "$2 is not a valid OCR software."
fi
hocr2pdf -i $f -r 300 -s -o "$f.pdf" < "$f.html"
done

pdftk pg_*.pdf cat output merged.pdf

pdftk merged.pdf update_info_utf8 doc_data.txt output merged+data.pdf
echo "InfoBegin" > in.info
echo "InfoKey: Author" >> in.info
echo "InfoValue: $5" >> in.info
echo "InfoBegin" >> in.info
echo "InfoKey: Title" >> in.info
echo "InfoValue: $6" >> in.info
echo "InfoBegin" >> in.info
echo "InfoKey: Creator" >> in.info
echo "InfoValue: PDF OCR scan script" >> in.info
in_filename="${1%.*}"
pdftk merged+data.pdf update_info_utf8 in.info output "$in_filename-ocr.pdf"

rm -r doc_data.txt in.info merged* pg_*

Tesseract 3.03+ has built in support for PDF output. Which requires leptonica to be installed. You can use: brew install tesseract --HEAD to get the latest version of tesseract. You will also need ghostscript installed but no need for hocr2pdf.

The following script uses ghostscript to split the PDF into JPEGs, tesseract to OCR the JPEGs and output single PDF pages, and finally ghostscript again to combine the pages back into one PDF.

#!/bin/sh

y="`pwd`/$1"
echo Will create a searchable PDF for $y

x=`basename "$y"`
name=${x%.*}

mkdir "$name"
cd "$name"

# splitting to individual pages
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o out_%04d.jpg -f "$y"

# process each page
for f in $( ls *.jpg ); do
  # extract text
  tesseract -l eng -psm 3 $f ${f%.*} pdf
  rm $f
done

# combine all pages back to a single file
gs -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile="../${name}_searchable.pdf" *.pdf

cd ..
rm -rf "${name}"

# Adapted from: http://www.morethantechnical.com/2013/11/21/creating-a-searchable-pdf-with-opensource-tools-ghostscript-hocr2pdf-and-tesseract-ocr/
# from http://www.ehow.com/how_6874571_merge-pdf-files-ghostscript.html
# bash tut: http://linuxconfig.org/bash-scripting-tutorial
# Linux PDF,OCR: http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/

I use tesseract on os x too. Wrote about automating it briefly here.