Any tools to automate OCR of scanned PDF files in a manner similar to Acrobat's OCR feature?

Open source preferred, but not necessary.

I've got Adobe Acrobat 8, and really like the OCR feature which can essentially put an invisible layer of OCR'd text on top of a scanned document. Thus what you see on screen is the original scanned document, but the result is searchable.

What I'm looking for is a way to automate this process. I've currently got a few scripts that we use for processing and archiving scanned files, and I'm looking for something that I can plug right into this batch process to do OCR in a manner similar to what I can do with Acrobat.

All suggestions welcome, thanks!


I have this implemented in a company document archiving project. Each scanned page is a single-page TIFF file. I use Cuneiform to create an hOCR file from the TIFF, then hocr2pdf to output the PDF file. If a scan has multiple pages, I use gs (Ghostscript) to combine the per-page PDFs into a single PDF document. It works really well: the OCR is good enough for our needs, and the result is searchable in any PDF viewer.
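
For anyone who wants to plug this into a batch script, here is a rough sketch of the pipeline in Python. It is not my production code, just an illustration under a few assumptions: `cuneiform`, `hocr2pdf` (from ExactImage) and `gs` are on the PATH, and the flags shown match your installed versions (check each tool's help output, since they can differ between releases). The function names and the `work` directory are made up for the example.

```python
import subprocess
from pathlib import Path


def ocr_commands(tif, workdir):
    """Build the two commands for one single-page TIFF (hypothetical helper).

    Returns (hocr_cmd, pdf_cmd, hocr_path, pdf_path).
    """
    tif, workdir = Path(tif), Path(workdir)
    hocr = workdir / (tif.stem + ".hocr")
    pdf = workdir / (tif.stem + ".pdf")
    # Cuneiform: recognise the page and write hOCR output (assumed flags).
    hocr_cmd = ["cuneiform", "-f", "hocr", "-o", str(hocr), str(tif)]
    # hocr2pdf: reads the hOCR from stdin and lays the invisible text
    # over the original image (assumed flags).
    pdf_cmd = ["hocr2pdf", "-i", str(tif), "-o", str(pdf)]
    return hocr_cmd, pdf_cmd, hocr, pdf


def merge_command(pdfs, output):
    """Build the Ghostscript command that joins per-page PDFs into one file."""
    return ["gs", "-q", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite",
            f"-sOutputFile={output}", *[str(p) for p in pdfs]]


def build_searchable_pdf(tifs, output, workdir="work"):
    """Run the whole pipeline: TIFF pages -> hOCR -> per-page PDF -> merged PDF."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    pages = []
    for tif in tifs:
        hocr_cmd, pdf_cmd, hocr, pdf = ocr_commands(tif, workdir)
        subprocess.run(hocr_cmd, check=True)
        with open(hocr, "rb") as fh:
            subprocess.run(pdf_cmd, stdin=fh, check=True)
        pages.append(pdf)
    subprocess.run(merge_command(pages, output), check=True)
```

Calling `build_searchable_pdf(["page1.tif", "page2.tif"], "document.pdf")` would then produce one searchable PDF from the two scanned pages. Building the command lists in separate functions makes it easy to log or dry-run them before touching any files.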


Have you looked at WatchOCR? You can download it from http://www.watchocr.com. It is a free and open source OCR server that transforms image-only PDFs into text-searchable PDFs from a watched folder or network share.