Automator-script with an OCR-software to automatically add OCR to material? [duplicate]

Solution 1:

It's not entirely clear to me what your requirements are for being able to "script" this from the "command line".

If you are talking about automation, then that is possible with any number of utilities.

ABBYY FineReader Express + Keyboard Maestro + Hazel

I use ABBYY FineReader Express + Keyboard Maestro + Hazel like so:

  1. Hazel monitors a given folder for any new PDFs.

  2. If a PDF is found, it is opened in "ABBYY FineReader Express".

  3. Keyboard Maestro then automates the process of turning the PDF into a searchable PDF (OCR) and saves the file to a different directory.
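If you would rather script the same watch-folder pattern yourself, the three steps can be approximated in plain shell. This is only a sketch and not the Hazel/Keyboard Maestro setup itself: `ocr_inbox` and `OCR_CMD` are made-up names, and it assumes some OCR command that is called as `command input.pdf output.pdf` (it defaults to `ocrmypdf` purely as an example).

```shell
# Process every PDF currently sitting in an inbox folder.
#   $1 = inbox folder, $2 = folder for OCRed output, $3 = folder for originals
# OCR_CMD is a hypothetical variable naming the OCR command to run; it is
# invoked as: $OCR_CMD input.pdf output.pdf
ocr_inbox() (
    inbox=$1
    outbox=$2
    done_dir=$3
    mkdir -p "$outbox" "$done_dir"
    for pdf in "$inbox"/*.pdf; do
        [ -e "$pdf" ] || continue            # the glob matched nothing
        name=$(basename "$pdf")
        if "${OCR_CMD:-ocrmypdf}" "$pdf" "$outbox/$name"; then
            mv "$pdf" "$done_dir/$name"      # move the original out of the inbox
        else
            echo "OCR failed for $pdf" >&2   # leave it in place for a retry
        fi
    done
)
```

Running something like `ocr_inbox ~/Scans/in ~/Scans/ocr ~/Scans/done` from cron or launchd every few minutes would stand in for the folder monitoring.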

Now, if you don't own Hazel and Keyboard Maestro already, your initial costs are going to rise pretty quickly (although I depend on both so much I consider them a bargain).

PDFpen + AppleScript + Folder Actions

You could do something similar with PDFpen (or PDFpenPro), Folder Actions, and AppleScript. See https://gist.github.com/prenagha/1355037 for one example.

Marco Arment did a survey of OCR apps for Mac and found that PDFpen had great results and was easy to automate.

A Google search for "PDFpen applescript OCR" will turn up a number of alternatives.

Solution 2:

What you want is Tesseract OCR. It's an open-source OCR engine whose development was long sponsored by Google, and it supports a variety of platforms. It also has a native command-line interface. It's available from MacPorts as well as Homebrew.

Project Home: https://github.com/tesseract-ocr

How to install on OS X: http://blog.matt-swain.com/post/26419042500/installing-tesseract-ocr-on-mac-os-x-lion

Usage Example (Tesseract reads image files such as PNG or TIFF, not PDFs, so rasterize PDF pages first): tesseract input.png output -l eng
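To make the invocation easier to reuse in a script, here is a small sketch of two wrappers. The function names are made up for illustration. It relies on Tesseract's documented `imagename outputbase [options] [configfile]` argument order and on the `pdf` config that ships with Tesseract 3.03 and later.

```shell
# OCR one image to plain text.
#   $1 = input image, $2 = output base name, $3 = language (default: eng)
# Tesseract appends ".txt" to the base name itself.
ocr_image_to_text() (
    tesseract "$1" "$2" -l "${3:-eng}"
)

# OCR one image to a searchable PDF (the image with a hidden text layer).
# The trailing "pdf" names a Tesseract config file; the result is "$2.pdf".
ocr_image_to_pdf() (
    tesseract "$1" "$2" -l "${3:-eng}" pdf
)
```

For example, `ocr_image_to_pdf scan.png scan` should leave a `scan.pdf` next to the image.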

Solution 3:

Disclaimer: NOT AN OCR SOLUTION (but this answer is still useful for extracting text from PDFs)

There is an Apache Software Foundation project called Apache Tika:

A toolkit that detects and extracts metadata and structured text content from various documents using existing parser libraries

They support PDF text extraction using PDFBox:

allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. Apache PDFBox also includes several command line utilities

And they recently also added support for OCR (via Tesseract).

For a text-based solution, PDFBox makes it very simple to extract text from a PDF:

  • Download the pdfbox-app package from https://pdfbox.apache.org/downloads.html
  • Run the ExtractText command on your PDF:

    java -jar pdfbox-app-x.y.z.jar ExtractText myNiceBook.pdf myNiceBook.txt

It also has some other nice options that you can see in the ExtractText docs.
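If you have a whole folder of PDFs, the same command can be looped. A minimal sketch, assuming the command syntax shown above (newer PDFBox releases may use a different subcommand syntax, so check the docs for your version); `extract_text_all` and `PDFBOX_JAR` are made-up names, with `PDFBOX_JAR` pointing at the jar you downloaded.

```shell
# Run PDFBox's ExtractText on every PDF in a folder.
#   $1 = folder containing the PDFs
# PDFBOX_JAR is a hypothetical variable holding the path to pdfbox-app-x.y.z.jar.
extract_text_all() (
    dir=$1
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue    # the glob matched nothing
        # Write book.txt next to book.pdf
        java -jar "$PDFBOX_JAR" ExtractText "$pdf" "${pdf%.pdf}.txt"
    done
)
```

Usage would look like `PDFBOX_JAR=~/Downloads/pdfbox-app-x.y.z.jar extract_text_all ~/Books`.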

Solution 4:

An easy-to-use solution that produces an output PDF of the same quality as the input file, at a reasonable size, is OCRmyPDF:

https://github.com/jbarlow83/OCRmyPDF
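A typical call is just input and output file names. The wrapper below is a sketch with a made-up function name; the `-l` and `--skip-text` options are from OCRmyPDF's command line as I know it, so confirm them against `ocrmypdf --help` on your install.

```shell
# Add a text layer to a PDF with OCRmyPDF.
#   $1 = input PDF, $2 = output PDF, $3 = language (default: eng)
ocr_pdf() (
    input=$1
    output=$2
    lang=${3:-eng}
    # --skip-text leaves pages that already contain text untouched
    ocrmypdf -l "$lang" --skip-text "$input" "$output"
)
```

For example: `ocr_pdf scan.pdf scan-ocr.pdf`.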

Solution 5:

You can make the contents of an existing PDF searchable by converting it into a text file. For that you need at least ImageMagick, Ghostscript (for PDF conversion), and the Tesseract OCR tool.

A command-line example:

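$ wget http://www.fmwconcepts.com/misc_tests/pdf_tests/test.pdf
$ convert -density 300 -depth 8 test.pdf test.png
$ # Single-page PDF assumed; a multi-page PDF yields test-0.png, test-1.png, ...
$ tesseract test.png test    # Tesseract appends .txt to the output base name
$ grep -i --color=auto the test.txt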
**The** details as told by surviving crew members, to **the** German publication Spiegel and published on ABC's

This can be extended further to your needs.
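One such extension is handling multi-page PDFs, where ImageMagick writes one image per page. The sketch below uses a made-up function name and assumes the classic `convert` command (newer ImageMagick releases may expect `magick` instead). It rasterizes each page, OCRs the pages in order, and concatenates the text.

```shell
# OCR a (possibly multi-page) PDF into a single text file.
#   $1 = input PDF, $2 = output text file, $3 = language (default: eng)
pdf_to_text() (
    pdf=$1
    txt=$2
    lang=${3:-eng}
    work=$(mktemp -d) || exit 1
    trap 'rm -rf "$work"' EXIT       # clean up page images when done
    # One PNG per page; %03d zero-pads the page index so the glob sorts in order
    convert -density 300 -depth 8 "$pdf" "$work/page-%03d.png" || exit 1
    : > "$txt"                       # start with an empty output file
    for png in "$work"/page-*.png; do
        [ -e "$png" ] || continue
        tesseract "$png" "${png%.png}" -l "$lang" || exit 1
        cat "${png%.png}.txt" >> "$txt"
    done
)
```

After `pdf_to_text test.pdf test.txt`, the `grep` from the example above works on `test.txt` unchanged.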

To install the required tools on OS X, you can use Homebrew:

brew install imagemagick jpeg libpng ghostscript tesseract

On Linux, use apt-get or yum instead of brew (package names may differ slightly between distributions).

For more OCR tools, check: OCR on Linux systems

Related:

  • Doing OCR Using Command Line Tools in Linux
  • Working with PDFs Using Command Line Tools in Linux