Adding OCR info to a PDF
I have a good-quality scan of a document in PDF format.
How can I add OCR information to the PDF so that it becomes searchable? By searchable I mean that, when viewing the PDF in Evince, Ctrl+F actually lets me search the document's content.
pdfsandwich
It does what you want, and Ubuntu deb packages are available. It uses Tesseract as its OCR engine. The following call adds a text layer to your scanned PDF:
pdfsandwich scanned.pdf
The following does the same, but with another language and an explicit page layout. The language is given as an ISO 639-2 code, and the matching tesseract-ocr-LANGCODE package must be installed:
pdfsandwich -verbose -lang spa -layout single scanned.pdf
If you get an error, please download the latest deb package from SourceForge.
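To confirm that the result really is searchable before opening it in Evince, you can check whether any text can be extracted from it. This is only a minimal sketch: has_text_layer is a hypothetical helper name, and it assumes pdftotext from the poppler-utils package is installed.

```shell
# Hypothetical helper: succeeds if the PDF contains extractable text,
# i.e. it has a text layer. Assumes pdftotext (poppler-utils) is installed.
has_text_layer() {
    # "-" makes pdftotext write the extracted text to stdout;
    # grep -q succeeds as soon as it sees a non-whitespace character.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

By default pdfsandwich writes its result next to the input with an _ocr suffix, so after the first call above you could run: has_text_layer scanned_ocr.pdf && echo "searchable"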
Disclaimer: I'm the developer of pdfsandwich and therefore obviously biased.
There are two projects that do the trick: GScan2PDF and OCRFeeder.
I found a solution that is not ideal, but very effective.
I use PDF-XChange Viewer through Wine. Its OCR feature adds a text layer to the existing image-based PDF.
You can then search and copy text from this invisible layer.
An easy-to-use solution that produces an output PDF with the same quality as the input file and a reasonable size is OCRmyPDF:
https://github.com/jbarlow83/OCRmyPDF
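For reference, the basic invocation is ocrmypdf INPUT OUTPUT, with -l selecting the Tesseract language. The wrapper below is just a sketch of how that might be used: ocr_pdf is a hypothetical helper name, and the _ocr.pdf output naming and the eng default are my own assumptions, not something OCRmyPDF does by itself.

```shell
# Hypothetical helper: OCR one PDF with OCRmyPDF and print the output name.
# Usage: ocr_pdf FILE.pdf [LANG]   (LANG is a Tesseract code, default: eng)
ocr_pdf() {
    # ${1%.pdf} strips a trailing ".pdf", so scan.pdf becomes scan_ocr.pdf.
    # The output file name is only printed if ocrmypdf succeeded.
    ocrmypdf -l "${2:-eng}" "$1" "${1%.pdf}_ocr.pdf" &&
        printf '%s\n' "${1%.pdf}_ocr.pdf"
}
```

For example, ocr_pdf scanned.pdf spa would write scanned_ocr.pdf using the Spanish language data, provided the corresponding Tesseract language package is installed.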