How to replace images of text in PDFs with formatted text using OCR

Solution 1:

Even Adobe's own software is not good at doing this, or at making clear how to do it.

With Adobe Acrobat X, you can create a text layer through the menus (View | Tools | Recognize Text) or by clicking Tools in the toolbar and then Recognize Text in the Tools pane.

You then have options to perform OCR on the document or find "suspects". The "suspects" are possible OCR results that don't look right (perhaps words that fail a spellcheck?). Once you have gone through the suspects, there doesn't seem to be any way to access or edit the text layer again, short of redoing the OCR.
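
I don't know how Acrobat actually picks its suspects, but the general idea can be sketched in a few lines of Python. This is only an illustration, not Adobe's method: the OcrWord type, the confidence score, the 0.8 threshold and the tiny dictionary are all hypothetical stand-ins for whatever a real OCR engine would provide.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    confidence: float  # hypothetical 0.0-1.0 score reported by an OCR engine


def find_suspects(words, dictionary, min_confidence=0.8):
    """Return words that score below the threshold or fail a dictionary lookup."""
    suspects = []
    for word in words:
        # Ignore surrounding punctuation and case when checking the dictionary.
        normalized = word.text.strip(".,;:!?\"'()").lower()
        low_confidence = word.confidence < min_confidence
        unknown = bool(normalized) and normalized not in dictionary
        if low_confidence or unknown:
            suspects.append(word)
    return suspects


# Made-up OCR output: "qu1ck" is misread, "brown" is read with low confidence.
words = [
    OcrWord("The", 0.99),
    OcrWord("qu1ck", 0.95),
    OcrWord("brown", 0.55),
    OcrWord("fox.", 0.97),
]
dictionary = {"the", "quick", "brown", "fox"}

suspect_texts = [w.text for w in find_suspects(words, dictionary)]
print(suspect_texts)
```

Here "qu1ck" is flagged because it is not in the dictionary, and "brown" because of its low score, which is roughly the kind of list you then step through and correct by hand.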

You can choose page ranges to limit OCR (e.g. if you have a multilingual document), but you can't limit it to a selection.

Given that this is such a useful feature, it's disappointing that Adobe don't make it very user-friendly.

Edit: Two other possible solutions.

Adobe Acrobat using ClearScan

When you perform OCR with Adobe Acrobat you can change the PDF Output Style from the default Searchable Image format to ClearScan. This format will actually change the image as well, replacing characters with outlines derived from the OCR. This would both make your PDF more readable and add a text layer, but it does change the original image.

Infix PDF Editor

This program does seem to be able to display the text layer, but fixing the places where Adobe's OCR goes wrong still seems tricky (e.g. lone words placed in their own positioned paragraph).

Sadly none of these options are freely available.