How to create PDF with scanned pages but selectable text?

This has (contrary to some other answers here) most probably nothing to do with Acrobat at all.

Most (all?!) professional document scanners and most semi-professional ones will automatically perform OCR when you choose "Save as PDF" and have the "searchable" checkbox ticked in the settings. The cheaper "consumer grade" models will do the OCR on the attached PC, while typical network scanners do it internally.

The word "searchable" means nothing more and nothing less than that the scanner will perform OCR, then generate a page with the scanned bitmaps within, and overlay them with invisible characters from the OCR, each placed over the respective character on the bitmap.

That way, you can search, and also select, copy, and paste the "bitmap" as if by magic. It's no magic at all, however. In reality, you're just copying invisible text.
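To make the "invisible characters" idea concrete, here is a minimal sketch in Python (standard library only) that writes a one-page PDF whose text uses text rendering mode 3, which PDF defines as "neither fill nor stroke", i.e. invisible. The function name `build_searchable_page` and the word list are made up for illustration. A real scanner would also draw the scanned bitmap on the page and scale each word so it sits exactly over its image, which this sketch leaves out.

```python
def build_searchable_page(words, width=612, height=792):
    """Return the bytes of a one-page PDF whose only text is invisible.

    `words` is a list of (text, x, y, font_size) tuples standing in for
    OCR output. Hypothetical helper: no bitmap is drawn here.
    """
    def esc(s):
        # Escape the characters that are special inside a PDF string literal.
        return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

    ops = ["BT", "3 Tr"]  # 3 Tr = text rendering mode 3 (invisible text)
    for text, x, y, size in words:
        ops.append(f"/F1 {size} Tf")        # select font and size
        ops.append(f"1 0 0 1 {x} {y} Tm")   # position the word on the page
        ops.append(f"({esc(text)}) Tj")     # "show" the (invisible) word
    ops.append("ET")
    content = "\n".join(ops).encode("latin-1")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {width} {height}] "
         "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>").encode(),
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object for the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each xref entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

If you write the returned bytes to a file and open it, the page should look blank, yet searching for or selecting the word should still work in most viewers, which is exactly the effect described above.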

The scanner may also do some additional magic, such as compositing the large image from many small tiles that are reused wherever they match. This results in a much smaller document than would otherwise be possible, but may also lead to funny surprises (not so funny if they happen to you!) such as the "Xerox alters your bills" story, which ironically can happen even when no OCR is done, depending on the firmware.
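The tile reuse, and why it can go wrong, can be illustrated with a toy sketch. This is not any scanner's actual algorithm; the 3x5 glyph bitmaps, the function names and the `max_diff` threshold are all invented here just to show the principle: if "similar enough" tiles share one dictionary entry, a 6 and an 8 can end up as the same tile.

```python
# Toy 3x5 glyph bitmaps, flattened row by row ('1' = black pixel).
SIX = "111" "100" "111" "101" "111"
EIGHT = "111" "101" "111" "101" "111"


def pixel_diff(a, b):
    """Count the pixels in which two equally sized tiles differ."""
    return sum(x != y for x, y in zip(a, b))


def compress_tiles(tiles, max_diff=0):
    """Store each distinct tile once and return (dictionary, indices).

    Tiles differing from a stored tile by at most `max_diff` pixels are
    treated as identical; this is the lossy, risky part.
    """
    dictionary, indices = [], []
    for tile in tiles:
        for i, known in enumerate(dictionary):
            if pixel_diff(tile, known) <= max_diff:
                indices.append(i)  # reuse an existing tile
                break
        else:
            dictionary.append(tile)  # first time this shape is seen
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decompress(dictionary, indices):
    """Rebuild the page from the tile dictionary."""
    return [dictionary[i] for i in indices]


page = [SIX, EIGHT, SIX]

# Exact matching is lossless: two dictionary entries, the page survives intact.
exact = compress_tiles(page, max_diff=0)

# Tolerating one differing pixel saves a tile but turns the 8 into a 6.
lossy = compress_tiles(page, max_diff=1)
```

With exact matching the decompressed page equals the original. With `max_diff=1` the dictionary shrinks to a single tile and every digit comes back as a 6, with no OCR involved anywhere.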


But how is this possible?

Basically, a program performs OCR on the input file and then it places an invisible layer of text over the picture. Alternatively, it might also place a visible layer of text under the picture, giving the same effect.

When you select something, the picture doesn't matter because the text layer gets selected.

How can this be created?

There are several ways. Given that Acrobat has already been suggested, I will add some free options (and luckily you are not forced to have Windows to use them).

PDF-XChange Viewer

This is a native Windows program by Tracker Software. The freeware version runs fine under Wine if you use the 32-bit edition in a 32-bit prefix, so you can use it on Windows, macOS and Linux. In the last two cases, you would need PlayOnMac or PlayOnLinux respectively.

Here's a picture from this answer I left on Ask Ubuntu:

Screenshot of PDF-XChange Viewer under Wine

OCRmyPDF

This is a multiplatform program written in Python, based on Ghostscript, Tesseract and Unpaper. From the docs:

What OCRmyPDF does

OCRmyPDF analyzes each page of a PDF to determine the colorspace and resolution (DPI) needed to capture all of the information on that page without losing content. It uses Ghostscript to rasterize the page, and then performs OCR on the rasterized image to create an OCR “layer”. The layer is then grafted back onto the original PDF.

It can be easily installed on Debian and Ubuntu derivatives:

apt-get install ocrmypdf

Or on macOS:

brew tap jbarlow83/ocrmypdf
brew install ocrmypdf

On Windows you would need to use the Docker image. See the official docs for details.

Usage is very simple, and I suggest you use the optional -d (deskew) and -c (clean) parameters for better results. These straighten every page and clean up small dots/imperfections before the OCR process runs.

You can (and should) provide the language with -l.

Here's an example taken from this skewed document written in Italian:

Example for OCRmyPDF

The command I used was:

ocrmypdf -l ita -d -c input.pdf output.pdf
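If you have a whole folder of scans, the same command can be scripted. The following is a rough sketch in Python that just builds the command line shown above (using the long spellings `--deskew` and `--clean` of `-d` and `-c`) and runs it once per file. The helper names `ocrmypdf_command` and `ocr_folder` and the folder layout are my own assumptions, and it expects the `ocrmypdf` executable to be on your PATH.

```python
import subprocess
from pathlib import Path


def ocrmypdf_command(src, dst, language="ita", deskew=True, clean=True):
    """Build the argument list for one ocrmypdf run (hypothetical helper)."""
    cmd = ["ocrmypdf", "-l", language]
    if deskew:
        cmd.append("--deskew")  # same as -d: straighten the page
    if clean:
        cmd.append("--clean")   # same as -c: clean up specks before OCR
    cmd += [str(src), str(dst)]
    return cmd


def ocr_folder(in_dir, out_dir, **options):
    """Run ocrmypdf on every PDF in `in_dir`, writing results to `out_dir`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.pdf")):
        # check=True raises CalledProcessError if ocrmypdf exits with an error.
        subprocess.run(ocrmypdf_command(src, out_dir / src.name, **options),
                       check=True)
```

For example, `ocr_folder("scans", "searchable", language="ita")` would process everything in a `scans` directory. Building the argument list separately from running it makes it easy to print and check the command before running it on a large batch.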

Online tools

There are a few online tools that do the same. Notably, PDF24 hosts a free web-based version of OCRmyPDF that can be used without limitations.

See also:

  • ocr.space
  • Cvision online OCR
  • LeadTools JS based demo with OCR

This is possibly because of an Acrobat OCR feature:

Acrobat can recognize text in any PDF or image file in dozens of languages. All you have to do is open the scanned document or image that you'd like to OCR, then click the blue Tools button in the top right of the toolbar. In that sidebar, select the Recognize Text tab, then click the In This File button.

...

With the text recognized, you can now markup the PDF using all the normal markup tools — you can highlight, cross out text, and more. You can even copy the text with the detected formatting, though that's often less accurate than the text recognition itself.


From Adobe's website

Recognize text in a Scanned PDF file

When you scan paper documents to PDF, you’re really just taking pictures of those documents. That’s great for photos and other printed images, but what if you’ve got a 200-page document in which you need to find a particular word or phrase? Use Acrobat to recognize the text in that scanned file, making the text content searchable and usable.

  1. With your scanned document open in Acrobat, open up the Tools pane and expand the Text Recognition panel. If you can’t see “Text Recognition” in the Tools pane, you can add it by selecting the menu in the upper right corner (image below – see where that little red arrow is pointing? Click there).
  2. Click on “In This File” to scan the document you’ve got open. You can just accept the default settings and click “Okay” when the Recognize Text box pops up. Acrobat will convert the image into usable text; to test it out, just try editing a word or sentence with the Content Editing panel. Isn’t that awesome!?