Get printer-ready black text on white background in scanned pdf files (remove grayscale or color background)

"How can I turn photos of paper documents into a scanned document?" is related, but not the same, as I'm talking about pdf files. The image processing in the answers to the linked question seems complicated, especially because it involves processing each image separately: given that my pdf has hundreds of pages, the solution I expect is not to process/edit images one by one, but simply to "scan" digital photos and documents the way real ones are scanned. I mean something like a "virtual scanner" whose input would be a photo-based pdf or a collection of photos and whose output a "normal" scanned document. (Also, the recommended Scantailor tool - also here - seems to lack a Linux version now.)


This is not about OCR and not about converting image to text.

To clarify what I mean I will post a few examples.

There are pdf files based on text, not images: text documents (say, docx or odt) exported to pdf. They look ready to be printed:

[image: a text-based pdf, exported from a word processor]

The above is not what I discuss here.

What I'm interested in are the pdfs in the images below, namely the difference between scanned text pages that look too much like images and scanned text pages that look like digitized text.

The first kind is made up of images that look like photos of book pages:

[image: photo-like scan of a book page]

or

[image: another photo-like scan of a book page]

Such copies can hardly be re-printed on paper, as the background will be printed too.

The second ones are what one would expect from scanned text, and can be printed:

[image: clean black-on-white scanned page]

or

[image: another clean black-on-white scanned page]

The picture-like pdf may already be OCR-processed and its text searchable, and still look like a collection of (page) photos: OCR is not the problem here.

What I want is the clear black-on-white look of the "scanned" pdf and the removal of all the "real" details (especially shadows) that are normal in a photo but should be absent in a printed page.


As @vanadium noted in a comment, I am looking for a software solution that automatically cleans up pictures of a document, much like the Google scan feature on a smartphone.

As @user535733 said in a comment, the problem here seems to be, at least to some extent, that of converting the greyscale (scanned/image) text to black-and-white.
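To illustrate that last point with a minimal sketch (assuming ImageMagick is installed; page.png is a hypothetical page image extracted from such a pdf), converting a grayscale page photo to pure black and white is essentially a thresholding operation:

# turn everything darker than ~60% gray into black, the rest into white
convert page.png -colorspace Gray -threshold 60% page_bw.png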


Solution 1:

Scan Tailor is not maintained anymore, but you can still build it from source and use it.

However, the original repository needs Qt 4, which is not easily installable in recent Ubuntu versions. You can use e.g. this fork, which has been adapted to Qt 5.

Prerequisites:

sudo apt install libjpeg-dev zlib1g-dev libpng-dev libtiff-dev libboost-dev libxrender-dev libboost-all-dev
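Depending on the fork, the CMake and Qt 5 development packages may also be needed (an assumption on my part, not part of the fork's own instructions):

sudo apt install cmake qtbase5-dev libqt5svg5-dev   # assumed extra build dependencies for a Qt 5 fork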

Installation:

git clone https://github.com/victl/scantailor
cd scantailor
cmake .
make
sudo make install

Disclaimer: I don't know the maintainer of this fork, and cannot say anything about the safety of his version.


Another option would be to use Scantailor Advanced. You can install it via snap ...

sudo snap install scantailor-advanced

... or flatpak.

... or via ppa.

sudo add-apt-repository ppa:alex-p/scantailor
sudo apt update
sudo apt install scantailor # or scantailor-advanced

Quick test:

[image: quick test result in Scan Tailor]

Solution 2:

As a direct solution on PDF (no manual image extraction):

While using ocrmypdf to restore OCR (as mentioned at the end of the complementary part of this answer), I noticed that ocrmypdf -h lists an option that sounds like exactly what is asked for:

--remove-background Attempt to remove background from gray or color pages, setting it to white

The initial pdf already had OCR, which causes an error unless one of the following options is used:

-f, --force-ocr Rasterize any text or vector objects on each page, apply OCR, and save the rastered output (this rewrites the PDF)

or

-s, --skip-text Skip OCR on any pages that already contain text, but include the page in final output; useful for PDFs that contain a mix of images, text pages, and/or previously OCRed pages

Applying either option separately to one of my large files (hundreds of pages, already OCRed) crashed the process.

The best solution seems to me to first print the initial file to pdf (which removes the OCR layer), and then run

ocrmypdf input.pdf output.pdf -l <LANG> --remove-background -v

For English, the -l option is not needed. The -v option prints verbose details in the terminal.

The resulting pdf is larger than the input (because of the --remove-background option); reduce the size as described below.
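If there are many files to clean up, a small shell loop can apply the same treatment to each of them. A minimal sketch, assuming the pdfs sit in the current folder and have already been "printed to pdf" so they carry no text layer (the _clean suffix is just an example):

# run ocrmypdf with background removal on every pdf in the current folder
for f in *.pdf; do
    ocrmypdf "$f" "${f%.pdf}_clean.pdf" --remove-background
done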


About Scan Tailor, as a complement to the main answer

Even its icon illustrates the fact that it is intended exactly for what is asked here:

[image: the Scan Tailor icon]

Here is how to use Scan Tailor with pdfs:

  1. Extract all pdf pages as image files, because this tool doesn't process pdf directly and needs images. Master PDF Editor can do this, but on my machine it crashes after extracting about 80 images; it can still be used by setting a new batch/range of pages to be extracted. (PDF Mod crashed before any processing.) What I prefer after a few trials is a reliable, albeit slower, CLI method, with a command like pdftoppm MY_PDF.pdf NAME -tiff - as said here. Other formats can be used instead of -tiff (which gives tif files), for example -png or -jpeg. Here is a set of Dolphin service menu actions for the various extraction options:
[Desktop Entry]
Type=Service
ServiceTypes=KonqPopupMenu/Plugin
MimeType=application/pdf;
Actions=pdf;tif;jpeg;
X-KDE-Submenu=PDF action: EXTRACT ALL pages
Icon=application-pdf

[Desktop Action pdf]
Name=Extract pages as pdf
Icon=application-pdf
Exec=bash -c 'pdf=$(pdftk "%u" burst); kdialog --title "Extract pages" --msgbox "Extracted! $pdf";';

[Desktop Action tif]
Name=Extract pages as tif
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(pdftoppm "$f" "${f%%.*}" -tiff); kdialog --title "Extract pages" --msgbox "Extracted! $pdf";';


[Desktop Action jpeg]
Name=Extract pages as jpeg
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(pdftoppm "$f" "${f%%.*}" -jpeg); kdialog --title "Extract pages" --msgbox "Extracted! $pdf";';
  2. Load and process the resulting images in Scan Tailor. Put the resulting image files in a separate folder and add that folder under New Project > Input Directory in Scan Tailor. (I have installed the program from the PPA, as said in a comment by @N0rbert under the main answer.) Some pages containing real images rather than text might look better if "Grayscale and Color" is selected for them instead of the default "Black and white" (which is meant for text). Run the listed stages one by one, and check the pages before running the last one ("Output").

[image: Scan Tailor processing the extracted pages]

  3. Create a new pdf out of the resulting images. (First check that the resulting tif files are as you want them.) There are many ways to create a new pdf. Again, the GUI tools I've tried crashed quickly or gave odd results, so I prefer to put the resulting tif files in a separate folder and run there the command img2pdf *.tif -o out.pdf - as said here. (This may need proper naming/numbering of the files, as sketched below; more on that here.)
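About the naming: pdftoppm usually zero-pads the page numbers by itself, but if the extracted files come out as NAME-1.tif, NAME-2.tif, ... NAME-120.tif, the shell glob will not list them in page order. A small renaming sketch (NAME is whatever prefix you passed to pdftoppm):

# zero-pad the page numbers so that NAME-*.tif sorts in page order
for f in NAME-*.tif; do
    n=${f#NAME-}; n=${n%.tif}
    mv -n "$f" "NAME-$(printf '%04d' "$((10#$n))").tif"
done
img2pdf NAME-*.tif -o out.pdf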

The resulting "tailored" pdf will be smaller than the initial one, but the percentage of the size reduction varies depending on factors that I ignore (but I imagine that the pages contained in the initial pdf should be extracted — at step 1 — in the format they already have; I think jpeg and tif should be used instead of png; use pdfimages -list your.pdf in terminal to see details on format, dpi and other details before processing with the commands above and below).

The final pdf can be further reduced with a command like:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
-dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

More details on that here.

Here is a set of Dolphin service menu actions based on the above link:

[Desktop Entry]
Type=Service
ServiceTypes=KonqPopupMenu/Plugin
MimeType=application/pdf;
Actions=shrink;shrink0;shrink1;shrink2;
X-KDE-Submenu=PDF action: SHRINK
Icon=application-pdf

[Desktop Action shrink]
Name=Shrink pdf to "printer" size, 300dpi
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(gs -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/printer    -sOutputFile="${f%.pdf}_printer.pdf" "$f"); kdialog --title "Shrink" --msgbox "Done! $pdf";';

[Desktop Action shrink0]
Name=Shrink pdf to "prepress" size, 300dpi
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(gs -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress    -sOutputFile="${f%.pdf}_prepress.pdf" "$f"); kdialog --title "Shrink" --msgbox "Done! $pdf";';


[Desktop Action shrink1]
Name=Shrink pdf to "ebook size, 150dpi
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(gs -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook    -sOutputFile="${f%.pdf}_small.pdf" "$f"); kdialog --title "Shrink" --msgbox "Done! $pdf";';

[Desktop Action shrink2]
Name=Shrink pdf to "screen" size, 72dpi
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(gs -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/screen    -sOutputFile="${f%.pdf}_smaller.pdf" "$f"); kdialog --title "Shrink" --msgbox "Done! $pdf";';

I got some help from this answer too.


OCR (text search and copy capability) is lost during the above procedure, if it was present in the initial pdf. To restore OCR, use ocrmypdf input.pdf output.pdf for English, as said here. For other languages, look for the corresponding packages with apt-cache search tesseract-ocr and install them, then add -l <LANG> at the end of the command; more here; see the language names also here.
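For example, for Romanian (a sketch; the package name follows the usual tesseract-ocr-<code> pattern):

apt-cache search tesseract-ocr | grep -i romanian   # find the language package
sudo apt install tesseract-ocr-ron                  # install Romanian language data
ocrmypdf input.pdf output.pdf -l ron                # OCR with Romanian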

Here is a Dolphin service menu entry for Romanian OCR with two actions (one showing progress in a terminal but with a fixed output name, the other running as a background process but with the output name based on the input; I would like to have both the progress in a terminal and the output name based on the input, but don't know how; if someone can do it, please post here!). For English, replace "Romanian" and remove the -l ron option:

[Desktop Entry]
Type=Service
ServiceTypes=KonqPopupMenu/Plugin
MimeType=application/pdf;
Actions=ocr1;ocr2;
X-KDE-Submenu=PDF action: apply OCR
Icon=application-pdf

[Desktop Action ocr1]
Name=Apply OCR Romanian (see progress in terminal; output name: ocr_ro.pdf!)
Icon=application-pdf
Exec=konsole --noclose -e ocrmypdf "%u" ocr_ro.pdf -l ron

[Desktop Action ocr2]
Name=Apply OCR Romanian (background process: NO terminal! input>output name)
Icon=application-pdf
Exec=bash -c 'f="%u"; ocrmypdf "$f" "${f%.pdf}_ocr.pdf" -l ron;'
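An untested sketch that tries to combine the two (terminal output and input-based output name), following the same pattern as the actions above; to use it, also append ocr3; to the Actions= line:

[Desktop Action ocr3]
Name=Apply OCR Romanian (terminal AND input>output name - untested)
Icon=application-pdf
Exec=konsole --noclose -e bash -c 'f="%u"; ocrmypdf "$f" "${f%.pdf}_ocr.pdf" -l ron'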

(Extracting and processing the images, as well as "printing to pdf", removes OCR, but reducing the size with ghostscript as above does not, so the "shrinking" can be applied before or after the OCR.)

Solution 3:

I got pretty good results using ImageMagick and the following script: http://www.fmwconcepts.com/imagemagick/shadowhighlight/index.php

Here is the result using the following parameters:

./shadowhighlight -ma 100 -sa 100 -ha 00 -hw 0 -bc 20 inputFile.png OutputFile.png

[image: result of the shadowhighlight command above]
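For a pdf with many pages, the same script could be run in a loop over the extracted page images (a sketch using the same parameters as above; the cleaned_ prefix is just an example):

# apply the same shadow/highlight settings to every extracted page image
for f in *.png; do
    ./shadowhighlight -ma 100 -sa 100 -ha 00 -hw 0 -bc 20 "$f" "cleaned_$f"
done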