How can I remove the gray-scale page background of a PDF document scan while preserving the text? (Binarization)

My PDF contains 600 pages with images of text. It has 2 layers.

  • Layer 1: Background colour image

  • Layer 2: Text image

I would like to remove the background image layer from every page of the PDF file, as shown in the image.

enter image description here

Could you suggest any software or tool for this?



Solution 1:

Overview

What you are looking for are tools like Scan Tailor and unpaper, which are capable of thresholding, despeckling, and noise removal. Both tools work with images rather than PDF files, but you can easily convert between PDF and the image formats these applications use with the tools described at the end of this answer.

ScanTailor

You can find a video tutorial here. More extensive documentation is available on the official wiki. You will probably be most interested in the page on black and white output mode and filter settings.

Note: This project has not been maintained since ~2016. Check ScanTailor Advanced instead.

Unpaper

I haven't worked with unpaper myself yet. From what I understand, it has far more features than ScanTailor, but it is also much harder to master.

There is no GUI, so you will have to rely on command-line switches to get your work done. On the other hand, this means that conversions with unpaper can easily be automated using scripts.

You can find some scripting examples concerning converting a scan to black and white and removing the background here.

Installation

This command will install all of the tools mentioned above:

sudo apt-get install scantailor unpaper poppler-utils libtiff-tools
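
If you want to confirm that the tools actually ended up on your PATH, a small check along these lines might help. This is only a sketch: missing_tools is a hypothetical helper name, and the command names in the example call are my assumption about what the packages above ship.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds when the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example (command names assumed, adjust to your system):
# missing_tools scantailor unpaper pdfimages tiffcp tiff2pdf
```

An empty result means every listed command was found.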

Helpful tools when working with unpaper and ScanTailor

I don't have enough time to write up a full tutorial on ScanTailor and unpaper (see note 1), but here are some pointers on converting between .pdf and the image formats supported by these tools:

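  • You can use pdfimages to extract the images embedded in a PDF document as .ppm (or .pbm) files, which can be read by unpaper. It takes a single PDF file and an output file name prefix, so the example below writes files such as ./extracted-images-000.ppm.

    Usage example:

    pdfimages Document.pdf ./extracted-images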
    
  • ScanTailor doesn't take .ppm files as input. You will have to convert them to another format first, such as lossless .png. mogrify from the ImageMagick tool suite can do this for you.

    Usage example:

    mogrify -format png *.ppm
    
  • ScanTailor writes its output as single-page TIFF files. unpaper writes PNM files (.pbm, .pgm, or .ppm), which mogrify can convert to TIFF in the same way as above. To convert the TIFF pages back to .pdf, I would suggest using tiffcp and tiff2pdf.

    Usage example:

    tiffcp *.tiff all.tiff
    tiff2pdf -F -p A4 -z -o Document.pdf all.tiff
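
The pointers above can be chained into one script. The following is a rough sketch rather than a tested recipe: clean_scan is a hypothetical function name, unpaper is run with its default filters only (you will probably want to tune its switches), and the temporary directory is left behind if a step fails. Keep in mind that pdfimages writes one file per embedded image, so a PDF with several image layers per page yields several files per page.

```shell
#!/bin/sh
# Sketch: PDF -> extracted images -> unpaper -> multi-page TIFF -> PDF.
# Usage: clean_scan input.pdf output.pdf
clean_scan() {
    in_pdf=$1
    out_pdf=$2
    work=$(mktemp -d) || return 1

    # 1. Extract embedded images; "$work/page" is the output file name prefix.
    pdfimages "$in_pdf" "$work/page" || return 1

    # 2. Clean each extracted image with unpaper's default filters.
    for img in "$work"/page-*.p?m; do
        [ -e "$img" ] || continue
        unpaper "$img" "$work/clean-$(basename "$img")" || return 1
    done

    # 3. unpaper writes PNM files, so create .tiff copies of them.
    mogrify -format tiff "$work"/clean-* || return 1

    # 4. Join the pages and wrap them in an A4 PDF, as in the commands above.
    tiffcp "$work"/clean-*.tiff "$work/all.tiff" || return 1
    tiff2pdf -F -p A4 -z -o "$out_pdf" "$work/all.tiff" || return 1

    rm -rf "$work"
}

# Example:
# clean_scan Document.pdf Document-clean.pdf
```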
    

1: To anyone reading this, please feel free to compile a more extensive answer based on ScanTailor and/or unpaper.

Solution 2:

gscan2pdf

I just found a very simple solution:

  1. Install gscan2pdf.
  2. Open gscan2pdf, and import the PDF.
  3. Tools->threshold. The default of 80% worked fine for me.

    Threshold: Changes all pixels darker than the given value to black; all others become white.

  4. Save the PDF in another location.
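
If you would rather script the threshold step than click through the GUI, the same kind of operation can be sketched with ImageMagick. This is an assumption on my part, not something gscan2pdf documents as its method: threshold_image is a hypothetical helper, it expects an IMv6-style convert command, and it works on a single page image rather than on the PDF.

```shell
#!/bin/sh
# Usage: threshold_image input.png output.png [percent]
threshold_image() {
    src=$1
    dst=$2
    level=${3:-80}   # default mirrors the 80% that worked above
    # Convert to grayscale first, then force every pixel to black or white:
    # pixels brighter than the threshold become white, the rest black.
    convert "$src" -colorspace Gray -threshold "${level}%" "$dst"
}

# Example:
# threshold_image page-000.png page-000-bw.png 80
```

Lowering the percentage turns more mid-gray pixels white, which removes more background but can also thin out faint text.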

Solution 3:

OCRmyPDF

I tried OCRmyPDF to remove the gray background color, and it worked for me. The command I used:

ocrmypdf --use-threads \
         --remove-background \
         -v2 \
         --force-ocr \
         --optimize 3 \
         --output-type pdf \
         in.pdf out.pdf

Solution 4:

Imagemagick

You can use convert, or mogrify to edit the file in place. A single command is enough:

mogrify -type grayscale -gamma "2" -normalize -contrast -contrast -contrast foo.pdf

This changes the mode to grayscale, sets the gamma correction to 2, normalizes the image (see also -auto-level and -equalize; -normalize is equivalent to -contrast-stretch 0.15x0.05%), and adds some contrast. The result with your example:

screenshot
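
Since mogrify overwrites foo.pdf, you may prefer a variant that writes to a new file. Here is a sketch using convert with the same operators; enhance_pdf is a hypothetical name, and the -density value of 300 is my assumption (ImageMagick rasterizes PDF input at 72 dpi unless told otherwise, which is usually too coarse for scanned text).

```shell
#!/bin/sh
# Usage: enhance_pdf input.pdf output.pdf [dpi]
enhance_pdf() {
    src=$1
    dst=$2
    dpi=${3:-300}   # assumed rasterization resolution, not from the answer
    # -density must come before the input so it applies while reading the PDF.
    convert -density "$dpi" "$src" \
        -type grayscale -gamma 2 -normalize \
        -contrast -contrast -contrast \
        "$dst"
}

# Example:
# enhance_pdf foo.pdf foo-clean.pdf
```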

This is with IMv6. With IMv7, all of these utilities are invoked through magick. A quote from https://imagemagick.org/script/porting.php#cli:

animate, compare, composite, conjure, convert, display, identify, import, mogrify, montage, stream

To reduce the footprint of the command-line utilities, these utilities are symbolic links to the magick utility. You can also invoke them from the magick utility, for example, use magick convert logo: logo.png to invoke the magick utility.
[...]
With the IMv7 parser, activated by the magick utility, settings are applied to each image in memory in turn (if any). While an option: only need to be applied once globally. Using the other utilities directly, or as an argument to the magick CLI (e.g. magick convert) utilizes the legacy parser.
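
If a script has to run on machines with either version, one possible approach is to pick the entry point at run time. This is a sketch only; im_command is a hypothetical helper name.

```shell
#!/bin/sh
# Print the ImageMagick entry point available on this system.
im_command() {
    if command -v magick >/dev/null 2>&1; then
        echo magick        # IMv7
    elif command -v convert >/dev/null 2>&1; then
        echo convert       # IMv6
    else
        echo "ImageMagick not found" >&2
        return 1
    fi
}

# Example:
# "$(im_command)" page.png -threshold 60% page-bw.png
```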

Check also:

  • -threshold 60%, -trim, -background, -level 10%,90% -sharpen 0x1.

  • pdfsandwich: Tool to generate "sandwich" OCR pdf files.

  • Reduce PDF size:

    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
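
The /screen preset downsamples images the most aggressively, which can hurt the legibility of scanned text. A small wrapper makes it easy to try the other standard presets; shrink_pdf is a hypothetical name, and which preset suits your scan is something you will have to test.

```shell
#!/bin/sh
# Usage: shrink_pdf input.pdf output.pdf [preset]
# preset: screen (smallest files), ebook, printer, or prepress
shrink_pdf() {
    src=$1
    dst=$2
    preset=${3:-screen}   # default matches the command above
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
       -dPDFSETTINGS="/$preset" \
       -dNOPAUSE -dQUIET -dBATCH \
       -sOutputFile="$dst" "$src"
}

# Example:
# shrink_pdf input.pdf output.pdf ebook
```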