How can I remove the gray-scale page background of a PDF document scan while preserving the text? (Binarization)
My PDF contains 600 pages with images of text. It has 2 layers.
Layer 1: Background colour image
Layer 2: Text image
I would like to remove the background image layer from every page of the PDF file, as shown in the image.
Could you suggest any software or tool?
Solution 1:
Overview
What you are looking for are tools like Scan Tailor and unpaper that are capable of Thresholding, Despeckling, and Noise Removal. Both tools work with images rather than PDF files but you can easily convert between the different formats these applications use and PDF by using the tools described at the end of this answer.
ScanTailor
You can find a video tutorial here. More extensive documentation is available on the official wiki. You will probably be most interested in the page on black and white output mode and filter settings.
Note: This project has not been maintained since around 2016. Check ScanTailor Advanced instead.
Unpaper
I haven't worked with unpaper myself yet. From what I understand, it has far more features than ScanTailor, but it is also much harder to master.
There is no GUI, so you have to rely on command-line switches to get your work done. On the other hand, this means that conversions with unpaper can easily be automated with scripts.
You can find some scripting examples concerning converting a scan to black and white and removing the background here.
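As a rough illustration of that kind of automation, here is a minimal sketch that runs unpaper over every page image in a directory. The function name clean_pages and the directory arguments are hypothetical, and the two switches (--overwrite and --type pbm) are used as I understand them from the unpaper manual, so check `unpaper --help` on your version before relying on them.

```shell
#!/bin/sh
# Sketch: batch-clean extracted page images with unpaper.
# clean_pages and its two directory arguments are hypothetical names.
clean_pages() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for page in "$src_dir"/*.ppm; do
        # Skip the unexpanded pattern when the directory holds no .ppm files.
        [ -e "$page" ] || continue
        name=$(basename "$page" .ppm)
        # --overwrite lets a rerun replace earlier output;
        # --type pbm asks for 1-bit black-and-white output (assumed from the manual).
        unpaper --overwrite --type pbm "$page" "$out_dir/$name.pbm" || return 1
    done
}
```

Usage would look like `clean_pages ./extracted ./cleaned`. Writing to a separate output directory keeps the original page images untouched in case the result needs tuning.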
Installation
This command will install all of the tools mentioned above:
sudo apt-get install scantailor unpaper poppler-utils libtiff-tools
Helpful tools when working with unpaper and ScanTailor
I don't have enough time to write up a full tutorial on ScanTailor and unpaper¹, but here are some pointers on converting between .pdf and the image formats supported by these tools:
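- You can use pdfimages to convert PDF documents to single-page .ppm files, which can be read by unpaper. pdfimages takes a single PDF, so the glob should match exactly one file; the last argument is the filename prefix for the extracted images. Usage example:
  pdfimages *.pdf ./extracted-images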
- ScanTailor doesn't take .ppm files as input. You will have to convert them to another format first, such as the lossless .png. mogrify from the ImageMagick tool suite can do this for you. Usage example:
  mogrify -format png *.ppm
- ScanTailor and unpaper output single-page .tiff files. To convert them back to .pdf, I would suggest using tiffcp and tiff2pdf. Usage example:
  tiffcp *.tiff all.tiff
  tiff2pdf -F -p A4 -z -o Document.pdf all.tiff
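The conversion steps in this list can be wrapped into two small helpers, one for each side of the ScanTailor run. This is only a sketch: the function names and directory layout are hypothetical, the commands and flags are the ones from the usage examples, and the .tiff extension is taken from the text (adjust it if your output files end in .tif).

```shell
#!/bin/sh
# Sketch of the conversion steps around ScanTailor.
# pdf_to_png_pages and tiff_pages_to_pdf are hypothetical helper names.

# Extract the page images from one PDF and convert them to PNG.
pdf_to_png_pages() {
    pdf=$1
    work_dir=$2
    mkdir -p "$work_dir" || return 1
    # pdfimages takes one PDF plus a filename prefix for the extracted images.
    pdfimages "$pdf" "$work_dir/page" || return 1
    # mogrify writes a .png next to each .ppm and leaves the .ppm in place.
    mogrify -format png "$work_dir"/page-*.ppm
}

# Merge single-page TIFFs into one multi-page TIFF, then into a PDF.
tiff_pages_to_pdf() {
    tiff_dir=$1
    out_pdf=$2
    merged=$(mktemp) || return 1
    tiffcp "$tiff_dir"/*.tiff "$merged" || { rm -f "$merged"; return 1; }
    tiff2pdf -F -p A4 -z -o "$out_pdf" "$merged"
    status=$?
    rm -f "$merged"
    return "$status"
}
```

A run might look like `pdf_to_png_pages Document.pdf ./work`, then processing ./work in ScanTailor, then `tiff_pages_to_pdf ./work/out Cleaned.pdf`, where ./work/out stands for wherever the processed pages were written. Keeping the merged TIFF in a temporary file avoids it being picked up by the *.tiff glob on a second run.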
1: To anyone reading this, please feel free to compile a more extensive answer based on ScanTailor and/or unpaper.
Solution 2:
gscan2pdf
I just found a very simple solution:
- Install gscan2pdf.
- Open gscan2pdf, and import the PDF.
- Tools->threshold. The default of 80% worked fine for me.
  Threshold: Changes all pixels darker than the given value to black; all others become white.
- Save the PDF in another location.
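To make the threshold rule concrete, here is a small sketch that applies it to a plain-text PGM image (format P2) using awk. It is not how gscan2pdf is implemented; threshold_pgm is a hypothetical name, the input is assumed to contain no comment lines, and treating a pixel exactly at the cut-off as white is my own assumption.

```shell
#!/bin/sh
# Sketch: apply a percentage threshold to a plain PGM (P2) image on stdin.
# threshold_pgm is a hypothetical helper, not part of gscan2pdf.
threshold_pgm() {
    # $1: threshold as a percentage of the maximum grey value (default 80).
    awk -v pct="${1:-80}" '
        # Collect every whitespace-separated token of the file.
        { for (i = 1; i <= NF; i++) tok[n++] = $i }
        END {
            # tok[0] = magic number, tok[1] = width, tok[2] = height, tok[3] = max value
            if (tok[0] != "P2") { print "not a plain PGM" > "/dev/stderr"; exit 1 }
            w = tok[1]; h = tok[2]; max = tok[3]
            cut = max * pct / 100
            printf "P2\n%d %d\n%d\n", w, h, max
            for (p = 0; p < w * h; p++) {
                v = tok[4 + p] + 0
                # Darker than the cut-off becomes black, everything else white.
                out = (v < cut) ? 0 : max
                printf "%d%s", out, ((p + 1) % w == 0) ? "\n" : " "
            }
        }'
}
```

For example, `threshold_pgm 80 < page.pgm > page-bw.pgm` turns a light gray page background (values near the maximum) white while dark text pixels become black.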
Solution 3:
OCRmyPDF
I tried OCRmyPDF to remove the gray background color, and it worked for me. The command I used:
ocrmypdf --use-threads \
--remove-background \
-v2 \
--force-ocr \
--optimize 3 \
--output-type pdf \
in.pdf out.pdf
Solution 4:
Imagemagick
You can use convert or mogrify (in place). Just one command:
mogrify -type grayscale -gamma "2" -normalize -contrast -contrast -contrast foo.pdf
This changes the mode to grayscale, applies a gamma correction of 2, normalizes (check also -auto-level and -equalize; -normalize is equivalent to -contrast-stretch 0.15x0.05%), and adds some contrast. The result with your example:
This is with IMv6. With IMv7, all of them are used via magick. A quote from https://imagemagick.org/script/porting.php#cli:
animate, compare, composite, conjure, convert, display, identify, import, mogrify, montage, stream
To reduce the footprint of the command-line utilities, these utilities are symbolic links to the magick utility. You can also invoke them from the magick utility, for example, use magick convert logo: logo.png to invoke the magick utility.
[...]
With the IMv7 parser, activated by the magick utility, settings are applied to each image in memory in turn (if any). While an option: only need to be applied once globally. Using the other utilities directly, or as an argument to the magick CLI (e.g. magick convert) utilizes the legacy parser.
Check also:
- -threshold 60%, -trim, -background, -level 10%,90% -sharpen 0x1.
- pdfsandwich: Tool to generate "sandwich" OCR pdf files.
- Reduce PDF size:
  gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
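The -level, -sharpen and -threshold options from the first item can be combined into a variant that writes a new file instead of modifying the PDF in place. This is a sketch under assumptions: binarize_pdf and the IM_CMD variable are hypothetical names, the values are simply the ones listed above and will likely need tuning per scan, and -density 300 is my addition so that the PDF pages are rasterized at a higher resolution than ImageMagick's default.

```shell
#!/bin/sh
# Sketch: binarize a scanned PDF into a new file with ImageMagick.
# binarize_pdf and IM_CMD are hypothetical names for illustration.
binarize_pdf() {
    in_pdf=$1
    out_pdf=$2
    # Use IM_CMD if set; otherwise prefer IMv7's "magick" and fall back to "convert".
    im=${IM_CMD:-}
    if [ -z "$im" ]; then
        if command -v magick >/dev/null 2>&1; then im=magick; else im=convert; fi
    fi
    # -density must come before the input so it applies when the PDF is rasterized.
    "$im" -density 300 "$in_pdf" \
        -colorspace Gray -level 10%,90% -sharpen 0x1 -threshold 60% \
        "$out_pdf"
}
```

Usage would be `binarize_pdf foo.pdf foo-bw.pdf`. For a 600-page document this loads every page into memory at once, so processing extracted page images one at a time may be more practical.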