Copy pdf text layer to another pdf

Solution 1:

Here's a simple shell script to do this on the command-line:

Save this as ~/pdf-merge-text.sh (and chmod +x it):

#!/usr/bin/env bash

set -eu

pdf_merge_text() {
    local txtpdf; txtpdf="$1"
    local imgpdf; imgpdf="$2"
    local outpdf; outpdf="${3--}"
    if [ "-" != "${txtpdf}" ] && [ ! -f "${txtpdf}" ]; then echo "error: text PDF does not exist: ${txtpdf}" 1>&2; return 1; fi
    if [ "-" != "${imgpdf}" ] && [ ! -f "${imgpdf}" ]; then echo "error: image PDF does not exist: ${imgpdf}" 1>&2; return 1; fi
    if [ "-" != "${outpdf}" ] && [ -e "${outpdf}" ]; then echo "error: not overwriting existing output file: ${outpdf}" 1>&2; return 1; fi
    (
        local txtonlypdf; txtonlypdf="$(TMPDIR=. mktemp --suffix=.pdf)"
        trap "rm -f -- '${txtonlypdf//'/'\\''}'" EXIT
        gs -o "${txtonlypdf}" -sDEVICE=pdfwrite -dFILTERIMAGE "${txtpdf}"
        pdftk "${txtonlypdf}" multistamp "${imgpdf}" output "${outpdf}"
    )
}

pdf_merge_text "$@"

Now just call it:

~/pdf-merge-text.sh txt.pdf img.pdf out.pdf

The idea is to strip images from the OCR'd PDF, then merge it via the the technique in the answer above.

Solution 2:

This answer on stackoverflow has a solution. You can extract the text with coordinates from your pdf-2 using pdftotext -bbox or the Python package PDFMiner, then write this hidden text into a new PDF with the Python package ReportLab, then merge this hidden-text PDF with your pdf-1 using PDFtk (There's a GUI for Windows at the webpage; the command line for Unix is called PDFtk Server now.)

Or, you could try directly merging pdf-1 and pdf-2 using PDFtk. Run pdftk pdf-2 multistamp pdf-1 output out.pdf. This will put each page of pdf-1 in front of the corresponding page of pdf-2, so you will only see the images from pdf-1 (assuming they are scans, and do not have a transparent background), but the hidden text from pdf-2 will be included. The downside is that this may be very large, since it will include two copies of each page image. I have verified that this works, and the size of the output pdf is the sum of the sizes of the inputs.