How can I programmatically change/protect text contents in a PDF so it's no longer copyable text?
I suppose that your goal is to make it more difficult to select and copy text from a PDF. Because that is the only achievable goal you can put up. (If there is a way to see the PDF pages on screen, then there is a way to somehow get access to the text or image contents, albeit a more difficult one than just to copy'n'paste... I guess you are aware of that.)
You have three options:
- Convert your PDF pages to full-page pixel images and wrap these images again into a multi-page PDF.
- Convert the glyphs of all embedded fonts into vector outlines.
- "Encrypt" your PDF with a user password.
Each of these 3 methods is very easy to apply, given the right tool. :-)
For each of these methods you can use a Free and Open Source Software tool on the command line. (Each of these tools is available for Linux, Mac OSX, Unix or Windows.)
See the more detailed discussion fore each method following below.
1. Create Full-Page Pixel Images (Using ImageMagick's convert
)
You can use ImageMagick's convert
command simply like this:
convert \
pdf-with-fonts.pdf \
pdf-with-images.pdf
ImageMagick can directly only work with raster images, not with any other format. Since it cannot process PDFs directly, it will automatically employ Ghostscript as its delegate. Hence Ghostscript needs to be installed too! Ghostscript will create the raster images needed as input by ImageMagick.
You can observe the process of ImageMagick using Ghostscript as a background process by adding -verbose
switch to your command line.
By default, convert
will use a resolution of 72ppi. This may be not good enough for reading well (but it will be much more difficult to circumvent your 'protection' by applying OCR software to the output.)
Above image shows a screenshot of a pixelized PDF page created with the default resolution (72 PPI) used by ImageMagick at 600% zooming level. If you need a better resolution, say 200 PPI, add the -density 200
parameter to the command line:
convert \
-density 200 \
pdf-with-fonts.pdf \
pdf-with-images.pdf
Above image shows the shot (also at 600% zooming level) of a pixelized PDF page created by ImageMagick with higher resolution of 200 PPI.
Note, when I tested the above command with the default resolution of 72 PPI, a 219-page PDF with all text and a size of 1 MByte resulted in a 23 MByte output file. It took about 2 minutes to generate on a MacBook. The 200ppi PDF resulted in 110 MByte and took 11 minutes to be ready...
Circumvent?
It is easy to circumvent pixelization of pages if the resolution is good enough: OCR will work just fine. With a low resolution it may still be readable (and guessable) for humans, but difficult for machines to come up with good OCR results.
2. Convert All Glyphs to Vector Outlines (Using Ghostscript)
You can use the newest and latest and greatest release of Ghostscript. That is version v9.15. Check your installed version with gs -version
.
The newest version v9.15 includes a new command line parameter, --dNoOutputFonts
. This parameter will convert all glyph shapes into outlines and remove all the embedded fonts:
gs \
-o pdf-with-outlines.pdf \
-sDEVICE=pdfwrite \
pdf-with-fonts.pdf
In my test, the same 219 page PDF (with a size of 1 MByte) converted into an output file of 186 MByte, taking 6 minutes to convert.
The advantage of outlines is that the page's text remains clear and sharp and un-pixelated, and you can zoom into the text at any level without loosing sharpness. You can see this in the next screenshot:
The disadvantage is the file's size. (BTW, I tested the same method, converting text to outlines, with Adobe Acrobat Pro XI, and the resulting file size was 61 MByte, taking 15 minutes of processing.)
Circumvent ?
It is easy to circumvent this measure: OCR will work just fine.
3. Protect PDF by 'encrypting' it (Using qpdf
)
What is not so well known, is that you can 'protect' or ('encrypt') a PDF with empty passwords (the 'user' as well as the 'owner' passwords). This allows all PDF reader/viewer software to open the file without asking for a password, only popping up the password dialog when trying to copy text from a page or when trying to print the file.
QPDF has quite good support for this:
qpdf \
--encrypt "" "" 40 \
--print=n \
--modify=n \
--extract=n \
-- \
uncrypted.pdf \
crypted.pdf
What do all these command options mean?
--encrypt "" "" 40
:
This sets both passwords (user and owner) to the empty string and the key length to 40 bits.--print=n
:
This disables printing of the PDF.--modify=n
:
This disables modification of the PDF.--extract=n
:
This disables text and image extraction of the PDF.--
:
This is required to signal the end of encryption options.
There are more (and different) detailed options available with QPDF if you use a 128 or 256 bit keylength.
Other available options include --modify=[annotate|form|assembly]
which would allow the filling of forms, adding of annotations or assembling the document with other PDFs (while at the same time still disallowing copy'n'paste or print).
This command
qpdf --show-encryption crypt.pdf
Will show the details about any file's 'encryption' settings. Example:
extract for accessibility: not allowed
extract for any purpose: not allowed
print low resolution: not allowed
print high resolution: not allowed
modify document assembly: not allowed
modify forms: allowed
modify annotations: allowed
modify other: not allowed
modify anything: not allowed
BTW: Leaving the password empty in the two dialogs shown above will not help with (most? or all? haven't tested...) PDF viewers. It still won't unlock to copy or print.
The advantage of this method lies in its fast execution and its almost identical file size.
Circumvent?
Of course, it is just as easy to remove the 'encryption' again:
qpdf --decrypt crypted.pdf decrypted.pdf
4. Summary
For fast results, identical file sizes, and easy-to-remove protection against 'casual' selection and copying of text, use 'protection'/'encryption' with an empty password.
For slow results, and potentially huge file sizes (but not always good looking pages) and a bit more-work-to-remove-the-protection, use pixelization for all pages.
For even slower results (but always better looking pages) and also more-work-to-remove-the-protection, use the vector outlining method of all text.
Always be aware, that all these methods do not absolutely protect the contents of your PDF pages. They only make it more inconvenient to extract.
It seems like your best bet is to use ImageMagick to convert the PDF to images of most any kind from the command line. This site provides some decent details on how to use PHP with ImageMagick to automate this task. But on a most fundamental level, this site explains the basics of doing so from the terminal command line:
convert my_great_file.pdf my_great_file.png
To change the output format from PNG to JPEG, just change the file extension of the destination filename like this:
convert my_great_file.pdf my_great_file.jpg
The trick in your case is you need to make sure your hosting service has ImageMagick installed. And if not, would they be willing to install it. In my experience, ImageMagick is such a commonly used tool that it should be there. The only potential “gotcha” is the install was not done with built in PDF support. But that is for you to figure out & deal with.
Also, if you create individual image files from the PDF, it’s easy enough to create a new PDF with just those image files like this:
convert *.jpg my_new_great_file.pdf
Also, it seems that convert
can do direct PDF to PDF conversion with the final product being rasterized by simply doing this:
convert my_great_file.pdf my_new_great_file.pdf
That would produce a PDF with rasterized output at 72 dpi. You can adjust that if need be by using the -density
option like this:
convert -density 144 my_great_file.pdf my_new_great_file.pdf
Which would create a new PDF comprised of rasterized images at 144 dpi.