How to remove images from a PDF file
I've got a rather large (~100MB) PDF document with lots of images in it (as illustrations and background images), and I'd like to have a copy of that pdf without images but I can't find out how to do that.
I'm not talking about converting it to text only, I'd like to keep paragraphs/tables/multi-columns as they are.
I'm comfortable with command line and have several computers with different distributions that I can use.
Solution 1:
The latest releases of Ghostscript can do this too. Just add the parameter -dFILTERIMAGE
to your command.
The are even two more new parameters which can be added in order to selectively remove content types "vector" and "text":
-dFILTERIMAGE
: produces an output where all raster images are removed.-dFILTERTEXT
: produces an output where all text elements are removed.-dFILTERVECTOR
: produces an output where all vector drawings are removed.
Any two of these options can be combined. (If you combine all 3, you'll get all pages getting blanked...)
Examples
Here is the screenshot from an example PDF page which contains all 3 types of content mentioned above:
Screenshot of original PDF page containing "image", "vector" and "text" elements.
Running the following 6 commands will create all 6 possible variations of remaining contents:
gs -o noIMG.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf gs -o noTXT.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf gs -o noVCT.pdf -sDEVICE=pdfwrite -dFILTERVECTOR input.pdf gs -o onlyIMG.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERTEXT input.pdf gs -o onlyTXT.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf gs -o onlyVCT.pdf -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERTEXT input.pdf
The following image illustrates the results:
Top row, from left: all "text" removed; all "images" removed; all "vectors" removed. Bottom row, from left: only "text" kept; only "images" kept; only "vectors" kept.
Solution 2:
cpdf -draft original.pdf -o version_without_images.pdf
It is not in the repositories but you can find a download (pre-compiled or source) on their website.
Manual:
15.1 Draft Documents
The -draft option removes bitmap (photographic) images from a file, so that it can be printed with less ink. Optionally, the -boxes option can be added, filling the spaces left blank with a crossed box denoting where the image was. This is not guaranteed to be fully visible in all cases (the bitmap may be have been partially covered by vector objects or clipped in the original). For example:
cpdf -draft -boxes in.pdf -o out.pdf