How to remove images from a PDF file

I've got a rather large (~100MB) PDF document with lots of images in it (as illustrations and background images), and I'd like to have a copy of that pdf without images but I can't find out how to do that.

I'm not talking about converting it to text only, I'd like to keep paragraphs/tables/multi-columns as they are.

I'm comfortable with command line and have several computers with different distributions that I can use.


Solution 1:

The latest releases of Ghostscript can do this too. Just add the parameter -dFILTERIMAGE to your command.

The are even two more new parameters which can be added in order to selectively remove content types "vector" and "text":

  1. -dFILTERIMAGE: produces an output where all raster images are removed.

  2. -dFILTERTEXT: produces an output where all text elements are removed.

  3. -dFILTERVECTOR: produces an output where all vector drawings are removed.

Any two of these options can be combined. (If you combine all 3, you'll get all pages getting blanked...)

Examples

Here is the screenshot from an example PDF page which contains all 3 types of content mentioned above:

Screenshot of original PDF page containing "image", "vector" and "text" elements.
Screenshot of original PDF page containing "image", "vector" and "text" elements.


Running the following 6 commands will create all 6 possible variations of remaining contents:

 gs -o noIMG.pdf   -sDEVICE=pdfwrite -dFILTERIMAGE                input.pdf
 gs -o noTXT.pdf   -sDEVICE=pdfwrite -dFILTERTEXT                 input.pdf
 gs -o noVCT.pdf   -sDEVICE=pdfwrite -dFILTERVECTOR               input.pdf

 gs -o onlyIMG.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERTEXT  input.pdf
 gs -o onlyTXT.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf
 gs -o onlyVCT.pdf -sDEVICE=pdfwrite -dFILTERIMAGE  -dFILTERTEXT  input.pdf

The following image illustrates the results:


Top row, from left: all "text" removed; all "images" removed; all "vectors" removed. Bottom row, from left: only "text" kept; only "images" kept; only "vectors" kept.
Top row, from left: all "text" removed; all "images" removed; all "vectors" removed. Bottom row, from left: only "text" kept; only "images" kept; only "vectors" kept.


Solution 2:

cpdf -draft original.pdf -o version_without_images.pdf

It is not in the repositories but you can find a download (pre-compiled or source) on their website.


Manual:

15.1 Draft Documents

The -draft option removes bitmap (photographic) images from a file, so that it can be printed with less ink. Optionally, the -boxes option can be added, filling the spaces left blank with a crossed box denoting where the image was. This is not guaranteed to be fully visible in all cases (the bitmap may be have been partially covered by vector objects or clipped in the original). For example:

 cpdf -draft -boxes in.pdf -o out.pdf