How can I remove all images from a PDF?
Meanwhile the latest Ghostscript releases have a much nicer and easier to use method of removing all images from a PDF. The parameter to add to the command line is -dFILTERIMAGE
gs -o noimages.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf
Even better, you can also remove all text or all vector drawing elements from a PDF by specifying -dFILTERTEXT
or -dFILTERVECTOR
.
Of course, you can also combine any combination of these -dFILTER*
parameters you want in order to achieve a required result. (Combining all three will of course result in "empty" pages.)
Here is the screenshot from an example PDF page which contains all 3 types of content mentioned above:
Screenshot of original PDF page containing "image", "vector" and "text" elements.
Running the following 6 commands will create all 6 possible variations of remaining contents:
gs -o noIMG.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf gs -o noTXT.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf gs -o noVCT.pdf -sDEVICE=pdfwrite -dFILTERVECTOR input.pdf gs -o onlyTXT.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf gs -o onlyIMG.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERTEXT input.pdf gs -o onlyVCT.pdf -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERTEXT input.pdf
The following image illustrates the results:
Top row, from left: all "text" removed; all "images" removed; all "vectors" removed. Bottom row, from left: only "text" kept; only "images" kept; only "vectors" kept.
I'm putting up the answer myself, but the actual code is by courtesy of Chris Liddell, Ghostscript developer.
I used his original PostScript code and stripped off its other functions. Only the function which removes raster images remains. Other graphical page objects -- text sections, patterns and vector objects -- should remain untouched.
Copy the following code and save it as remove-images.ps
:
%!PS
% Run as:
%
% gs ..... -dFILTERIMAGE -dDELAYBIND -dWRITESYSTEMDICT \
% ..... remove-images.ps <your-input-file>
%
% derived from Chris Liddell's original 'filter-obs.ps' script
% Adapted by @pdfkungfoo (on Twitter)
currentglobal true setglobal
32 dict begin
/debugprint { systemdict /DUMPDEBUG .knownget { {print flush} if}
{pop} ifelse } bind def
/pushnulldevice {
systemdict exch .knownget not
{
//false
} if
{
gsave
matrix currentmatrix
nulldevice
setmatrix
} if
} bind def
/popnulldevice {
systemdict exch .knownget not
{
//false
} if
{
% this is hacky - some operators clear the current point
% i.e.
{ currentpoint } stopped
{ grestore }
{ grestore moveto} ifelse
} if
} bind def
/sgd {systemdict exch get def} bind def
systemdict begin
/_image /image sgd
/_imagemask /imagemask sgd
/_colorimage /colorimage sgd
/image {
(\nIMAGE\n) //debugprint exec /FILTERIMAGE //pushnulldevice exec
_image
/FILTERIMAGE //popnulldevice exec
} bind def
/imagemask
{
(\nIMAGEMASK\n) //debugprint exec
/FILTERIMAGE //pushnulldevice exec
_imagemask
/FILTERIMAGE //popnulldevice exec
} bind def
/colorimage
{
(\nCOLORIMAGE\n) //debugprint exec
/FILTERIMAGE //pushnulldevice exec
_colorimage
/FILTERIMAGE //popnulldevice exec
} bind def
end
end
.bindnow
setglobal
Now run this command:
gs -o no-more-images-in-sample.pdf \
-sDEVICE=pdfwrite \
-dFILTERIMAGE \
-dDELAYBIND \
-dWRITESYSTEMDICT \
remove-images.ps \
sample.pdf
I tested the code with the official PDF specification, and it worked. The following two screenshots show page 750 of input and output PDFs:
If you wonder why something that looks like an image is still on the output page: it is not really a raster image, but a 'pattern' in the original file, and therefor it is not removed.