How to identify the format of images in a pdf?
I have received a number of pdf files with images in them. The original images have been lost, so I need to extract them. I have Adobe Acrobat Pro, so I extracted them using Advanced > Document Processing > Export All Images (there are four options: jpeg, png, tiff, jpeg2000). However, I'd like to extract them in their original format, which is apparently not jpeg: I also tested pdfimages.exe from xpdf as outlined here, and it produced .ppm files, not jpeg.
So I tried ImageMagick's identify, what it gave me was this:
identify images-000.ppm
images-000.ppm PPM 870x1181 870x1181+0+0 8-bit sRGB 3.082MB 0.000u 0:00.000
Does this indicate it was an embedded .bmp? How to tell? I would actually expect a function in Acrobat to identify the format of images, but I couldn't find it.
So, what is the best way to identify the image format of images in a pdf?
(I prefer extraction via Acrobat because of the batch functionality).
Solution 1:
AFAIK, the Image XObjects embedded inside PDFs do not store any information about the original image format. At most if it's an embedded JPEG it can be extracted as-is, but for all other cases you end up with a PxM image that you'll need to convert.
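What the PDF does record for each stream is the compression filter, and that is often enough to guess the encoding. The following is only a rough sketch, assuming a grep that supports -a and -o (GNU and BSD grep do); the function names count_filters and filter_to_format are made up for illustration. It counts /Filter entries for every stream in the file, not only images, and it misses filters written as an array or as an indirect reference.

```shell
# Map a PDF stream filter name to the image encoding it usually implies.
filter_to_format() {
    case "$1" in
        DCTDecode)      echo "JPEG" ;;
        JPXDecode)      echo "JPEG 2000" ;;
        JBIG2Decode)    echo "JBIG2" ;;
        CCITTFaxDecode) echo "CCITT fax" ;;
        FlateDecode|LZWDecode|RunLengthDecode)
                        echo "raw pixels, lossless compression" ;;
        *)              echo "unknown" ;;
    esac
}

# Count the single-name /Filter entries found in a PDF file.
# -a treats the binary file as text, -o prints only the matching part.
count_filters() {
    grep -a -o '/Filter */[A-Za-z0-9]*' "$1" | sed 's|.*/||' | sort | uniq -c
}
```

Calling count_filters myfile.pdf prints one line per filter name with its count. A file dominated by DCTDecode most likely holds JPEG data that can be extracted unchanged, while FlateDecode streams have no original file format to recover.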
Solution 2:
The picture is in portable pixmap file format. (See Wikipedia: Netpbm format for details).
You can use the netpbm tools to convert these to a more widely supported format such as BMP.
The syntax for that is: ppmtobmp images-000.ppm > images-000.bmp
http://netpbm.sourceforge.net/ is the homepage for netpbm.
Are there multiple images in a document? Or can we just search the PDF for the line with identify images-000.ppm, cut the file from that location and feed it to ppmtobmp? It should not be hard to automate that.
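For the files that pdfimages has already written, the conversion can be automated with a small loop. This is a minimal sketch, assuming ppmtobmp from netpbm is on the PATH; the function names bmp_name and convert_all_ppm are only illustrative.

```shell
# Derive the output name for a .ppm file (images-000.ppm -> images-000.bmp).
bmp_name() {
    printf '%s\n' "${1%.ppm}.bmp"
}

# Convert every .ppm file in a directory (default: current one) to .bmp.
convert_all_ppm() {
    dir=${1:-.}
    for f in "$dir"/*.ppm; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        ppmtobmp "$f" > "$(bmp_name "$f")"
    done
}
```

Running convert_all_ppm in the directory that holds images-000.ppm, images-001.ppm and so on writes a .bmp next to each .ppm file.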
Solution 3:
pdfimages -list
With pdfimages -list myfile.pdf you can read the original encoding in the enc column.
In the example below, taken from a PDF file generated by a scanner with text-only (no color) images at 300 dpi, the enc column reads jbig2:
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2340  1654  gray    1   1  jbig2  no        20  0   283   142 39.7K 8.4%
   2     1 image    2340  1654  gray    1   1  jbig2  no        25  0   283   142 41.5K 8.8%
   3     2 image    2340  1654  gray    1   1  jbig2  no        30  0   283   142 43.1K 9.1%
   4     3 image    2340  1654  gray    1   1  jbig2  no        35  0   283   142 46.9K  10%
In this case the format is jbig2. The manual (help) describes the matching option:
-jbig2
Write images in JBIG2 format as JBIG2 files instead of the default format. JBIG2 data in PDF is of the embedded type. The embedded type of JBIG2 has an optional separate file containing global data. The embedded data is written with the extension .jb2e and the global data (if available) will be written to the same image number with the extension .jb2g. The content of both these files is identical to the JBIG2 data in the PDF.
You can extract them with the command
pdfimages myfile.pdf -jbig2 A
Note: A is the base name for the extracted images.
You will obtain the files A-000.jb2e, A-001.jb2e, ...
For other formats, use the corresponding option: -png, -tiff, ...
Automatic extraction: the -all option
pdfimages MyFile.pdf -all B
-all
Write JPEG, JPEG2000, JBIG2, and CCITT images in their native format. CMYK files are written as TIFF files. All other images are written as PNG files. This is equivalent to specifying the options -png -tiff -j -jp2 -jbig2 -ccitt.
In this case B is the base name for the extracted images.
Note: you may still need to check the -list output to tell which output files kept their native encoding and which were converted to PNG.
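To get a quick overview of the encodings in a file, the enc column of the listing can be tallied. This is a minimal sketch, assuming the column layout of the table shown earlier (two header lines, enc in the ninth field); summarize_enc is a hypothetical helper name, not part of pdfimages.

```shell
# Read `pdfimages -list` output on stdin and count images per encoding.
# Skips the two header lines and takes the 9th whitespace-separated field.
summarize_enc() {
    awk 'NR > 2 && NF >= 9 { count[$9]++ }
         END { for (e in count) print e, count[e] }' | sort
}
```

Usage: pdfimages -list MyFile.pdf | summarize_enc. For the scanner example above this would print a single line, jbig2 4. Rows whose enc is image are, as far as I know, the ones without a native image encoding, so those are the ones -all writes as PNG (or TIFF for CMYK).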