How to identify the format of images in a pdf?

I have received a number of PDF files with images in them. The original images have been lost, so I need to extract them. I have Adobe Acrobat Pro, so I extracted them using Advanced > Document Processing > Export All Images (there are four options: jpeg, png, tiff, jpeg2000). However, I'd like to extract them in their original format, which is apparently not jpeg: I also tested pdfimages.exe from xpdf as outlined here, and it gave .ppm files, not jpeg.

So I tried ImageMagick's identify, which gave me this:

identify images-000.ppm
images-000.ppm PPM 870x1181 870x1181+0+0 8-bit sRGB 3.082MB 0.000u 0:00.000

Does this indicate it was an embedded .bmp? How to tell? I would actually expect a function in Acrobat to identify the format of images, but I couldn't find it.

So, what is the best way to identify the image format of images in a pdf?

(I prefer extraction via Acrobat because of the batch functionality).

Solution 1:

AFAIK, the Image XObjects embedded in a PDF do not record the original image file format. At most, an embedded JPEG can be extracted as-is; in all other cases you end up with a PxM (PBM/PGM/PPM) image that you'll need to convert.
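
What a PDF does record is the stream filter used to compress each image, and some filter names correspond to image codecs (DCTDecode is JPEG, JPXDecode is JPEG 2000, JBIG2Decode is JBIG2, CCITTFaxDecode is CCITT fax). The following is a rough sketch, not a real PDF parser: it counts those filter names in the raw bytes of a file. The function name `count_image_filters` is made up for illustration, and the scan will miss image dictionaries stored inside compressed object streams, so treat the result only as a hint.

```python
import re
from collections import Counter

# PDF stream filter names that correspond to image codecs.
IMAGE_FILTERS = {
    b"DCTDecode": "JPEG",
    b"JPXDecode": "JPEG 2000",
    b"JBIG2Decode": "JBIG2",
    b"CCITTFaxDecode": "CCITT fax",
}

# Matches any of the filter names above when written as a PDF name (/Name).
FILTER_PATTERN = re.compile(rb"/(DCTDecode|JPXDecode|JBIG2Decode|CCITTFaxDecode)\b")


def count_image_filters(pdf_bytes: bytes) -> Counter:
    """Count image-codec filter names found in raw PDF bytes (hypothetical helper)."""
    counts = Counter()
    for name in FILTER_PATTERN.findall(pdf_bytes):
        counts[IMAGE_FILTERS[name]] += 1
    return counts
```

To try it on a file, pass the file contents, for example `count_image_filters(open("myfile.pdf", "rb").read())`. Images compressed only with FlateDecode or LZWDecode are deliberately not counted, because those filters are also used for non-image streams.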

Solution 2:

The picture is in portable pixmap file format. (See Wikipedia: Netpbm format for details).
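
The .ppm file is what the extraction tool wrote, so it says nothing about how the image was stored in the PDF. As a small illustration, the sketch below identifies a file by its leading signature bytes and computes the size of an uncompressed RGB raster. The names `sniff_format` and `raw_raster_bytes` are invented for this example, and the signature table covers only a few common formats.

```python
# Leading "magic" bytes of a few common image file formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"BM", "BMP"),
    (b"II*\x00", "TIFF"),
    (b"MM\x00*", "TIFF"),
    (b"P6", "PPM (binary)"),
    (b"P3", "PPM (plain)"),
    (b"P5", "PGM (binary)"),
    (b"P4", "PBM (binary)"),
]


def sniff_format(header: bytes) -> str:
    """Guess the file format from the first bytes of a file (hypothetical helper)."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def raw_raster_bytes(width: int, height: int, channels: int = 3,
                     bytes_per_sample: int = 1) -> int:
    """Size in bytes of an uncompressed raster with the given dimensions."""
    return width * height * channels * bytes_per_sample
```

For the 870x1181 8-bit image in the question, `raw_raster_bytes(870, 1181)` gives 3082410, which matches the 3.082MB reported by identify. That is what an uncompressed pixel dump looks like; it does not indicate an embedded .bmp.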

You can use the netpbm tools to convert these files to BMP.
The syntax for that is: ppmtobmp images-000.ppm > images-000.bmp.

http://netpbm.sourceforge.net/ is the homepage for netpbm.

Are there multiple images in a document? Or can we just search the PDF for the line with identify images-000.ppm, cut the file from that location and feed it to ppmtobmp? It should not be hard to automate that.
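
One possible way to automate the conversion for a whole directory of extracted files is sketched below. It assumes ppmtobmp is installed and on the PATH; the helper names `bmp_path_for` and `convert_all` are made up for this example.

```python
import subprocess
from pathlib import Path


def bmp_path_for(ppm_path: Path) -> Path:
    """Return the output path: same name with a .bmp extension."""
    return ppm_path.with_suffix(".bmp")


def convert_all(directory: str) -> list:
    """Run ppmtobmp on every .ppm file in `directory`; return the BMP paths written."""
    written = []
    for src in sorted(Path(directory).glob("*.ppm")):
        dst = bmp_path_for(src)
        with open(dst, "wb") as out:
            # ppmtobmp writes the BMP to standard output, so redirect it to the file.
            subprocess.run(["ppmtobmp", str(src)], stdout=out, check=True)
        written.append(dst)
    return written
```

Calling `convert_all(".")` would convert images-000.ppm, images-001.ppm, and so on in the current directory, raising an error if ppmtobmp fails on any file.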

Solution 3:

`pdfimages -list`

Running pdfimages -list myfile.pdf (pdfimages from poppler-utils) prints a table whose enc column shows how each image is encoded inside the PDF.
In the example below, taken from a PDF generated by a scanner (text pages, no color, 300 dpi images), the enc column reads jbig2:

page  num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
-------------------------------------------------------------------------------------------
   1    0 image    2340  1654  gray    1   1  jbig2  no        20  0   283   142 39.7K 8.4%
   2    1 image    2340  1654  gray    1   1  jbig2  no        25  0   283   142 41.5K 8.8%
   3    2 image    2340  1654  gray    1   1  jbig2  no        30  0   283   142 43.1K 9.1%
   4    3 image    2340  1654  gray    1   1  jbig2  no        35  0   283   142 46.9K  10%

In this case the format is jbig2; the manual (help) describes the matching option:

-jbig2
Write images in JBIG2 format as JBIG2 files instead of the default format. JBIG2 data in PDF is of the embedded type. The embedded type of JBIG2 has an optional separate file containing global data. The embedded data is written with the extension .jb2e and the global data (if available) will be written to the same image number with the extension .jb2g. The content of both these files is identical to the JBIG2 data in the PDF.

You can extract them with the command

pdfimages myfile.pdf -jbig2 A

Note. A is the base name for the extracted images; you will obtain the files A-000.jb2e, A-001.jb2e, and so on. For other formats, use the corresponding option (-png, -tiff, ...).
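
For many files it can be handy to read the enc column programmatically. The sketch below parses the kind of table shown earlier in this answer. It assumes the 16-column layout of that example (newer or older pdfimages versions may print different columns), and the names `parse_pdfimages_list` and `encodings` are hypothetical.

```python
from collections import Counter

# Column names as they appear in the example table ("object ID" is two columns).
FIELDS = ["page", "num", "type", "width", "height", "color", "comp", "bpc",
          "enc", "interp", "object", "ID", "x-ppi", "y-ppi", "size", "ratio"]

SAMPLE = """
page  num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
-------------------------------------------------------------------------------------------
   1    0 image    2340  1654  gray    1   1  jbig2  no        20  0   283   142 39.7K 8.4%
   2    1 image    2340  1654  gray    1   1  jbig2  no        25  0   283   142 41.5K 8.8%
   3    2 image    2340  1654  gray    1   1  jbig2  no        30  0   283   142 43.1K 9.1%
   4    3 image    2340  1654  gray    1   1  jbig2  no        35  0   283   142 46.9K  10%
"""


def parse_pdfimages_list(text: str) -> list:
    """Turn the listing into one dict per image, keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    rows = []
    for line in lines[2:]:  # skip the header line and the dashed rule
        values = line.split()
        if len(values) != len(FIELDS):
            continue  # unexpected layout: skip rather than guess
        rows.append(dict(zip(FIELDS, values)))
    return rows


def encodings(rows: list) -> Counter:
    """Count how many images use each encoding."""
    return Counter(row["enc"] for row in rows)
```

With the sample above, `encodings(parse_pdfimages_list(SAMPLE))` counts four jbig2 images. For a real file, the text could come from `subprocess.run(["pdfimages", "-list", "myfile.pdf"], capture_output=True, text=True).stdout`.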

Automatic extraction: `-all` option

pdfimages MyFile.pdf -all B

-all
Write JPEG, JPEG2000, JBIG2, and CCITT images in their native format. CMYK files are written as TIFF files. All other images are written as PNG files. This is equivalent to specifying the options -png -tiff -j -jp2 -jbig2 -ccitt.

In this case B is the base for the name of the extracted images.

Note. A PNG output file does not mean the image was stored as PNG inside the PDF. Check the -list output to see how each image was actually encoded and which files were converted to PNG on extraction.
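
The -all rule quoted above can be turned into a small lookup that predicts, from the enc and color columns of the listing, what kind of file to expect. This is only a sketch of that rule: the value jbig2 comes from the example table, while jpeg, jp2, ccitt, image and cmyk are assumed to be the spellings pdfimages prints. Checking the encoding before the color space is also an assumption, and `predicted_all_output` is a made-up name.

```python
# Encodings that -all writes unchanged (assumed spellings of the enc column).
NATIVE_ENCODINGS = {"jpeg", "jp2", "jbig2", "ccitt"}


def predicted_all_output(enc: str, color: str) -> str:
    """Predict the kind of file `pdfimages -all` writes for one listed image."""
    if enc in NATIVE_ENCODINGS:
        return "native (" + enc + ")"  # data copied as stored in the PDF
    if color == "cmyk":
        return "tiff"                  # CMYK images are written as TIFF
    return "png"                       # everything else is converted to PNG
```

For example, `predicted_all_output("jbig2", "gray")` returns `"native (jbig2)"`, while `predicted_all_output("image", "rgb")` returns `"png"`, meaning the PNG is a conversion and not the stored data.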
