How to programmatically determine DPI of images in PDF file?

Solution 1:

Main answer

Since I am interested in the same kind of job (though not to OCR the PDF files directly, but to convert them to DjVu and then OCR them), I found this question and its answers lacking: I had to guess the DPI of the images from their pixel dimensions and the page size reported by pdfinfo, or resort to other tricks, not to mention that the images inside a PDF may have different densities.
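For reference, the guess mentioned above boils down to dividing the pixel count by the page length in inches (PDF points are 1/72 inch). Here is a rough sketch; estimate_dpi is a hypothetical helper name of mine, and it assumes the image fills the whole page, which is often not true:

```shell
# Hypothetical helper: estimate DPI from an image dimension in pixels
# and the matching page dimension in points (as reported by pdfinfo).
# Assumption: the image spans the full page along that dimension.
estimate_dpi() {
  awk -v px="$1" -v pt="$2" 'BEGIN { printf "%.0f\n", px / (pt / 72) }'
}

# Example: a 2550-pixel-wide scan on a 612 pt (8.5 in) wide page
estimate_dpi 2550 612
```

This only gives one number per page, so it cannot tell apart several images with different densities.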

After a lot more research, I found that you can use pdfimages (from the poppler-utils package) as follows:

$ pdfimages -list deptest.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     100   100  gray    1   1  image  no         9  0    53    53  169B  14%
   2     1 image     100   100  gray    1   1  ccitt  no   [inline]      53    53  698B  56%

Notice the x-ppi and y-ppi columns in the listing above. The enc column also shows the format in which each image is stored in the PDF, which is cool (sometimes it is JBIG2, sometimes JPEG2000, etc.).

Note: The file deptest.pdf used above is available from pdfsizeopt's repository.
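If you want only the densities in a script, you can filter the listing with awk. The sketch below is one way to do it; list_ppi is a hypothetical helper name, and it assumes the column layout shown above. It counts fields from the end of each row because the "object ID" column holds either two fields (e.g. "9  0") or one ("[inline]"):

```shell
# Hypothetical helper: read a `pdfimages -list` listing on stdin and
# print "page num x-ppi y-ppi" for every image row.
# Assumption: the last four columns are x-ppi, y-ppi, size, ratio.
list_ppi() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 6 { print $1, $2, $(NF-3), $(NF-2) }'
}

# Usage (requires poppler-utils):
#   pdfimages -list deptest.pdf | list_ppi
```

The header and the dashed separator are skipped because their first field is not a page number.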

The real action

After that, you can simply extract the images with pdfimages itself, or use pdftoppm (also from poppler-utils) to render entire pages in one of several formats (e.g., TIFF, for OCR with tesseract).

You can use something like the following (assuming you have created a directory named imgs where you will put your images):

pdfimages -png Faraway-PRA.pdf imgs/prefix

The files will be created inside the directory imgs with names starting with prefix, as in:

$ ls imgs
prefix-000.png  prefix-047.png  prefix-094.png  prefix-141.png
prefix-001.png  prefix-048.png  prefix-095.png  prefix-142.png
prefix-002.png  prefix-049.png  prefix-096.png  prefix-143.png
prefix-003.png  prefix-050.png  prefix-097.png  prefix-144.png
(...)
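If you would rather render whole pages with pdftoppm, as mentioned earlier, something along these lines should work. This is a sketch only: render_pages is a hypothetical wrapper name, and the 300 DPI default is my assumption, so pick the value that matches the x-ppi/y-ppi reported by pdfimages -list:

```shell
# Hypothetical wrapper: render every page of a PDF to TIFF files
# named <outdir>/page-*.tif at the given resolution.
# Arguments: PDF file, output directory, DPI (assumed default: 300).
render_pages() {
  mkdir -p "$2" || return 1
  pdftoppm -r "${3:-300}" -tiff "$1" "$2/page"
}

# Usage (requires poppler-utils):
#   render_pages Faraway-PRA.pdf imgs 300
```

Rendering at a much higher resolution than the embedded images only makes the files bigger without adding detail.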

You can then post-process the images as you see fit, with tools like scantailor or whatever you prefer.

More direct answer

If you just want to OCR a PDF file, you can use a program that is well-maintained and already packaged, namely ocrmypdf.

Solution 2:

I needed this information and just found it here:

http://www.wizards-toolkit.org/discourse-server/viewtopic.php?t=16110

This technique also uses ImageMagick:

identify -format "%w x %h %x x %y" DAT_1.tif

The output is the size of the image in pixels, followed by its horizontal and vertical resolution (DPI):

2480 x 3507 300 x 300
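As a sanity check, dividing the pixel size by the resolution gives the physical size. Here is a small sketch; physical_size is a hypothetical helper name, and it assumes the output has exactly the shape shown above and that the resolution is in pixels per inch (ImageMagick can also report pixels per centimeter):

```shell
# Hypothetical helper: read "W x H XRES x YRES" on stdin and print
# the physical size in inches.
# Assumption: XRES and YRES are plain numbers in pixels per inch.
physical_size() {
  awk '{ printf "%.2f x %.2f in\n", $1 / $4, $3 / $6 }'
}

# Usage (requires ImageMagick):
#   identify -format "%w x %h %x x %y\n" DAT_1.tif | physical_size
```

For the output above this gives 8.27 x 11.69 in, i.e. an A4 page scanned at 300 DPI.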

Solution 3:

I use the following command:

convert MyPDF.pdf -print "Size: %wx%h\n" /dev/null

and it returns:

Size: 380x380