How to programmatically determine DPI of images in PDF file?
Solution 1:
Main answer
I came across this question because I have a similar job to do (not necessarily OCRing the PDF files directly, but converting them to DjVu and then OCRing them), and I found the existing responses lacking: I had to guess the DPI of the images from their pixel counts and the page size reported by pdfinfo, or resort to other tricks. On top of that, the images inside a single PDF may have different densities.
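For reference, the guessing approach mentioned above boils down to simple arithmetic: pdfinfo reports the page size in points (1 pt = 1/72 inch), so the density is roughly pixels * 72 / points. Below is a minimal sketch of that calculation; the function name dpi_from_points is hypothetical, and it assumes the image spans the full page width, which is often not true.

```shell
# Hypothetical helper: estimate DPI from a pixel count and a length in PDF points.
# $1 = image width (or height) in pixels
# $2 = the corresponding page dimension in points, as printed by pdfinfo
# Assumption: the image covers the whole page along that dimension.
dpi_from_points() {
  awk -v px="$1" -v pt="$2" 'BEGIN { printf "%.0f\n", px * 72 / pt }'
}

# Example: a 2550-pixel-wide scan on a US Letter page (612 pt wide)
dpi_from_points 2550 612
```

This is exactly the kind of guesswork that breaks down when a page holds several images or an image covers only part of the page.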
After a lot more research, I found that you can use pdfimages (from the poppler-utils package) like this:
$ pdfimages -list deptest.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     100   100  gray    1   1  image  no         9  0    53    53  169B  14%
   2     1 image     100   100  gray    1   1  ccitt  no  [inline]       53    53  698B  56%
Notice the x-ppi and y-ppi columns in the listing above. It also lists the format in which each image is stored in the PDF (the enc column), which is cool: sometimes it is JBIG2, sometimes JPEG 2000, etc.
Note: The file deptest.pdf used above is available from pdfsizeopt's repository.
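If you want only the densities in a script, the listing can be filtered with awk. The following is a minimal sketch, not part of poppler: the function name ppi_from_list is hypothetical. It counts columns from the right because inline images show a single "[inline]" token where other rows have separate object and ID values, so the field count differs between rows.

```shell
# Hypothetical helper: read `pdfimages -list` output on stdin and print
# "page num x-ppi y-ppi" for every image row.
# The first two lines (column header and dashes) are skipped.
# x-ppi and y-ppi are the 4th and 3rd fields from the end of each row.
ppi_from_list() {
  awk 'NR > 2 && NF >= 15 { print $1, $2, $(NF-3), $(NF-2) }'
}

# Usage (assumes poppler-utils is installed and deptest.pdf is present):
#   pdfimages -list deptest.pdf | ppi_from_list
```

For the two rows shown in the listing above, this would print "1 0 53 53" and "2 1 53 53".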
The real action
After that, you can simply extract the images with pdfimages itself, or use pdftoppm (also from poppler-utils) to render entire pages in whichever format you like (e.g., TIFF, for OCR with tesseract).
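When rendering whole pages with pdftoppm you have to choose a resolution yourself. One possible heuristic (an assumption on my part, not something the tools prescribe) is to render at the highest x-ppi found in the listing, so that no embedded image is downsampled. A minimal sketch, with the hypothetical function name max_ppi:

```shell
# Hypothetical helper: read `pdfimages -list` output on stdin and print
# the largest x-ppi value found (0 if there are no image rows).
# x-ppi is the 4th field from the end of each row.
max_ppi() {
  awk 'NR > 2 && NF >= 15 { v = $(NF-3) + 0; if (v > max) max = v }
       END { print max + 0 }'
}

# Usage (input.pdf and imgs/page are placeholders; check that the result
# is not 0 before passing it to -r):
#   pdftoppm -r "$(pdfimages -list input.pdf | max_ppi)" -tiff input.pdf imgs/page
```

Here -r sets the rendering resolution in DPI and -tiff selects TIFF output.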
You can use something like the following (assuming you have already created a directory named imgs to hold the images):
pdfimages -png Faraway-PRA.pdf imgs/prefix
The files will be created inside the directory imgs with names starting with prefix, as in:
$ ls imgs
prefix-000.png prefix-047.png prefix-094.png prefix-141.png
prefix-001.png prefix-048.png prefix-095.png prefix-142.png
prefix-002.png prefix-049.png prefix-096.png prefix-143.png
prefix-003.png prefix-050.png prefix-097.png prefix-144.png
(...)
You can then perform any surgery that you see fit with tools like scantailor or whatever you like.
More direct answer
If you just want to OCR a PDF file, you can use a program that is well-maintained and already packaged, namely ocrmypdf.
Solution 2:
I needed this information and just found it here:
http://www.wizards-toolkit.org/discourse-server/viewtopic.php?t=16110
This technique also uses ImageMagick:
identify -format "%w x %h %x x %y" DAT_1.tif
The output is the size of the image in pixels, followed by its horizontal and vertical resolution (DPI):
2480 x 3507 300 x 300
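As a sanity check, dividing the pixel size by the resolution gives the physical size of the scan. This is a minimal sketch with a hypothetical function name, size_in_inches; it assumes the line has exactly the format produced by the identify command above and that the resolution is expressed in pixels per inch.

```shell
# Hypothetical helper: read a line like "2480 x 3507 300 x 300" on stdin
# (width x height, then x and y resolution) and print the size in inches.
# Assumption: the resolution values are in pixels per inch.
size_in_inches() {
  awk '{ printf "%.2f x %.2f in\n", $1 / $4, $3 / $6 }'
}

# The sample output above works out to roughly an A4 page:
echo "2480 x 3507 300 x 300" | size_in_inches
```

This prints "8.27 x 11.69 in".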
Solution 3:
I use the following command:
convert MyPDF.pdf -print "Size: %wx%h\n" /dev/null
and it returns:
Size: 380x380