Extract a page from a pdf as a jpeg
Solution 1:
The pdf2image library can be used.
You can install it simply using,
pip install pdf2image
Once installed you can use following code to get images.
from pdf2image import convert_from_path
pages = convert_from_path('pdf_file', 500)
Saving pages in jpeg format
for page in pages:
page.save('out.jpg', 'JPEG')
Edit: the Github repo pdf2image also mentions that it uses pdftoppm
and that it requires other installations:
pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler. Windows users will have to install poppler for Windows. Mac users will have to install poppler for Mac. Linux users will have pdftoppm pre-installed with the distro (Tested on Ubuntu and Archlinux) if it's not, run
sudo apt install poppler-utils
.
You can install the latest version under Windows using anaconda by doing:
conda install -c conda-forge poppler
note: Windows versions upto 0.67 are available at http://blog.alivate.com.au/poppler-windows/ but note that 0.68 was released in Aug 2018 so you'll not be getting the latest features or bug fixes.
Solution 2:
I found this simple solution, PyMuPDF, output to png file. Note the library is imported as "fitz", a historical name for the rendering engine it uses.
import fitz
pdffile = "infile.pdf"
doc = fitz.open(pdffile)
page = doc.loadPage(0) # number of page
pix = page.get_pixmap()
output = "outfile.png"
pix.save(output)
Solution 3:
The Python library pdf2image
(used in the other answer) in fact doesn't do much more than just launching pdttoppm
with subprocess.Popen
, so here is a short version doing it directly:
PDFTOPPMPATH = r"D:\Documents\software\____PORTABLE\poppler-0.51\bin\pdftoppm.exe"
PDFFILE = "SKM_28718052212190.pdf"
import subprocess
subprocess.Popen('"%s" -png "%s" out' % (PDFTOPPMPATH, PDFFILE))
Here is the Windows installation link for pdftoppm
(contained in a package named poppler): http://blog.alivate.com.au/poppler-windows/.
Solution 4:
There is no need to install Poppler on your OS. This will work:
pip install Wand
from wand.image import Image
f = "somefile.pdf"
with(Image(filename=f, resolution=120)) as source:
for i, image in enumerate(source.sequence):
newfilename = f[:-4] + str(i + 1) + '.jpeg'
Image(image).save(filename=newfilename)
Solution 5:
@gaurwraith, install poppler for Windows and use pdftoppm.exe as follows:
Download zip file with Poppler's latest binaries/dlls from http://blog.alivate.com.au/poppler-windows/ and unzip to a new folder in your program files folder. For example: "C:\Program Files (x86)\Poppler".
Add "C:\Program Files (x86)\Poppler\poppler-0.68.0\bin" to your SYSTEM PATH environment variable.
From cmd line install pdf2image module -> "pip install pdf2image".
- Or alternatively, directly execute pdftoppm.exe from your code using Python's subprocess module as explained by user Basj.
@vishvAs vAsuki, this code should generate the jpgs you want through the subprocess module for all pages of one or more pdfs in a given folder:
import os, subprocess
pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)
pdftoppm_path = r"C:\Program Files (x86)\Poppler\poppler-0.68.0\bin\pdftoppm.exe"
for pdf_file in os.listdir(pdf_dir):
if pdf_file.endswith(".pdf"):
subprocess.Popen('"%s" -jpeg %s out' % (pdftoppm_path, pdf_file))
Or using the pdf2image module:
import os
from pdf2image import convert_from_path
pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)
for pdf_file in os.listdir(pdf_dir):
if pdf_file.endswith(".pdf"):
pages = convert_from_path(pdf_file, 300)
pdf_file = pdf_file[:-4]
for page in pages:
page.save("%s-page%d.jpg" % (pdf_file,pages.index(page)), "JPEG")