Reading the PDF properties/metadata in Python

Solution 1:

Try pdfminer:

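from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

# Open in binary mode; the with-block closes the file when done.
with open('diveintopython.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)

    # doc.info is a list of dicts holding the "Info" metadata.
    print(doc.info)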

Here's the output:

>>> [{'CreationDate': 'D:20040520151901-0500',
  'Creator': 'DocBook XSL Stylesheets V1.52.2',
  'Keywords': 'Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free',
  'Producer': 'htmldoc 1.8.23 Copyright 1997-2002 Easy Software Products, All Rights Reserved.',
  'Title': 'Dive Into Python'}]

For more info, look at this tutorial: A lightweight XMP parser for extracting PDF metadata in Python.
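
The CreationDate value in the output is a PDF date string (D:YYYYMMDDHHmmSS plus an optional UTC offset). Here is a minimal sketch for turning it into a datetime, using only the standard library. The name parse_pdf_date is my own, not part of pdfminer, and the pattern only covers the common forms, with or without apostrophes in the offset:

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z, +HH'mm' or -HH'mm'.
# The apostrophes in the offset are optional because producers differ.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(\d{2})?'?(\d{2})?'?)?$"
)


def parse_pdf_date(value):
    """Parse a PDF date string into a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError("not a PDF date string: %r" % (value,))

    # Missing fields fall back to the start of the period.
    defaults = (0, 1, 1, 0, 0, 0)
    fields = [int(g) if g else d for g, d in zip(match.groups()[:6], defaults)]

    sign, tz_hours, tz_minutes = match.group(7, 8, 9)
    if sign is None:
        tz = None  # no offset given: return a naive datetime
    elif sign in "Zz":
        tz = timezone.utc
    else:
        offset = timedelta(hours=int(tz_hours or 0), minutes=int(tz_minutes or 0))
        tz = timezone(-offset if sign == "-" else offset)

    return datetime(*fields, tzinfo=tz)


print(parse_pdf_date("D:20040520151901-0500").isoformat())
```

Newer pdfminer versions may return the value as bytes rather than str; in that case decode it before passing it in.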

Solution 2:

For Python 3, use PyPDF2. Here is @Khaleel's example code, updated:

from PyPDF2 import PdfFileReader

# Note: PyPDF2 3.0 removed these names; there, use PdfReader and reader.metadata.
with open("test.pdf", "rb") as f:
    pdf_toread = PdfFileReader(f)
    pdf_info = pdf_toread.getDocumentInfo()
    print(pdf_info)

Install it with pip install PyPDF2.

Solution 3:

For Python 3 with the pdfminer3k port (pip install pdfminer3k):

from pdfminer.pdfparser import PDFParser, PDFDocument

with open("foo.pdf", 'rb') as fp:
    parser = PDFParser(fp)
    # In pdfminer3k the document is created empty and then linked to the parser.
    doc = PDFDocument()
    parser.set_document(doc)
    doc.set_parser(parser)
    # doc.info is a list; the first entry holds the "Info" metadata.
    if doc.info:
        info = doc.info[0]
        print(info)

Solution 4:

Note (pointed out by Morten Zilmer): the pyPdf homepage says the library is no longer maintained.

I implemented this using pyPdf. See the sample code below.

from pyPdf import PdfFileReader
pdf_toread = PdfFileReader(open("doc2.pdf", "rb"))
pdf_info = pdf_toread.getDocumentInfo()
print(str(pdf_info))

Output:

{'/Title': u'Microsoft Word - Agnico-Eagle - Complaint (00040197-2)',
 '/CreationDate': u"D:20111108111228-05'00'",
 '/Producer': u'Acrobat Distiller 10.0.0 (Windows)',
 '/ModDate': u"D:20111108112409-05'00'",
 '/Creator': u'PScript5.dll Version 5.2.2',
 '/Author': u'LdelPino'}
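
Note that the keys here carry a leading slash (/Title), while the pdfminer output in Solution 1 uses plain names (Title). If you want the same lookup code to work with either library, a small sketch like the following can normalize the keys. normalize_info is a hypothetical helper of my own, and it assumes the info object behaves like a dict:

```python
def normalize_info(info):
    """Return a plain dict whose keys have no leading '/' (hypothetical helper)."""
    normalized = {}
    for key, value in info.items():
        key = str(key)
        # Strip exactly one leading slash, as in '/Title' -> 'Title'.
        if key.startswith("/"):
            key = key[1:]
        normalized[key] = value
    return normalized


sample = {"/Title": "Example", "/Author": "LdelPino"}
print(normalize_info(sample)["Title"])
```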
