How to extract text from a PDF file?
I'm trying to extract the text included in this PDF file using Python.
I'm using the PyPDF2 module, and have the following script:
import PyPDF2
pdf_file = open('sample.pdf')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
print page_content
When I run the code, I get the following output, which differs from the text in the PDF document:
!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%
How can I extract the text as it appears in the PDF document?
Solution 1:
I was looking for a simple solution for Python 3.x on Windows. textract doesn't seem to support that combination, which is unfortunate, but if you need a simple solution for Windows/Python 3, check out the tika package. It is really straightforward for reading PDFs.
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
from tika import parser # pip install tika
raw = parser.from_file('sample.pdf')
print(raw['content'])
Note that Tika is written in Java, so you will need a Java runtime installed.
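One practical wrinkle with the snippet above: the `content` entry of the returned dictionary can be `None` when Tika finds no extractable text (for example, a scanned PDF without a text layer), and the text often comes padded with blank lines. Here is a minimal sketch of a defensive wrapper. The names `clean_tika_content` and `extract_with_tika` are made up for illustration, and the import is deferred so the cleaning step works without Tika installed.

```python
def clean_tika_content(parsed):
    """Return the text from a Tika result dict, or "" if there is none."""
    # 'content' may be missing or None, so fall back to an empty string.
    content = (parsed or {}).get("content") or ""
    # Strip each line and drop the empty ones.
    lines = (line.strip() for line in content.splitlines())
    return "\n".join(line for line in lines if line)


def extract_with_tika(path):
    """Hypothetical wrapper: parse a file with Tika and clean the result."""
    from tika import parser  # needs `pip install tika` and a Java runtime
    return clean_tika_content(parser.from_file(path))
```

Calling `extract_with_tika('sample.pdf')` would then return a plain string, empty if nothing could be extracted.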
Solution 2:
Use textract.
- http://textract.readthedocs.io/en/latest/
- https://github.com/deanmalmgren/textract
It supports many types of files, including PDFs:
import textract
text = textract.process("path/to/file.extension")
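Keep in mind that under Python 3 `textract.process` returns `bytes`, not `str`, so the result usually needs decoding before you print or search it. A small sketch, assuming UTF-8 output (the helper names `decode_extracted` and `extract_with_textract` are hypothetical):

```python
def decode_extracted(data, encoding="utf-8"):
    """Turn the bytes returned by an extractor into a str."""
    if isinstance(data, str):
        return data
    # errors="replace" substitutes undecodable bytes instead of raising.
    return data.decode(encoding, errors="replace")


def extract_with_textract(path):
    """Hypothetical wrapper around textract.process."""
    import textract  # deferred import; needs `pip install textract`
    return decode_extracted(textract.process(path))
```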
Solution 3:
I recommend using pymupdf or pdfminer.six.
These packages are not maintained:
- PyPDF2, PyPDF3, PyPDF4
- pdfminer (without .six)
How to read pure text with pymupdf
There are different options which will give different results, but the most basic one is:
import fitz  # this is pymupdf

with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        # get_text() is the current name; older releases called it getText()
        text += page.get_text()

print(text)
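If you want the loop as a reusable function, something along these lines should work. It is only a sketch: `collect_text` and `extract_with_pymupdf` are names I made up, and it relies on `get_text()` (called `getText()` in older PyMuPDF releases). The joining step accepts any iterable of objects with a `get_text()` method, so it can be tried without PyMuPDF installed.

```python
def collect_text(pages, separator="\n"):
    """Join the text of every page; each page must offer get_text()."""
    # Joining once avoids repeated string concatenation in a loop.
    return separator.join(page.get_text() for page in pages)


def extract_with_pymupdf(path):
    """Hypothetical wrapper: read all page text from a PDF with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily
    with fitz.open(path) as doc:
        return collect_text(doc)
```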
Other PDF libraries
- pikepdf does not support text extraction (source)
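For the pdfminer.six option recommended at the top of this answer, a minimal sketch could look like the following. It uses `extract_text` from `pdfminer.high_level`. As far as I know the returned text contains a form feed character after each page, which the sketch uses to split the result into pages; treat that as an assumption and check it against your own files. The helper names are hypothetical.

```python
def split_pages(text):
    """Split extracted text on form feeds, dropping empty pages."""
    return [page.strip() for page in text.split("\f") if page.strip()]


def extract_with_pdfminer(path):
    """Hypothetical wrapper: return a list with one string per page."""
    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    return split_pages(extract_text(path))
```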
Solution 4:
Look at this code:
import PyPDF2
# Open in binary mode; a PDF is not a plain-text file.
pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
# Python 3: the string is already Unicode, so no encode() is needed.
print(page_content)
The output is:
!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%
Using the same code to read 201308FCR.pdf, the output is normal.
The docstring of extractText explains why:
def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text. This works well for some PDF
    files, but poorly for others, depending on the generator used. This will
    be refined in the future. Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """
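Since the result depends on how the PDF was generated, it can help to check automatically whether the extracted text looks usable before relying on it. Below is a rough heuristic sketch, not anything PyPDF2 provides: it measures how many of the visible characters are letters. The function names and the 0.5 threshold are my own assumptions, and text that is mostly numbers (tables, for instance) would be flagged wrongly.

```python
def letter_ratio(text):
    """Fraction of non-whitespace characters that are letters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch.isalpha() for ch in visible) / len(visible)


def looks_garbled(text, threshold=0.5):
    """Heuristic: natural-language text is mostly letters.

    Empty text has a ratio of 0.0 and is therefore reported as garbled too.
    """
    return letter_ratio(text) < threshold
```

The garbled output shown above contains no letters at all, so `looks_garbled` flags it, and you could then fall back to another extractor.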