How to extract text from a PDF in a script on Linux?
On Linux, how can I extract text from a .pdf in which the text really is text, not a scanned image?
I want something I can use on the command line / in a script, not interactively.
(I don't want to convert to .tif and use OCR - text is already available in the .pdf file, so why introduce inaccuracies from imperfect OCR?)
pdftotext, which comes with poppler, will try to extract any text found in the PDF.
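For scripting, a minimal sketch of how pdftotext might be wrapped. The function names pdf_to_text and pdf_dir_to_text are made up for illustration; the -layout option and the trailing "-" (write to stdout) are standard pdftotext usage, but check your version's man page.

```shell
# pdf_to_text FILE.pdf: print the PDF's embedded text on stdout.
# -layout tries to keep the physical layout; "-" means "write to stdout".
pdf_to_text() {
    pdftotext -layout "$1" -
}

# pdf_dir_to_text DIR: write DIR/name.txt next to every DIR/name.pdf.
# The body runs in a subshell so the loop variable does not leak.
pdf_dir_to_text() (
    for pdf in "$1"/*.pdf; do
        [ -e "$pdf" ] || continue   # no matches: the glob stays literal
        pdftotext -layout "$pdf" "${pdf%.pdf}.txt"
    done
)
```

Usage would be something like `pdf_to_text report.pdf | grep -i total`, or `pdf_dir_to_text ~/papers` to convert a whole directory.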
Ignacio's answer is just fine. In fact, it'd be the first thing on my list. Well, that and perhaps to suggest the pdftohtml tool that also comes with poppler, combined with pdfreflow if you want to try to reassemble the text into paragraphs, etc. (Of course, this will give you HTML output, but converting HTML to plain text can be done in many ways.)
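As a rough sketch of the pdftohtml route: the function name pdf_to_text_via_html is hypothetical, and w3m is just one assumed choice of HTML-to-text converter (lynx or html2text would do as well). pdfreflow is left out because I'm not sure of its exact invocation.

```shell
# pdf_to_text_via_html FILE.pdf: convert to a single HTML document on stdout,
# then render that HTML as plain text.
#   -i        ignore images
#   -noframes produce one HTML document instead of a frameset
#   -stdout   write the HTML to stdout
# Assumes w3m is installed; "-T text/html" tells it that stdin is HTML.
pdf_to_text_via_html() {
    pdftohtml -i -noframes -stdout "$1" | w3m -dump -T text/html
}
```

Something like `pdf_to_text_via_html paper.pdf > paper.txt` would then give plain text, with whatever paragraph structure the HTML renderer manages to preserve.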
Here are some other options too.
The ebook-convert command line tool from Calibre, which can convert PDFs to plain text (or RTF or a number of ebook formats, like ePub, etc.)
podofotxtextract from PoDoFo
AbiWord can be called from the command line to convert between any of the formats it can import and export, and with the appropriate import plugin, this includes PDFs:
abiword --to=txt file.pdf
(In fairness, I think AbiWord and calibre both use the poppler libraries, but I'm not positive.)
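If a script has to run on machines where you don't know which of these is installed, one possible approach is to try them in turn. This is only a sketch: pdf_to_text_any is a made-up name, ebook-convert picks its output format from the output file's extension (so the output name should end in .txt), and I'm assuming your AbiWord build supports --to-name for choosing the output file.

```shell
# pdf_to_text_any IN.pdf OUT.txt: try each converter until one succeeds.
pdf_to_text_any() {
    local in out
    in=$1
    out=$2

    # First choice: pdftotext from poppler.
    if command -v pdftotext >/dev/null 2>&1 && pdftotext "$in" "$out"; then
        return 0
    fi
    # Second choice: Calibre's ebook-convert (format inferred from ".txt").
    if command -v ebook-convert >/dev/null 2>&1 && ebook-convert "$in" "$out"; then
        return 0
    fi
    # Third choice: AbiWord with its PDF import plugin (--to-name is assumed).
    if command -v abiword >/dev/null 2>&1 &&
        abiword --to=txt --to-name="$out" "$in"; then
        return 0
    fi

    echo "no working PDF-to-text converter found" >&2
    return 1
}
```

For example, `pdf_to_text_any file.pdf file.txt || exit 1` in a larger script.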