How do I extract text from a PDF in a script on Linux?

On Linux, how can I extract text from a PDF in which the text really is text, not a scanned image? I want something I can use on the command line or in a script, not interactively. (I don't want to convert to .tif and use OCR: the text is already available in the PDF file, so why introduce inaccuracies from imperfect OCR?)


pdftotext, which comes with poppler, will try to extract any text found in the PDF.
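As a rough sketch of how that might look in a script, the function below runs pdftotext over every PDF in a directory and writes a .txt file next to each one. The function name extract_dir is made up for illustration, and the -layout flag (which asks pdftotext to keep the physical page layout) is optional; drop it if you prefer text in reading order.

```shell
#!/bin/sh
# extract_dir is a hypothetical helper, not part of poppler.
# Converts each *.pdf in the given directory (default: current directory)
# to a .txt file with the same base name.
extract_dir() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        # If nothing matched, the glob is left as a literal string; skip it.
        [ -e "$pdf" ] || continue
        pdftotext -layout "$pdf" "${pdf%.pdf}.txt" || echo "failed: $pdf" >&2
    done
}
```

Calling extract_dir ~/docs would process the PDFs in ~/docs. For a single file, pdftotext file.pdf - writes the text to standard output instead of a file, which is handy in a pipeline.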


Ignacio's answer is just fine. In fact, it'd be the first thing on my list. Well, that and perhaps the pdftohtml tool that also comes with poppler, combined with pdfreflow if you want to try to reassemble the text into paragraphs, etc. (Of course, this will give you HTML output, but converting HTML to plain text can be done in many ways.)
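One possible way to go from pdftohtml to plain text is to pipe its HTML into a text-mode browser. The sketch below assumes w3m is installed (it is my choice here, not something poppler requires; lynx or another HTML-to-text converter would do as well), and the function name pdf_to_text_via_html is invented for the example. It does not use pdfreflow.

```shell
#!/bin/sh
# pdf_to_text_via_html is a hypothetical helper name.
# -i        ignore images
# -noframes produce a single HTML document instead of a frameset
# -stdout   write the HTML to standard output
# w3m then renders the HTML from stdin as plain text.
pdf_to_text_via_html() {
    pdftohtml -i -noframes -stdout "$1" | w3m -dump -T text/html
}
```

Usage would be something like pdf_to_text_via_html file.pdf > file.txt.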

Here are some other options too.

The ebook-convert command-line tool from calibre can convert PDFs to plain text (or RTF, or a number of ebook formats such as ePub).

podofotxtextract from PoDoFo

AbiWord can be called from the command line to convert between any formats it can import and export, and with the appropriate import plugin, this includes PDFs:

abiword --to=txt file.pdf

(In fairness, I think AbiWord and calibre both use the poppler libraries, but I'm not positive.)
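If a script has to run on machines where you don't know which of these tools is installed, one approach is to try them in order of preference. This is only a sketch: the function name pdf_to_txt is made up, the ordering is my own preference, and it assumes the output name ends in .txt (ebook-convert picks the output format from the file extension). The AbiWord branch uses --to-name to set the output file.

```shell
#!/bin/sh
# pdf_to_txt is a hypothetical helper: pdf_to_txt input.pdf output.txt
# Uses the first converter found on this machine.
pdf_to_txt() {
    src=$1
    dst=$2
    if command -v pdftotext >/dev/null 2>&1; then
        pdftotext "$src" "$dst"
    elif command -v ebook-convert >/dev/null 2>&1; then
        ebook-convert "$src" "$dst"
    elif command -v abiword >/dev/null 2>&1; then
        abiword --to=txt --to-name="$dst" "$src"
    else
        echo "no PDF-to-text converter found" >&2
        return 127
    fi
}
```

For example, pdf_to_txt report.pdf report.txt. The output will differ somewhat between tools, so check the results if the exact layout matters.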