How to extract the text from MS Office documents in Linux?

Solution 1:

catdoc can convert .doc, .xls and .ppt files to text. A second option is wvWare.
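
A rough sketch of how the catdoc route could be scripted. It relies on the catdoc package shipping three separate programs (catdoc for .doc, xls2csv for .xls, catppt for .ppt), each of which writes to standard output. The function names pick_tool and extract_text are made up for this example.

```shell
#!/bin/sh
# Pick the catdoc-suite program for a file based on its extension.
# pick_tool and extract_text are hypothetical helper names.
pick_tool() {
    case "$1" in
        *.doc) echo catdoc ;;
        *.xls) echo xls2csv ;;
        *.ppt) echo catppt ;;
        *)     return 1 ;;      # not a format this suite handles
    esac
}

# Print the text of one file on stdout, or fail with a message on stderr.
extract_text() {
    tool=$(pick_tool "$1") || { echo "unsupported: $1" >&2; return 1; }
    command -v "$tool" >/dev/null 2>&1 || { echo "$tool not installed" >&2; return 1; }
    "$tool" "$1"
}
```

With that in place, something like extract_text report.doc > report.txt should do the job. The match is case-sensitive, so REPORT.DOC would be reported as unsupported.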

For more Word-to-text converters, see http://www.linux.com/archive/articles/52385.

Solution 2:

AbiWord can convert between any file formats it knows from the command line.

Convert from Word to plain text:

abiword --to=txt myfile.doc

Make a pdf from a Word file:

abiword --to=pdf myfile.doc

And so on. The results in these cases would be myfile.txt or myfile.pdf. If you want to specify the output name you can do that too:

abiword --to=txt --to-name=output.txt myfile.doc

Convert ODT to Word:

abiword --to=doc myfile.odt

Convert Word to ODT:

abiword --to=odt myfile.doc
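
If you have a whole directory of Word files, the same --to=txt call can be wrapped in a loop. This is only a sketch: the function name doc_to_txt_all and the ABIWORD variable are my own invention, the latter so the loop can be dry-run with echo before touching any files.

```shell
#!/bin/sh
# ABIWORD is a hypothetical override; it defaults to the real abiword binary.
ABIWORD=${ABIWORD:-abiword}

# Convert every .doc file in a directory (default: current) to plain text.
doc_to_txt_all() {
    dir=${1:-.}
    for f in "$dir"/*.doc; do
        # With no matches the pattern stays literal, so skip it.
        [ -e "$f" ] || continue
        # Same invocation as in the answer; AbiWord names the output <name>.txt.
        "$ABIWORD" --to=txt "$f"
    done
}
```

Running ABIWORD=echo first and then doc_to_txt_all ~/docs prints the arguments that would be passed for each file instead of converting anything.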

In fairness to other answers, it should be noted that AbiWord uses wvWare to handle Word documents, but even the wvWare homepage recommends using AbiWord instead for most conversions.

I hate word processors. This is the main reason I have AbiWord installed.

You might also be interested in unoconv, which is a similar tool supporting formats OpenOffice knows (which would include spreadsheets and the like), but I have no experience with it personally.

Solution 3:

I finally found the perfect tool for scripting document parsing: Apache Tika. It can parse a gazillion non-text formats into text, which is very cool!

Get Apache Tika here:

http://tika.apache.org/

(Mac Homebrew users: brew install tika)

The command-line interface works like this:

tika --text something.docx > something.txt
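
Since Tika handles many formats with the same command, a small wrapper can process a mixed pile of files. A sketch only: txt_name, tika_extract_all and the TIKA variable are hypothetical names, and it assumes a tika command on the PATH like the one Homebrew installs (elsewhere you would probably point TIKA at a wrapper script around the tika-app jar).

```shell
#!/bin/sh
# TIKA is a hypothetical override; it defaults to the "tika" command.
TIKA=${TIKA:-tika}

# Map an input name to a .txt name: something.docx -> something.txt
txt_name() {
    printf '%s\n' "${1%.*}.txt"
}

# Extract text from each file given as an argument, writing a .txt beside it.
tika_extract_all() {
    for f in "$@"; do
        # Skip arguments that are not regular files.
        [ -f "$f" ] || { echo "skipping $f" >&2; continue; }
        "$TIKA" --text "$f" > "$(txt_name "$f")"
    done
}
```

Usage would be along the lines of tika_extract_all something.docx slides.pptx. Note that an existing .txt with the same base name gets overwritten.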