Read .doc file with python

Solution 1:

One can use the textract library. It take care of both "doc" as well as "docx"

import textract
text = textract.process("path/to/file.extension")

You can even use 'antiword' (sudo apt-get install antiword) and then convert doc to first into docx and then read through docx2txt.

antiword filename.doc > filename.docx

Ultimately, textract in the backend is using antiword.

Solution 2:

You can use python-docx2txt library to read text from Microsoft Word documents. It is an improvement over python-docx library as it can, in addition, extract text from links, headers and footers. It can even extract images.

You can install it by running: pip install docx2txt.

Let's download and read the first Microsoft document on here:

import docx2txt
my_text = docx2txt.process("test.docx")
print(my_text)

Here is a screenshot of the Terminal output the above code:

enter image description here

EDIT:

This does NOT work for .doc files. The only reason I am keep this answer is that it seems there are people who find it useful for .docx files.