extracting text from MS word files in python

for working with MS word files in python, there is python win32 extensions, which can be used in windows. How do I do the same in linux? Is there any library?

Use the native Python docx module. Here's how to extract all the text from a doc:

document = docx.Document(filename)
docText = '\n\n'.join(
    paragraph.text for paragraph in document.paragraphs
)
print(docText)

See Python DocX site

Also check out Textract which pulls out tables etc.

Parsing XML with regexs invokes cthulu. Don't do it!

You could make a subprocess call to antiword. Antiword is a linux commandline utility for dumping text out of a word doc. Works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as RPM, or you could compile it yourself.

extracting text from MS word files in python

Related

Recent Posts