extracting text from MS word files in python
for working with MS word files in python, there is python win32 extensions, which can be used in windows. How do I do the same in linux? Is there any library?
Use the native Python docx module. Here's how to extract all the text from a doc:
document = docx.Document(filename)
docText = '\n\n'.join(
paragraph.text for paragraph in document.paragraphs
)
print(docText)
See Python DocX site
Also check out Textract which pulls out tables etc.
Parsing XML with regexs invokes cthulu. Don't do it!
You could make a subprocess call to antiword. Antiword is a linux commandline utility for dumping text out of a word doc. Works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as RPM, or you could compile it yourself.