Best way to extract text from a Word doc without using COM/automation?

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platform - that's non-negotiable in this case.)

Antiword looks like a reasonable option, but it may have been abandoned.

A Python solution would be ideal, but doesn't appear to be available.


Solution 1:

(Same answer as extracting text from MS word files in python)

Use the native Python docx module, which I made this week. Here's how to extract all the text from a document:

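from docx import opendocx, getdocumenttext, wordnamespaces

document = opendocx('Hello world.docx')

# This location is where most document content lives
# (not needed for plain-text extraction, shown for reference)
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]

# Extract all text; getdocumenttext returns a list of paragraph strings
print('\n'.join(getdocumenttext(document)))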

See Python DocX site

100% Python, no COM, no .NET, no Java, no parsing serialized XML with regexes.

Solution 2:

I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have wrapped them in Python functions, so they are easy to call from the parsing system (which is written in Python).

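import subprocess

def doc_to_text_catdoc(filename):
    # os.popen3 was removed in Python 3; subprocess works on Python 2 and 3.
    # Passing the arguments as a list avoids shell quoting problems.
    process = subprocess.Popen(['catdoc', '-w', filename],
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    # On Python 3 both values are bytes
    retval, erroroutput = process.communicate()
    if not erroroutput:
        return retval
    else:
        raise OSError("Executing the command caused an error: %s" % erroroutput)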

# similar doc_to_text_antiword()

The -w switch to catdoc turns off line wrapping, BTW.
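
The antiword counterpart mentioned in the comment could look roughly like the sketch below. It is not the answerer's code: the helper name run_text_extractor is hypothetical, and it assumes antiword is on the PATH and that "-w 0" puts each paragraph on one line, as the antiword man page describes.

```python
import subprocess


def run_text_extractor(argv):
    """Run an external extractor command and return its stdout as bytes.

    argv is a list such as ['antiword', 'file.doc']; no shell is involved,
    so unusual characters in file names need no quoting.
    """
    process = subprocess.Popen(argv,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    output, erroroutput = process.communicate()
    # Treat a non-zero exit status or anything on stderr as a failure.
    if process.returncode != 0 or erroroutput:
        raise OSError("Executing the command caused an error: %r" % erroroutput)
    return output


def doc_to_text_antiword(filename):
    # Assumption: antiword is installed; '-w 0' asks for unwrapped paragraphs.
    return run_text_extractor(['antiword', '-w', '0', filename])
```

Keeping the process handling in one helper means the catdoc and antiword variants differ only in their argument lists.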

Solution 3:

If all you want to do is extract text from Word files (.docx), it's possible to do it with Python alone. As Guy Starbuck wrote, you just need to unzip the file and then parse the XML. Inspired by python-docx, I have written a simple function to do this:

try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML
import zipfile


"""
Module that extracts text from MS XML Word documents (.docx).
(Inspired by python-docx <https://github.com/mikemaccana/python-docx>)
"""

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text in unicode.
    """
    document = zipfile.ZipFile(path)
    xml_content = document.read('word/document.xml')
    document.close()
    tree = XML(xml_content)

    paragraphs = []
    # iter() replaces getiterator(), which was removed in Python 3.9
    for paragraph in tree.iter(PARA):
        texts = [node.text
                 for node in paragraph.iter(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)
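
To try the unzip-and-parse approach without a real document at hand, here is a self-contained sketch. The names docx_text and make_sample_docx are made up for this example, and the in-memory archive contains only word/document.xml, so it is enough for this parser but is not a file Word itself would open.

```python
import io
import zipfile
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def docx_text(source):
    """Return paragraph text from a .docx path or file-like object."""
    with zipfile.ZipFile(source) as archive:
        tree = XML(archive.read('word/document.xml'))
    paragraphs = []
    for paragraph in tree.iter(W + 'p'):
        # A paragraph's text is split across runs; join the pieces back up.
        texts = [node.text for node in paragraph.iter(W + 't') if node.text]
        if texts:
            paragraphs.append(''.join(texts))
    return '\n\n'.join(paragraphs)


def make_sample_docx():
    """Build a stripped-down, in-memory archive holding only document.xml."""
    body = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        '<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('word/document.xml', body)
    buffer.seek(0)
    return buffer


print(docx_text(make_sample_docx()))
```

The two runs in the first paragraph are joined into a single line, and paragraphs are separated by a blank line, the same as in get_docx_text.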

Solution 4:

Using the OpenOffice API, Python, and Andrew Pitonyak's excellent online macro book, I managed to do this. Section 7.16.4 is the place to start.

One other tip to make it work without needing the screen at all is to use the Hidden property:

from com.sun.star.beans import PropertyValue

# desktop and docpath are assumed to be set up already
RO = PropertyValue('ReadOnly', 0, True, 0)
Hidden = PropertyValue('Hidden', 0, True, 0)
xDoc = desktop.loadComponentFromURL(docpath, "_blank", 0, (RO, Hidden))

Otherwise the document flicks up on the screen (probably on the webserver console) when you open it.
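
One detail that is easy to miss: loadComponentFromURL expects a URL, so docpath has to be a file:// URL and not a bare filesystem path. A minimal standard-library sketch follows; to_file_url is a hypothetical helper name, and when the uno module is importable its own uno.systemPathToFileUrl does a similar job.

```python
from pathlib import Path


def to_file_url(path):
    """Convert a filesystem path to a file:// URL (hypothetical helper)."""
    # as_uri() needs an absolute path; it percent-encodes spaces and the like.
    return Path(path).absolute().as_uri()


docpath = to_file_url('Hello world.doc')
print(docpath)
```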

Solution 5:

tika-python

A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.

Note: It also works well with PyInstaller.

Install with pip:

pip install tika

Sample:

#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])  # the file's metadata
print(parsed["content"])   # the file's text content

Link to official GitHub
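
In my experience the extracted text tends to come back padded with blank lines, and as far as I know parsed["content"] can be None when Tika finds no text. The sketch below tidies that up; clean_content is a hypothetical helper that works on any dict shaped like the one in the sample above and does not call Tika itself.

```python
import re


def clean_content(parsed):
    """Return tidied text from a dict shaped like tika's parser output."""
    content = parsed.get("content") or ""  # treat a missing or None value as empty
    # Strip trailing spaces on each line, then squeeze 3+ newlines down to 2.
    lines = [line.rstrip() for line in content.splitlines()]
    text = "\n".join(lines)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


# Stand-in for a real parser.from_file(...) result
sample = {"metadata": {}, "content": "\n\n\nTitle  \n\n\n\nBody text\n"}
print(clean_content(sample))
```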