How to extract text from an existing docx file using python-docx

I'm trying to use python-docx module (pip install python-docx) but it seems to be very confusing as in github repo test sample they are using opendocx function but in readthedocs they are using Document class. Even though they are only showing how to add text to a docx file, not reading existing one?

1st one (opendocx) is not working, may be deprecated. For second case I was trying to use:

from docx import Document

document = Document('test_doc.docx')
print(document.paragraphs)

It returned a list of <docx.text.Paragraph object at 0x... >

Then I did:

for p in document.paragraphs:
    print(p.text)

It returned all text but there were few thing missing. All URLs (CTRL+CLICK to go to URL) were not present in text on console.

What is the issue? Why URLs are missing?

How could I get complete text without iterating over loop (something like open().read())

you can try this

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

You can use python-docx2txt which is adapted from python-docx but can also extract text from links, headers and footers. It can also extract images.

How to extract text from an existing docx file using python-docx

Related

Recent Posts