How can I determine the page count of odt, doc, docx and other office documents from the CLI?

It's easy to find the page count of a PDF document from the command line:

pdfinfo sample.pdf | grep ^Pages:

... but I haven't been able to find a similar method for odt files and other office documents.

Is there a way to programmatically determine the page count of these documents?

Thanks for all the answers, everyone. With your help I was able to compile a list of commands that can extract the page count from almost all relevant office documents:

DOCX/PPTX

unzip -p 'sample.docx' docProps/app.xml | grep -oP '(?<=<Pages>)[0-9]+(?=</Pages>)'

unzip -p 'sample.pptx' docProps/app.xml | grep -oP '(?<=<Slides>)[0-9]+(?=</Slides>)'

Note: unzip can be installed with sudo apt-get install unzip.

DOC/PPT

wvSummary sample.doc | grep -oP '(?<=of Pages = )[0-9]+'

wvSummary sample.ppt | grep -oP '(?<=of Slides = )[0-9]+'

Note: wvSummary (case-sensitive!) is part of the wv package. Install it with sudo apt-get install wv.

ODT

unzip -p sample.odt meta.xml | grep -oP '(?<=page-count=")[0-9]+'

PDF

pdfinfo sample.pdf | grep -oP '^Pages:\s+\K[0-9]+'

Note: pdfinfo is part of poppler-utils and should come preinstalled on Ubuntu.

DJVU

djvused -e "n" sample.djvu

Note: djvused is part of the djvulibre-bin package and may be installed with sudo apt-get install djvulibre-bin.
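
If you need this for mixed file types, the commands above can be folded into one small shell function that dispatches on the file extension. This is only a sketch under a few assumptions: pagecount is a name I made up, it uses sed and awk instead of grep -P so it also runs where grep has no PCRE support, and it assumes the stored values are plain integers. Keep in mind that for DOCX, DOC and ODT the number is whatever the application wrote at the last save, so it can be stale or missing.

```shell
# Sketch: print the page (or slide) count stored in a document's metadata.
# pagecount is a hypothetical helper name, not an existing tool.
pagecount() {
    local file="$1"
    local ext
    # Lower-case the extension so SAMPLE.DOCX is handled like sample.docx.
    ext=$(printf '%s' "${file##*.}" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
        docx) unzip -p "$file" docProps/app.xml |
                  sed -n 's:.*<Pages>\([0-9][0-9]*\)</Pages>.*:\1:p' ;;
        pptx) unzip -p "$file" docProps/app.xml |
                  sed -n 's:.*<Slides>\([0-9][0-9]*\)</Slides>.*:\1:p' ;;
        doc)  wvSummary "$file" |
                  sed -n 's/.*of Pages = \([0-9][0-9]*\).*/\1/p' ;;
        ppt)  wvSummary "$file" |
                  sed -n 's/.*of Slides = \([0-9][0-9]*\).*/\1/p' ;;
        odt)  unzip -p "$file" meta.xml |
                  sed -n 's/.*page-count="\([0-9][0-9]*\)".*/\1/p' ;;
        pdf)  pdfinfo "$file" | awk '/^Pages:/ {print $2}' ;;
        djvu) djvused -e n "$file" ;;
        *)    echo "pagecount: unsupported file type: $file" >&2
              return 1 ;;
    esac
}
```

After sourcing the function, usage would look like pagecount sample.docx. An unsupported extension prints a message to stderr and returns a non-zero status.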

I couldn't find a tool that reads odt metadata the way pdfinfo reads PDFs, but a quick workaround is to convert the odt file to PDF, run pdfinfo on the result, and delete the PDF afterwards if you don't need it:

libreoffice --headless --invisible --convert-to pdf sample.odt
pdfinfo sample.pdf | grep ^Pages:
rm sample.pdf
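
The three commands above could be wrapped in a function so the temporary PDF never lands in your working directory. Treat this as a rough sketch: odt_pagecount is a hypothetical name, it assumes libreoffice and pdfinfo are on your PATH, and it relies on LibreOffice's --outdir option to write the PDF into a temporary directory.

```shell
# Sketch: count the pages of an odt file via a throwaway PDF.
# odt_pagecount is a hypothetical helper name, not an existing tool.
odt_pagecount() {
    local src="$1"
    local tmpdir pdf status
    tmpdir=$(mktemp -d) || return 1
    # --outdir keeps the temporary PDF out of the current directory.
    libreoffice --headless --convert-to pdf --outdir "$tmpdir" "$src" >/dev/null 2>&1
    # LibreOffice names the output after the input, with the extension swapped.
    pdf="$tmpdir/$(basename "${src%.*}").pdf"
    status=1
    if [ -f "$pdf" ]; then
        pdfinfo "$pdf" | awk '/^Pages:/ {print $2}'
        status=0
    fi
    # Remove the temporary directory whether or not the conversion worked.
    rm -rf "$tmpdir"
    return "$status"
}
```

Calling odt_pagecount sample.odt should print just the number. Unlike reading meta.xml, this reflects the layout LibreOffice actually renders, at the cost of a much slower conversion step.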

Hope this helps.
