Tools to extract text from PowerPoint .pptx files in Linux?

If you can process the files in bash, this one-liner will unpack all the text:

unzip -qc "$1" ppt/slides/slide*.xml | grep -oP '(?<=\<a:t\>).*?(?=\</a:t\>)'

Just pass it the .pptx file as $1. The text goes to standard output, so redirect it to a file if you want to keep it. The slides may not come out in presentation order, and there are no labels or anything, so you'll need a few more lines of script (and possibly a temp directory) to get a more readable listing.

Since you have AbiWord installed, you can just make a PDF first with LibreOffice:

libreoffice --headless --convert-to pdf filename.pptx

And then use AbiWord to convert the PDF to text:

abiword --to=txt filename.pdf

If you add .zip at the end of the filename (e.g. Presentation1.pptx.zip), you can then unzip the document and view its individual components.

The resulting zip file contains the directory \Presentation1.pptx.zip\ppt\slides. This directory contains .xml files named after each individual slide. If you open one of these files, you will see that any entered text is wrapped in <a:t> tags.

For example: <a:t>TEST</a:t>

This is as far as I can help you, but hopefully it's enough.

EDIT: As a side-note, the same process works for Word Documents as well. It's quite useful if you ever need to extract images from a Word Document.
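On Linux the image extraction can probably be done without renaming anything, since unzip reads .docx and .pptx files directly. A small sketch follows. The function name extract_media and the default output directory are made up for this example. It assumes Info-ZIP unzip and that embedded images sit under word/media/ (Word) or ppt/media/ (PowerPoint).

```shell
#!/usr/bin/env bash
# Sketch: copy embedded media out of a .docx or .pptx into a directory.
# extract_media and the default directory name "media" are hypothetical.
extract_media() {
    local doc=$1 outdir=${2:-media}
    mkdir -p "$outdir"
    # -j drops the directory structure stored in the archive,
    # -o overwrites existing files without prompting,
    # and the pattern covers both word/media/ and ppt/media/.
    unzip -qjo "$doc" '*/media/*' -d "$outdir"
}
```

For example, extract_media report.docx images would put image1.png and friends into ./images (report.docx is just a placeholder name).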
