Tools to extract text from PowerPoint pptx in Linux?
If you can process the files in bash, this one-liner will unpack all the text:
unzip -qc "$1" 'ppt/slides/slide*.xml' | grep -oP '(?<=\<a:t\>).*?(?=\</a:t\>)'
Just pass it the pptx file as $1. It prints the text to standard output, so redirect it (for example by appending > "$2") if you want it in a file. The slides will not necessarily come out in presentation order, and there will be no labels or anything, so you'll need a few more lines of script and a temp directory to get a more readable listing.
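A rough sketch of those "few more lines", in case it helps. It reads each slide straight out of the archive with unzip -p, so it gets by without a temp directory. The function names sort_slides and pptx_text are made up for this example, and it assumes GNU grep (for -P). It sorts by the number in the slide file name, which usually matches the presentation order but is not guaranteed to: the real order is recorded in ppt/presentation.xml.

```shell
#!/usr/bin/env bash

# Read archive member names on stdin (one per line), keep only the slide
# parts, and print them sorted by slide number, so that slide10.xml comes
# after slide9.xml instead of after slide1.xml.
sort_slides() {
    sed -n 's|^ppt/slides/slide\([0-9][0-9]*\)\.xml$|\1 &|p' |
        sort -n |
        cut -d' ' -f2
}

# Print the text of every slide in a .pptx, with a label before each slide.
pptx_text() {
    local pptx=$1 part n
    # unzip -Z1 lists the member names only, one per line
    unzip -Z1 "$pptx" | sort_slides | while IFS= read -r part; do
        n=${part##*slide}   # "ppt/slides/slide12.xml" -> "12.xml"
        n=${n%.xml}         # "12.xml" -> "12"
        printf '== Slide %s ==\n' "$n"
        # -p writes the member to stdout without any extra messages
        unzip -p "$pptx" "$part" | grep -oP '(?<=<a:t>)[^<]*(?=</a:t>)'
    done
}
```

After sourcing the file, something like pptx_text Presentation1.pptx > Presentation1.txt should give one labelled block per slide.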
Since you have AbiWord installed, you can just make a PDF first with LibreOffice:
libreoffice --headless --convert-to pdf filename.pptx
And then use AbiWord to convert the PDF to txt:
abiword --to=txt filename.pdf
If you add .zip at the end of the filename (e.g. Presentation1.pptx.zip), you can then unzip the document and view its individual components.
In the resulting zip file there is the following directory: \Presentation1.pptx.zip\ppt\slides. This contains .xml files named after each individual slide. If you open one of these files you will see that any entered text is wrapped in <a:t> tags.
For example: <a:t>TEST</a:t>
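On Linux the rename is not strictly needed, since unzip will read the .pptx as it is. Below is a small sketch for poking around from a shell. The function names are invented for this example and it assumes GNU grep and sed. One thing to watch for: the text inside <a:t> is XML-escaped, so an ampersand shows up as &amp; until you decode it.

```shell
#!/usr/bin/env bash

# Decode the five predefined XML entities. &amp; is handled last so that
# an escaped "&amp;lt;" ends up as the literal text "&lt;", not "<".
xml_unescape() {
    sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&quot;/"/g' \
        -e "s/&apos;/'/g" -e 's/&amp;/\&/g'
}

# Read slide XML on stdin and print the content of each <a:t> element,
# one per line, with entities decoded.
text_runs() {
    grep -oP '(?<=<a:t>)[^<]*(?=</a:t>)' | xml_unescape
}

# List the slide parts of a .pptx, then show the text of one slide.
# $1 is the .pptx file, $2 the slide number (defaults to 1).
inspect_slide() {
    unzip -l "$1" 'ppt/slides/*.xml'
    unzip -p "$1" "ppt/slides/slide${2:-1}.xml" | text_runs
}
```

For instance, inspect_slide Presentation1.pptx 3 would list the slide files and then print the text found in slide3.xml.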
This is as far as I can help you, but hopefully it's enough.
EDIT: As a side-note, the same process works for Word Documents as well. It's quite useful if you ever need to extract images from a Word Document.
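For the image case, here is a minimal sketch. It assumes the usual layout, where embedded pictures sit under word/media in a .docx and under ppt/media in a .pptx. The names media_prefix and extract_media are just placeholders for this example.

```shell
#!/usr/bin/env bash

# Map a document file name to the folder inside the archive that holds
# its embedded media. Only .docx and .pptx are handled in this sketch.
media_prefix() {
    case "$1" in
        *.docx|*.DOCX) echo word/media ;;
        *.pptx|*.PPTX) echo ppt/media ;;
        *) echo "unsupported file type: $1" >&2; return 1 ;;
    esac
}

# Copy the embedded media files into a directory.
# $1 is the document, $2 the output directory (defaults to ./media).
extract_media() {
    local doc=$1 out=${2:-media} prefix
    prefix=$(media_prefix "$doc") || return 1
    # -j drops the internal folder structure, -d sets the target directory
    unzip -j "$doc" "$prefix/*" -d "$out"
}
```

So extract_media report.docx pics should leave the pictures from report.docx in ./pics.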