How to extract images from a Word document on Linux
Since docx files are zip files you can unzip the docx file and then pick out the image files.
I have no Microsoft Office to test with, so I downloaded some random docx files from the internet. It seems that the images are always stored in a word/media directory in the archive.
This command will extract all files from the media directory in the archive:
unzip foo.docx "word/media/*"
This command will extract only *.jpeg files:
unzip foo.docx "*.jpeg"
Note that you have to specify "*.jpg" if the files are saved as jpg instead of jpeg. I assume that images can also be stored in other formats. I do not know whether images can be stored anywhere other than the word/media directory. You can use unzip -l to list the contents of the archive.
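If you would rather not guess the extension or the directory, the listing step can be scripted. The following is a minimal sketch using Python's standard zipfile module; the function name list_images and the set of extensions are my own assumptions, not anything defined by the docx format. It reports every archive member that looks like an image, wherever it is stored.

```python
import zipfile
from pathlib import PurePosixPath

# Extensions treated as images here. This set is an assumption; extend it as needed.
IMAGE_EXTENSIONS = {".jpeg", ".jpg", ".png", ".gif", ".bmp", ".tiff", ".emf", ".wmf"}


def list_images(docx_path):
    """Return the names of archive members whose extension looks like an image."""
    with zipfile.ZipFile(docx_path) as archive:
        return [
            name
            for name in archive.namelist()
            # Compare case-insensitively so that e.g. ".PNG" also matches.
            if PurePosixPath(name).suffix.lower() in IMAGE_EXTENSIONS
        ]
```

Calling list_images("foo.docx") returns the matching member names in archive order, which you can then pass to unzip or inspect by hand.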
I wrote an open source Python program called ofc_media that basically does the unzipping mentioned in lesmana's answer, but automates the search process a bit. It also works on OpenDocument format documents, can limit the extraction to certain file extensions, etc.
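To give an idea of what such automation can look like, here is a minimal sketch using only Python's standard library. It is not taken from ofc_media; the function name extract_media, its parameters, and the default extensions are hypothetical and chosen for illustration. Because it filters by file extension rather than by directory, it should work on any zip-based document, including OpenDocument files.

```python
import shutil
import zipfile
from pathlib import Path, PurePosixPath


def extract_media(document_path, output_dir, extensions=(".jpeg", ".jpg", ".png")):
    """Copy archive members with a matching extension into output_dir.

    Returns the list of written paths. The directory structure inside the
    archive is dropped, so members with the same file name overwrite each other.
    """
    wanted = {ext.lower() for ext in extensions}
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(document_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            member = PurePosixPath(info.filename)
            if member.suffix.lower() not in wanted:
                continue
            target = out / member.name
            # Stream the member to disk instead of calling extract(), so the
            # archive's internal directories are not recreated.
            with archive.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

For example, extract_media("foo.docx", "images", extensions=(".png",)) would write only the PNG files into an images directory.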