Identifying .doc/.docx files that contain images
I'm moving my notes to evernote. To this end I need to convert .doc/.docx files to rtf. The reason for this is that I have a script to import rtf into evernote. However, some of my .doc/.docx files contain images.
Is there any way to identify which .doc/.docx files contain images without viewing them all? I have thousands. This way I can simply open the few that have images and copy/paste the entire content straight into evernote.
Should say that I'm using OS X 10.6.8.
Solution 1:
Where do .doc files store images?
Word doc
files are actually zipped and then put into a container format. They store media somewhere in this compiled file format, probably right after the doc
format's header. After the image data, there's your real document as a zip-compatible folder.
So when you try to unzip a doc
file, you get an excess number of bytes at the beginning. These are your images (plus the format header). You could now attempt to unzip
the file and check the excess amount of bytes.
charon:test werner$ unzip -c images.doc > /dev/null
warning [images.doc]: 47166 extra bytes at beginning or within zipfile
charon:test werner$ unzip -c noimages.doc > /dev/null
warning [noimages2.doc]: 6060 extra bytes at beginning or within zipfile
Through testing, I found the header of "plaintext" Word documents to be 6060 bytes large (some are a bit larger though). We can try to exploit it for determining whether there's an image inside a document. Let's just say 8000 bytes — since real images will definitely have more than a few KB.
What about .docx files?
With the Office 2007 format (docx
), this is much easier. These are actual zipped files, and any Word file that contains embedded media of any kind (images, video) will include the file.docx/word/media
directory. So, we just need to unzip the docx
files and check if that directory exists.
A script to check for images
-
Create a new empty file, call it
docx-images.rb
, and paste the following content:#!/usr/bin/env ruby require 'open3' TEMPDIR = "/tmp/word/" # check for docx files Dir.glob("**/*.docx").each do |file| system("rm -rf '#{TEMPDIR}'") system("unzip '#{file}' -d #{TEMPDIR} > /dev/null") if File.directory?("#{TEMPDIR}/word/media/") puts file end end # check for doc files Dir.glob("**/*.doc").each do |file| stdin, stdout, stderr = Open3.popen3("unzip -c '#{file}' > /dev/null") info = stderr.readlines[0] info = info.gsub(" extra bytes at beginning or within zipfile", "").gsub(/warning\s\[.*\]:\s+/, "") if info.to_i > 8000 # assume a little more than usual header size puts file end end
Save it somewhere, preferably in a folder where you want to begin your search for
docx
files from, maybe yourDocuments
folder.Now, open up Terminal.app, and use
cd ~/Documents
to go there.Type
ruby docx-images.rb
, and it will recursively scan yourDocuments
folder fordocx
anddoc
files. It'll unzip the former to/tmp/word
, and check if they contain embedded media. The latter are just unzipped to/dev/null
, thus leaving no traces behind.You'll end up with a list of those with embedded media.
Proof
To prove that this works, I created four files. One with images, one without images – both as doc
and docx
:
Then, running the script:
charon:test werner$ ruby docx-images.rb
images.docx
images.doc
Clearly, the script could be improved to check for actual images in that media
folder, but it's unlikely it exists unless the file really contains any media. The same goes for the "6060" bytes check. It's a hack, but it works for me.
Of course, the script depends on the implementation of unzip
on the respective system, but it works for the OS X version.