Identifying .doc/.docx files that contain images

I'm moving my notes to evernote. To this end I need to convert .doc/.docx files to rtf. The reason for this is that I have a script to import rtf into evernote. However, some of my .doc/.docx files contain images.

Is there any way to identify which .doc/.docx files contain images without viewing them all? I have thousands. This way I can simply open the few that have images and copy/paste the entire content straight into evernote.

Should say that I'm using OS X 10.6.8.


Solution 1:

Where do .doc files store images?

Word doc files are actually zipped and then put into a container format. They store media somewhere in this compiled file format, probably right after the doc format's header. After the image data, there's your real document as a zip-compatible folder.

file layout

So when you try to unzip a doc file, you get an excess number of bytes at the beginning. These are your images (plus the format header). You could now attempt to unzip the file and check the excess amount of bytes.

charon:test werner$ unzip -c images.doc > /dev/null
warning [images.doc]:  47166 extra bytes at beginning or within zipfile

charon:test werner$ unzip -c noimages.doc > /dev/null
warning [noimages2.doc]:  6060 extra bytes at beginning or within zipfile

Through testing, I found the header of "plaintext" Word documents to be 6060 bytes large (some are a bit larger though). We can try to exploit it for determining whether there's an image inside a document. Let's just say 8000 bytes — since real images will definitely have more than a few KB.


What about .docx files?

With the Office 2007 format (docx), this is much easier. These are actual zipped files, and any Word file that contains embedded media of any kind (images, video) will include the file.docx/word/media directory. So, we just need to unzip the docx files and check if that directory exists.


A script to check for images

  • Create a new empty file, call it docx-images.rb, and paste the following content:

    #!/usr/bin/env ruby
    require 'open3'
    TEMPDIR = "/tmp/word/"
    
    # check for docx files
    Dir.glob("**/*.docx").each do |file|
      system("rm -rf '#{TEMPDIR}'")
      system("unzip '#{file}' -d #{TEMPDIR} > /dev/null")
      if File.directory?("#{TEMPDIR}/word/media/")
        puts file
      end
    end
    
    # check for doc files
    Dir.glob("**/*.doc").each do |file|
      stdin, stdout, stderr = Open3.popen3("unzip -c '#{file}' > /dev/null")
      info = stderr.readlines[0]
      info = info.gsub(" extra bytes at beginning or within zipfile", "").gsub(/warning\s\[.*\]:\s+/, "")
      if info.to_i > 8000 # assume a little more than usual header size
        puts file
      end
    end
    
  • Save it somewhere, preferably in a folder where you want to begin your search for docx files from, maybe your Documents folder.

  • Now, open up Terminal.app, and use cd ~/Documents to go there.

  • Type ruby docx-images.rb, and it will recursively scan your Documents folder for docx and doc files. It'll unzip the former to /tmp/word, and check if they contain embedded media. The latter are just unzipped to /dev/null, thus leaving no traces behind.

  • You'll end up with a list of those with embedded media.


Proof

To prove that this works, I created four files. One with images, one without images – both as doc and docx:

proof

Then, running the script:

charon:test werner$ ruby docx-images.rb 
images.docx
images.doc

Clearly, the script could be improved to check for actual images in that media folder, but it's unlikely it exists unless the file really contains any media. The same goes for the "6060" bytes check. It's a hack, but it works for me.

Of course, the script depends on the implementation of unzip on the respective system, but it works for the OS X version.