Find and search inside all compressed files

I'd like to scan my hard drive for all compressed archives (zip, gzip, bzip, and others) and search their contents for certain file types, such as images. Antivirus programs do this, so I believe there should be a way.


The simplest approach would be to list the contents of the archive and look for files of the relevant extension. For example, with a zip file:

$ zip -sf foo.zip | grep -iE '\.png$|\.jpg$'
  file1.jpg
  file1.png
  file2.jpg
  file2.png

The -sf option tells zip to list the files contained in an archive. The grep then keeps only the lines that end ($) in .png or .jpg. The -E option enables extended regular expressions, so | can be used as OR, and -i makes the matching case-insensitive.
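To see what the pattern does without an archive at hand, you can feed the same grep a few made-up names. This is only an illustration: the file names and the match_images helper are invented for the example, and the grouped form \.(png|jpg)$ is just an equivalent spelling of the pattern above.

```shell
# Hypothetical helper: keep only names ending in .png or .jpg, ignoring case.
match_images() {
    grep -iE '\.(png|jpg)$'
}

# Invented sample names; only the first two end in a matching extension.
printf '%s\n' 'a.png' 'B.JPG' 'notes.txt' 'png.txt' 'photo.jpg.bak' | match_images
```

Only a.png and B.JPG are printed. The last name is skipped because the extension is not at the end of the line.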

However, each archive tool has a different command to list the contents. I've written a script that can deal with most of the more popular ones. If you save that script as list_compressed.sh, you could then run:

list_compressed.sh foo.zip | grep -iE '\.png$|\.jpg$|\.jpeg$|\.gif$|\.tif$|\.tiff$'

That would show you the most common image types. Note that this approach assumes that the file type can be determined by the file's extension. It will not find image files that don't have an extension and it will not recognize files with the wrong extension. There is no way to deal with that without actually extracting the files from the archive and running file on each of them.
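For completeness, here is a rough sketch of that extract-and-check approach. It is not part of the script mentioned above. It assumes a tar archive (plain or compressed), a tar that detects the compression on its own (GNU and BSD tar both do), and the file utility with its --mime-type option. The function name images_by_content is made up for this example.

```shell
# Sketch: print the members of a tar archive whose content looks like an image,
# whatever their names are. Assumes tar and file(1) are installed.
images_by_content() {
    local arch=$1 tmp
    tmp=$(mktemp -d) || return 1
    # Extract into a throwaway directory so nothing lands in the current one.
    tar -xf "$arch" -C "$tmp" || { rm -rf "$tmp"; return 1; }
    # file prints "name: type"; keep the names whose MIME type starts with image/.
    (cd "$tmp" && find . -type f -exec file --mime-type {} +) |
        sed -n 's|^\./\(.*\): *image/.*$|\1|p'
    rm -rf "$tmp"
}
```

You would call it as images_by_content foo.tgz. Keep in mind that this unpacks every archive in full, so it is much slower than just listing names, and you should only do it with archives you trust.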


If you want to find all archives that contain image files on your hard drive, combine the above with find:

find / \( -name '*.gz' -o -name '*.tgz' -o -name '*.zip' \) -print0 |
    while IFS= read -r -d '' arch; do
        list_compressed.sh "$arch" |
            grep -qiE '\.png$|\.jpg$|\.jpeg$|\.gif$|\.tif$|\.tiff$' &&
                echo "$arch contains image(s)"
    done

The find command searches for all .gz, .tgz, or .zip files (you can add as many extensions as you like) and each one is passed to my script. The parentheses group the -name tests; without them, -print0 would apply only to the last test and only .zip files would be printed. The -q option suppresses grep's normal output, so grep itself prints nothing. Because of the &&, echo prints the archive's name only if grep found a match.
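The loop above depends on list_compressed.sh, which is not reproduced here. If you just want something to experiment with, the following is a minimal stand-in and not the actual script. It assumes that tar and unzip are installed and only covers the three extensions searched for above, plus a few tar variants. The function name list_archive is invented for this sketch.

```shell
# Hypothetical stand-in for list_compressed.sh: print one member name per line.
list_archive() {
    case "$1" in
        *.tar|*.tar.gz|*.tgz|*.tar.bz2|*.tar.xz)
            # GNU and BSD tar detect the compression themselves when reading.
            tar -tf "$1" ;;
        *.zip)
            # zipinfo mode, one name per line, no header or totals.
            unzip -Z1 "$1" ;;
        *.gz)
            # A plain .gz holds a single file: report its name minus the suffix.
            basename "${1%.gz}" ;;
        *)
            echo "unsupported archive: $1" >&2
            return 1 ;;
    esac
}
```

The order of the patterns matters: *.tar.gz has to come before *.gz, because case takes the first pattern that matches. To use it in the loop, put the function above the find command and call list_archive "$arch" instead of the script.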


Not as advanced as terdon's answer, but this will do:

Save the following code as finda.sh (or any other name you like) in the folder where you keep your scripts:

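for file in *.*; do
    # 7z l -slt prints a "Path = ..." line per entry and fails on non-archives.
    if listing=$(7z l -slt "$file" 2>/dev/null); then
        echo "$file:"
        # Keep only the path lines; no temporary log files are needed.
        printf '%s\n' "$listing" | grep '^Path = '
    fi
done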

Then run it in the directory that contains your archives. The output looks like this:

./finda.sh 
one.7z:
Path = one/abradabra.png
Path = one/birb.png
three.rar:
Path = three/blah.png
Path = three/qwa0g.jpg
two.zip:
Path = two/whut.png
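If you only care about images, the Path lines can be filtered further. The helper below is a small sketch, not part of finda.sh. It reads 7z l -slt output on standard input, and both the name image_paths and the list of extensions are my own assumptions.

```shell
# Hypothetical helper: read `7z l -slt` output on stdin and print only the
# member paths that end in a common image extension.
image_paths() {
    sed -n 's/^Path = //p' | grep -iE '\.(png|jpe?g|gif|tiff?)$'
}

# Usage (requires 7z):
#   7z l -slt one.7z | image_paths
```

The sed step strips the "Path = " prefix so that the grep can anchor on the end of the name.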