deduplicating and indexing directories of images across 150 Linux machines

I have a client with 150 Linux servers spread across various cloud services and physical data centres. Much of this infrastructure comes from acquired projects/teams and pre-existing servers/installs.

The client's business is largely image processing, and many of the servers have large SAN or local disk arrays holding millions of JPEG/PNG files.

There is a configuration management agent on each box, so I can see that many disks are 100% full, some are nearly empty, and there is a lot of duplicated data.

The client now has access to a CDN, but at the moment just enumerating what is possible is a daunting task.

Are there any tools to create useful indexes of all this data?

I see tools like GlusterFS and Hadoop HDFS for managing distributed filesystems.

I am wondering whether I can use the indexing tools of these systems without actually implementing the underlying volume management tools.

What should the starting point be for generating an index of potential de-duplication candidates?


Solution 1:

The easiest way I have found to locate duplicate files across a bunch of systems is to create a list of files with their MD5 sums on each system, combine those lists into one file, and then use sort plus an AWK script to find the duplicates, as follows.

First, run this on each of the systems, replacing the path as appropriate:

#!/bin/sh
# HOSTNAME is not guaranteed to be set under plain sh, so fall back to hostname(1)
HOSTNAME="${HOSTNAME:-$(hostname)}"
# Hash every file under the path and tag each line with this host's name.
# Note: filenames containing tabs or newlines will confuse the later sort/awk step.
find /path/to/files -type f -exec md5sum {} \; |
while read -r md5 filename
do
    printf '%s\t%s\t%s\n' "${HOSTNAME}" "${md5}" "${filename}"
done >"/var/tmp/${HOSTNAME}.filelist"

This will produce a file /var/tmp/HOSTNAME.filelist on each host, which you will have to copy to a central location.
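The configuration management agent may well be able to push the first script out and pull the results back for you. Failing that, here is a minimal sketch for collecting the filelists over SSH (hosts.txt and working SSH keys from the collection host are assumptions, not part of the original setup):

#!/bin/sh
# hosts.txt holds one hostname per line, matching what $HOSTNAME expands to on each box.
mkdir -p /var/tmp/filelists
while read -r host
do
    scp "${host}:/var/tmp/${host}.filelist" /var/tmp/filelists/
done <hosts.txt

Once you have gathered all the filelists in one directory, you can then run the following: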

#!/bin/sh
# Force byte-wise collation so sort and awk agree on ordering
export LC_ALL=C
# Sort all the filelists on the MD5 column (field 2); the old "+1 -2" key syntax
# is obsolete, and $'\t' is not portable under /bin/sh, hence the printf.
cat *.filelist | sort -t "$(printf '\t')" -k2,2 |
awk '
BEGIN {
    FS = "\t"
    dup_count = 0
    old_md5 = ""
}

{
    if ($2 == old_md5) {
        # First duplicate of a group: print the line it duplicates as the group header
        if (dup_count == 0) {
            printf("\n%s\n", old_line)
        }
        printf("%s\n", $0)
        dup_count++
    }
    else {
        dup_count = 0
    }
    old_md5 = $2
    old_line = $0
}'

This should produce output grouped into blocks of files whose contents are duplicated, either within the same host or across hosts; redirect it to a file if you want to keep the results.
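To get a quick sense of the scale before acting on anything, here is a small sketch that counts what was found. It assumes you redirected the output above to a file called dupes.txt (an assumed name):

# Each group starts with a blank line; every member line has three tab-separated fields
awk -F'\t' '/^$/ { groups++ } NF == 3 { files++ }
    END { printf("%d files in %d duplicate groups\n", files, groups) }' dupes.txt

MD5 is more than good enough for flagging candidates, but it is worth confirming a pair with cmp before actually deleting anything.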

Oh, and as an alternative to the first script (which gets run on every host), check whether the backup system in use can give you something similar from its backup report (something that includes an MD5 and a filename, at least).
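If the backup report can be exported as plain text, reshaping it into the same host/md5/filename format is simple enough. A sketch, assuming (purely an assumption) that the export contains md5sum-style lines of the form "<md5>  <path>" in a file called backup-report.txt:

#!/bin/sh
# backup-report.txt is a hypothetical export; adjust the field handling to whatever
# format the real backup report actually uses.
awk -v host="$(hostname)" '
NF >= 2 {
    md5 = $1
    sub(/^[^ ]+ +/, "")    # strip the md5 and separating spaces, keeping the full path
    printf("%s\t%s\t%s\n", host, md5, $0)
}' backup-report.txt >"/var/tmp/$(hostname).filelist"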