Find file duplicates and convert them into links [WINDOWS] [closed]

My users tend to save tons of duplicate files, which consumes more and more space and generates hardware and archiving costs.

I'm thinking of creating a scheduled job to:

  1. find duplicate files (comparing MD5 checksums, not just filename / size)
  2. keep only one original file
  3. replace the other redundant copies with a link (shortcut) to the original file from point 2

Any idea how to achieve that?

Script / tool / tips?
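
For illustration, here's roughly what I mean for step 1 - a Python sketch that groups files by MD5 checksum (the root path is just a placeholder):

import hashlib
import os
from collections import defaultdict

root_dir = r'C:\shared'  # placeholder; the real share would go here

def md5sum(path, chunk_size=1 << 20):
    # hash in 1 MiB chunks so large files don't exhaust memory
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

dupes = defaultdict(list)
for dirpath, _, filenames in os.walk(root_dir):
    for name in filenames:
        full_path = os.path.join(dirpath, name)
        dupes[md5sum(full_path)].append(full_path)

for checksum, paths in dupes.items():
    if len(paths) > 1:
        print(checksum, paths)  # candidates for steps 2 and 3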

EDIT 28.10.2021

In the meantime I've found finddupe: https://www.sentex.ca/~mwandel/finddupe/

It can create hardlinks to the original files. I've tried it - it correctly shows what is duplicated and seems to create the hardlinks - but... I can't see any difference in the HDD usage stats afterwards.

Why is that? Could it be that Windows calculates free space incorrectly?
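
To check whether the hardlinks were really created, I suppose one could compare the file index that os.stat() reports for two supposedly linked paths (the paths below are just examples):

import os

a = os.stat(r'C:\shared\file.bin')        # the kept original
b = os.stat(r'C:\shared\copy\file.bin')   # a path finddupe reported as linked

# on NTFS, hardlinked names share the same device/index pair,
# and st_nlink shows how many names point at the same data
print(a.st_dev == b.st_dev and a.st_ino == b.st_ino)
print(a.st_nlink)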


I made a small Python script that answers your needs.

It uses fdupes -r <dir> to get all duplicate files (even ones with different names). It then iterates over the output, deletes each duplicated file, and creates a symbolic link in its place.
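
For reference, fdupes prints groups of duplicate paths separated by blank lines, which is why the script below splits on '\n\n'. The output looks roughly like this (paths are made up):

/home/user/directory/a.txt
/home/user/directory/backup/a.txt

/home/user/directory/photo.jpg
/home/user/directory/photo (1).jpg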

I've left the two os.system() lines commented out; uncomment them to enable the actual modifications.

You may also want to pass parameters to this script (like a path or other options); I'll let you explore that need :)

import os
import shlex

root_dir = '/home/user/directory'

# fdupes -r prints groups of duplicate files separated by blank lines
blocks_of_dup_files = os.popen('fdupes -r ' + shlex.quote(root_dir)).read().split('\n\n')

# drop the empty trailing block caused by the final newline
if blocks_of_dup_files[-1] == '':
    blocks_of_dup_files.pop()


for files in blocks_of_dup_files:
    files = files.split('\n')
    kept_file = files.pop()  # keep the last file of each group as the original
    for file in files:
        # quote paths so names with spaces don't break the shell commands
        print('rm -f ' + shlex.quote(file))
        print('ln -s ' + shlex.quote(kept_file) + ' ' + shlex.quote(file))

        #os.system('rm -f ' + shlex.quote(file))
        #os.system('ln -s ' + shlex.quote(kept_file) + ' ' + shlex.quote(file))


For Windows, I authored https://github.com/Caspeco/BlobBackup/tree/master/DuplicateFinder

You will need Visual Studio to compile the code. Note, though, that with links, if one "file" is modified, then all of them are (or rather, there is only one file). That could be unwanted behaviour for users.
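
A minimal Python sketch of that shared-content behaviour, assuming NTFS and made-up paths - after linking, a write through either name is visible through the other:

import os

original = r'C:\data\report.txt'        # the file that is kept
duplicate = r'C:\data\report_copy.txt'  # a verified duplicate

os.remove(duplicate)    # delete the redundant copy...
os.link(original, duplicate)  # ...and recreate it as a hard link

# both names now refer to the same data on disk
with open(duplicate, 'a') as f:
    f.write('appended via the second name\n')

with open(original) as f:
    print(f.read())  # the appended line shows up here too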