Solution 1:

If you haven't done so already, you may be able to work around your problem by cramming more RAM into the machine that's running the duplicate detector (assuming it isn't already maxed out). Alternatively, you can split the remaining files into subsets and scan pairs of those subsets until you've tried every combination. In the long run, though, this may not be a problem best tackled with a duplicate detector program that you have to run periodically.
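For what it's worth, here is a rough Python sketch of that subset-pair approach, assuming your file list lives in a plain text file with one path per line; the file name all_files.txt, the CHUNKS count, and the use of MD5 are all placeholder assumptions, and in practice you would more likely just point your existing duplicate finder at two subsets at a time.

```python
# Rough sketch of the subset-pair work-around.  Assumes the full file list is
# in all_files.txt (one path per line) and that an MD5 of the whole contents
# is an acceptable duplicate test -- both are placeholder assumptions.
import hashlib
from itertools import combinations_with_replacement

CHUNKS = 8  # number of subsets; raise this if memory is still too tight

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

with open("all_files.txt") as f:
    paths = [line.rstrip("\n") for line in f if line.strip()]

subsets = [paths[i::CHUNKS] for i in range(CHUNKS)]  # round-robin split

# Compare every pair of subsets (including each subset against itself), so
# only one subset's worth of checksums is held in memory at a time.
for a, b in combinations_with_replacement(range(CHUNKS), 2):
    seen = {}  # checksum -> first path seen in subset a
    for p in subsets[a]:
        seen.setdefault(md5_of(p), p)
    for p in subsets[b]:
        digest = md5_of(p)
        if digest in seen and seen[digest] != p:
            print(f"possible duplicate: {p} == {seen[digest]}")
```

Memory stays bounded to one subset's worth of checksums, at the price of re-hashing each file once for every pair it takes part in.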

You should look into a file server with data deduplication. In a nutshell, such a server automatically stores only one physical copy of each file, with each "copy" hardlinked to that single physical file. (Some systems actually deduplicate at the block level rather than the file level, but the concept is the same.)
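To make the hardlink idea concrete, here is a minimal sketch of file-level dedup, assuming everything sits on a single filesystem (hardlinks can't cross filesystems) and that a matching SHA-256 digest is good enough to call two files identical; this is only an illustration of the concept, not what ZFS or a dedup appliance actually does under the hood.

```python
# Minimal illustration of file-level dedup via hardlinks.  Assumes everything
# under the given root lives on one filesystem (hardlinks cannot cross
# filesystems) and that a matching SHA-256 digest means the files are identical.
import hashlib
import os
import sys

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def dedup(root):
    first_seen = {}  # digest -> canonical (first-seen) path
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and anything that isn't a regular file
            digest = sha256_of(path)
            if digest in first_seen and not os.path.samefile(path, first_seen[digest]):
                os.unlink(path)                    # drop the redundant copy
                os.link(first_seen[digest], path)  # replace it with a hardlink
            else:
                first_seen.setdefault(digest, path)

if __name__ == "__main__":
    dedup(sys.argv[1])
```

Bear in mind that after hardlinking, editing one "copy" in place changes all of them; real dedup filesystems avoid that by deduplicating transparently beneath the file layer, typically with copy-on-write semantics.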

Newer advanced filesystems such as ZFS, BTRFS, and lessfs have dedup support, as does the OpenDedup fileserver appliance OS. One or more of those filesystems might already be available on your Linux servers. Windows Storage Server also has dedup. If you have some money to throw at the problem, some commercial SAN/NAS solutions have dedup capability.

Keep in mind, though, that dedup will not necessarily help with small, slightly modified versions of the same files. If people are littering your servers with multiple versions of their files, try to get them to organize their files better and to use a version control system--which saves only the original file plus a chain of incremental differences.

Update:

64 GB should be sufficient for caching at least 1 billion checksum-plus-path entries in physical memory, assuming 128-bit checksums and average metadata (filesystem path, file size, date, etc.) no longer than 52 bytes. Of course, the OS will start paging at some point, but the program shouldn't crash--assuming, that is, that the duplicate file finder itself is a 64-bit application.

If your duplicate file finder is only a 32-bit program (or if it's a script running on a 32-bit interpreter), the number of files you can process could be far lower if PAE is not enabled: more on the order of 63 million (4 GB / (128 bits + 52 bytes)), under the same assumptions as before. If you have more than 63 million files, if you use a larger checksum, or if the average metadata cached by the program exceeds 52 bytes, then you probably just need to find a 64-bit duplicate file finder. In addition to the programs mgorven suggested (which I assume are available in 64-bit builds, or at least can easily be recompiled), there is a 64-bit version of DupFiles available for Windows.
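A quick back-of-the-envelope check of both figures, assuming a 16-byte (128-bit) checksum plus about 52 bytes of cached metadata per file and ignoring the finder's own per-entry overhead:

```python
# Back-of-the-envelope check of the figures above: a 16-byte (128-bit)
# checksum plus ~52 bytes of cached metadata is roughly 68 bytes per file.
ENTRY_BYTES = 16 + 52                     # 68 bytes per cached entry

print(64 * 2**30 // ENTRY_BYTES)          # 64 GiB -> ~1.01 billion entries
print(4 * 2**30 // ENTRY_BYTES)           # 4 GiB  -> ~63 million entries
```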

Solution 2:

Have you tried rdfind, fdupes and findup from fslint?