Create an md5sum of each file; duplicate md5sums suggest (but don't guarantee) duplicate files.
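
As a rough sketch of that idea in Python (the function names `file_digest` and `find_duplicate_candidates` are made up for illustration, and nothing here is tuned for speed), you could hash every file under a directory and group the paths by digest. Any group with more than one path is a set of candidate duplicates:

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_candidates(root, algorithm="md5"):
    """Return lists of paths under root that share the same digest."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are hashed.
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[file_digest(path, algorithm)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Since matching digests only suggest identical content, it is worth comparing the candidates byte for byte before deleting anything.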


You could use dupemerge to turn the identical files into hardlinks. It'll take a very long time on a large file set, though. Comparing SHA (or MD5) hashes of the files will almost certainly be faster, but you'll have to do more of the legwork of finding the duplicates yourself. The probability of an accidental collision is so low that in practice you can ignore it. (In fact, many deduplication products already rely on this.)
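
If you go the hash route and still want hardlinks at the end, the "legwork" could look something like the sketch below. This is not how dupemerge works internally (I haven't checked); it is just one way to do it. The function name `hardlink_duplicates` and the `.dedup-tmp` suffix are invented for the example, and it assumes all the paths live on the same filesystem, since hard links can't cross filesystems. It does a byte-for-byte comparison before touching anything, so a hash collision can't cost you a file:

```python
import filecmp
import os


def hardlink_duplicates(paths):
    """Replace verified duplicates of paths[0] with hard links to it.

    Returns the list of paths that were replaced.
    """
    keep, replaced = paths[0], []
    for dup in paths[1:]:
        if os.path.samefile(keep, dup):
            continue  # already the same inode, nothing to do
        if not filecmp.cmp(keep, dup, shallow=False):
            continue  # contents differ despite expectations: leave it alone
        tmp = dup + ".dedup-tmp"  # hypothetical temporary name
        os.link(keep, tmp)        # create the new link first...
        os.replace(tmp, dup)      # ...then swap it into place over the duplicate
        replaced.append(dup)
    return replaced
```

Keep in mind that once files are hardlinked, editing one in place changes all of them, which may or may not be what you want for a backup set.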

Your best bet for dealing with photos and music is to get tools tailored to finding duplicates of those items in particular, especially since the files may no longer be identical at a binary level once things like tagging, cropping, or encoding differences come into play. You'll want tools that can find photos that "look" the same and music that "sounds" the same even if minor adjustments have been made to the files.


Well, if you have the ability, you can set up a deduplicating filesystem and put your backups on that. This will deduplicate not only whole files, but also identical pieces of otherwise different files. For example, if you have the same JPEG in several places, but with different EXIF tags on each version, a deduplicating filesystem would only need to store the image data once.

Deduplicating filesystems include lessfs, ZFS, and SDFS.
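
To give a feel for how block-level deduplication works, here is a toy sketch in Python. It is not how lessfs, ZFS, or SDFS are actually implemented; the `BlockStore` class and its tiny block size are made up purely for illustration. Each file is split into fixed-size blocks, and each distinct block is stored once, keyed by its hash:

```python
import hashlib


class BlockStore:
    """Toy content-addressed store: each distinct block is kept once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes
        self.files = {}   # file name -> ordered list of digests

    def write(self, name, data):
        digests = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if new
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())


store = BlockStore(block_size=4)
store.write("photo1.jpg", b"TAG1" + b"imagedata...")
store.write("photo2.jpg", b"TAG2" + b"imagedata...")
print(store.stored_bytes())  # 20 bytes stored for 32 bytes of files
```

One caveat with fixed-size blocks like these: the shared data only deduplicates when it lines up on the same block boundaries in both files. If the tags differ in length, everything after them shifts and the blocks stop matching, so how much you save in practice depends on the filesystem and the files.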