How do I deduplicate 40 TB of data?

Or, is there a way to direct a process to use disk-mapped RAM so there's much more available and it doesn't use system RAM?

Yes, it's called swap space, and you probably already have some. If you're worried about running out of RAM, increasing it is a good place to start. It works automatically, though, so there is no need to do anything special.
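For reference, a sketch of checking and enlarging swap on a typical Linux box. The 16G size and /swapfile path are just examples; the privileged commands are commented out and must be run as root, adjusted to your system:

```shell
# Inspect current swap devices and overall memory/swap usage:
swapon --show
free -h

# Add a 16 GiB swap file (example size and path; run as root):
#   fallocate -l 16G /swapfile
#   chmod 600 /swapfile
#   mkswap /swapfile
#   swapon /swapfile
# To make it survive reboots, add a line to /etc/fstab:
#   /swapfile none swap sw 0 0
```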

I would not worry about fdupes. Try it; it should work without problems.


Finding duplicates by hashing works well and is very fast. The pipeline below first groups files by size, then computes MD5 checksums only for files whose size collides with another file's, and finally prints groups of files with identical checksums, separated by blank lines:

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d \
    | xargs -I{} -n1 find -type f -size {}c -print0 \
    | xargs -0 md5sum | sort \
    | uniq -w32 --all-repeated=separate
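Note that the pipeline only lists duplicate groups; it deletes nothing. Below is a sketch of a small wrapper around the same pipeline that prints every redundant copy, i.e. all files in a group except the first. It assumes GNU findutils/coreutils and filenames without newlines; the `list_dupes` name and the dry-run loop at the end are illustrative:

```shell
# list_dupes DIR: print the paths of redundant copies -- every file in a
# duplicate group except the first (lexically smallest "hash  path" line).
# Assumes GNU find/sort/uniq/md5sum and filenames without newlines.
list_dupes() {
    find "${1:-.}" -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |
        xargs -I{} -n1 find "${1:-.}" -type f -size {}c -print0 |
        xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate |
        awk 'BEGIN { keep = 1 }             # keep the first line of each group
             NF == 0 { keep = 1; next }     # blank line separates groups
             keep    { keep = 0; next }     # first file in group: leave it on disk
             { print substr($0, 35) }'      # 32-char hash + 2 spaces: path starts at col 35
}

# Dry run first -- inspect the output before removing the 'echo':
#   list_dupes /some/dir | while IFS= read -r f; do echo rm -- "$f"; done
```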