Remove Duplicate Messages from Maildir

I've got a bunch of duplicate messages in my IMAP server's Maildir. What's the best way to remove them?

Some relevant points:

  • Shared Message-ID is usually a good enough definition of duplicate. A tiny script that removes all but one of the duplicate messages would work.
  • Sometimes it's necessary to find duplicates based on shared message bodies. What's a reasonable definition of shared here? Bitwise equivalent? What about weird differences in line wrapping, escaping, character encoding?
  • Sometimes there's some meaningful difference between 'duplicate' messages. What's the best way to review the differences in sets of 'duplicate' messages? Diffs?
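
To make the three points above concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not an established tool: the function names (find_duplicates, body_digest, diff_messages, extras) are made up for this example, and "same body" is assumed to mean "same decoded text after collapsing whitespace", which is only one possible definition. It also assumes a flat Maildir with cur/ and new/ subdirectories and does not descend into subfolders.

```python
import difflib
import email
import hashlib
from collections import defaultdict
from email import policy
from pathlib import Path


def iter_maildir_files(maildir):
    """Yield message files from the cur/ and new/ subdirectories."""
    for sub in ("cur", "new"):
        folder = Path(maildir) / sub
        if folder.is_dir():
            yield from sorted(p for p in folder.iterdir() if p.is_file())


def body_digest(msg):
    """Hash the decoded text body with all whitespace runs collapsed.

    Assumption: decoding the transfer encoding and charset, then
    collapsing whitespace, is enough to ignore re-wrapping and
    re-encoding differences. Messages with no text part hash as empty.
    """
    part = msg.get_body(preferencelist=("plain", "html"))
    text = part.get_content() if part is not None else ""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(maildir, by_body=False):
    """Group message files by Message-ID, or by body digest if by_body."""
    groups = defaultdict(list)
    for path in iter_maildir_files(maildir):
        with open(path, "rb") as fh:
            msg = email.message_from_binary_file(fh, policy=policy.default)
        if by_body:
            key = body_digest(msg)
        else:
            key = str(msg.get("Message-ID", "")).strip()
        if key:  # messages without a Message-ID are never grouped
            groups[key].append(path)
    return {k: v for k, v in groups.items() if len(v) > 1}


def diff_messages(path_a, path_b):
    """Return a unified diff of two raw message files for manual review."""
    a = Path(path_a).read_text(encoding="utf-8", errors="replace").splitlines()
    b = Path(path_b).read_text(encoding="utf-8", errors="replace").splitlines()
    return "\n".join(
        difflib.unified_diff(a, b, str(path_a), str(path_b), lineterm="")
    )


def extras(groups):
    """List every file except the first of each group (removal candidates)."""
    return [path for paths in groups.values() for path in paths[1:]]
```

A cautious workflow would be to call find_duplicates(), print diff_messages() for each pair in a group, and only then call path.unlink() on the entries returned by extras(). Back up the Maildir first, since which copy counts as "the first" here is just sort order.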

Solution 1:

I've made some significant improvements to Kevin's script mentioned above, and he was kind enough to accept my pull requests. Eventually we split this off into a dedicated project which you can find here:

https://github.com/kdeldycke/maildir-deduplicate

Solution 2:

For generic files on Linux, I use the fdupes utility to remove duplicate files. I found it also works for Maildir messages.
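
One caveat worth illustrating: fdupes matches files by their contents, so it only catches copies that are byte-for-byte identical. Two deliveries of the same message that differ in a header such as Received will not be reported. The Python sketch below is not fdupes; it just shows the same idea of content-based matching, with a hypothetical function name (identical_files) chosen for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def identical_files(root):
    """Group files under root whose raw bytes hash to the same SHA-256."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only digests shared by two or more files.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

If this finds fewer duplicates than expected, grouping by Message-ID is likely the better fit for that mailbox.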