Remove Duplicate Messages from Maildir
I've got a bunch of duplicate messages in my IMAP server's Maildir. What's the best way to remove them?
Some relevant points:
- A shared Message-ID is usually a good enough definition of a duplicate. A tiny script that removes all but one message from each set of duplicates would work.
- Sometimes it's necessary to find duplicates based on shared message bodies. What's a reasonable definition of shared here? Bitwise equivalent? What about weird differences in line wrapping, escaping, character encoding?
- Sometimes there's some meaningful difference between 'duplicate' messages. What's the best way to review the differences in sets of 'duplicate' messages? Diffs?
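To make the three points above concrete, here is a minimal Python sketch using only the standard library's `mailbox`, `hashlib`, and `difflib` modules. The function names (`find_duplicates`, `message_id`, `body_fingerprint`, `diff_messages`) are made up for illustration. Collapsing whitespace after decoding the transfer encoding and charset is just one possible definition of a "shared" body, not an established one. Nothing here deletes anything.

```python
import difflib
import hashlib
import mailbox
from collections import defaultdict


def message_id(msg):
    """Fingerprint 1: the Message-ID header, or "" if it is missing."""
    return str(msg.get("Message-ID", "")).strip()


def body_fingerprint(msg):
    """Fingerprint 2 (an assumption, not a standard): hash of the decoded
    text parts with all runs of whitespace collapsed to a single space."""
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        raw = part.get_payload(decode=True) or b""  # undoes base64 / quoted-printable
        charset = part.get_content_charset() or "utf-8"
        try:
            text = raw.decode(charset, errors="replace")
        except LookupError:  # unknown charset label
            text = raw.decode("utf-8", errors="replace")
        chunks.append(" ".join(text.split()))
    joined = "\n".join(chunks)
    if not joined.strip():
        return ""  # no usable text, so do not group on it
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def find_duplicates(maildir_path, fingerprint):
    """Map each fingerprint to the Maildir keys sharing it (2 or more only)."""
    box = mailbox.Maildir(maildir_path, create=False)
    groups = defaultdict(list)
    for key in box.iterkeys():
        value = fingerprint(box.get_message(key))
        if value:  # messages without a fingerprint are left alone
            groups[value].append(key)
    return {value: sorted(keys) for value, keys in groups.items() if len(keys) > 1}


def diff_messages(maildir_path, key_a, key_b):
    """Unified diff of two raw messages, for reviewing a 'duplicate' pair."""
    box = mailbox.Maildir(maildir_path, create=False)
    a = box.get_bytes(key_a).decode("utf-8", errors="replace").splitlines()
    b = box.get_bytes(key_b).decode("utf-8", errors="replace").splitlines()
    return list(difflib.unified_diff(a, b, fromfile=key_a, tofile=key_b, lineterm=""))
```

Something like `find_duplicates("/path/to/Maildir", message_id)` would list the candidate sets, and `diff_messages` could be run on each pair before deciding what to drop. Once a set has been reviewed, `mailbox.Maildir.remove(key)` deletes a single message. It is probably wise to try this on a copy of the Maildir first.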
Solution 1:
I've made some significant improvements to Kevin's script mentioned above, and he was kind enough to accept my pull requests. Eventually we split this off into a dedicated project which you can find here:
https://github.com/kdeldycke/maildir-deduplicate
Solution 2:
For generic files on Linux, I use the fdupes utility to remove duplicate files. I found that it also works for Maildir messages.
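For anyone who wants to see what a byte-for-byte approach amounts to, here is a rough Python sketch of the same idea. This is not how fdupes is implemented, just a simplified stand-in that hashes each file with SHA-256. The function name `identical_files` is made up, and it assumes the standard Maildir layout with `cur` and `new` subdirectories. Keep in mind that this kind of comparison only catches exact copies, so two deliveries of the same message that differ in any header will not be matched.

```python
import hashlib
import os
from collections import defaultdict


def identical_files(maildir_path):
    """Return lists of paths whose file contents are byte-for-byte identical."""
    by_hash = defaultdict(list)
    for sub in ("cur", "new"):  # assumed standard Maildir subdirectories
        folder = os.path.join(maildir_path, sub)
        if not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            by_hash[digest].append(path)
    # Keep only the hashes shared by two or more files
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Each returned list is one set of exact copies. The sketch only reports them, so deciding which copy to keep is left to you.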