How to see the email addresses that are present in two different documents? [closed]

I have a couple of .PDF and .DOC documents that contain thousands of email addresses. I want to know which email addresses are duplicates. Yes, the documents have more content than just the email addresses.

I want to see (not just remove) the actual email addresses that are present in two different documents. By "see", I mean that I want to harvest those addresses. I do not mind if the program removes the duplicates from the documents, as long as it shows me which addresses were duplicated instead of silently dropping them.

How can I accomplish this?


Presumably the documents have more content than just the email addresses?

A non-scripting approach would be

  1. extract email addresses to file
  2. sort
  3. remove duplicates
  4. diff result of (2) with (3)

1) Convert document to list of email addresses

  • Install Notepad++ and copy your document into it
  • Open "Find and Replace"
  • Find: (\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b) // Vaguely accurate email regex
  • Replace: \r\n$&\r\n
  • Make sure "Regular expression" option is checked

Now you have each email on its own line. Switch to the "Mark" tab and repeat the search, this time bookmarking the matching lines with the "Bookmark line" option of Notepad++'s Mark feature.

Now remove unmarked lines: Search > Bookmark > Remove Unmarked Lines
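If a script is acceptable after all, the extraction step can be sketched in a few lines of Python. This is only an illustration, not part of the Notepad++ workflow: it assumes the document text has already been copied out as plain text, the name `extract_emails` and the sample text are made up here, and the pattern is the same rough regex as in the Find box above (so it has the same limitations).

```python
import re

# Same "vaguely accurate" pattern as in the Find box above.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b")

def extract_emails(text):
    """Return every email-like match in text, in order, duplicates kept."""
    return EMAIL_RE.findall(text)

# Made-up sample text standing in for the pasted document.
sample = "Contact alice@example.com or bob@example.org; alice@example.com again."
for address in extract_emails(sample):
    print(address)
```

Printing one address per line gives the same kind of list that the bookmark-and-remove steps produce.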

2) Sort lines

Edit -> Line Operations -> Sort Lines Lexicographically Ascending

Save a copy of this file.

3) Remove duplicates

Install the TextFX plugin for Notepad++.

Sort again, this time with TextFX and with its "Sort outputs only Unique" option checked.

4) Diff

Use a diff tool (WinMerge or the "Compare" plugin for Notepad++) to compare the sorted list with duplicates against the sorted list without duplicates. The lines that differ are the duplicated emails.
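For anyone who does not mind a little scripting, the sort, de-duplicate, and diff steps can be approximated in Python. This is a minimal sketch under a few assumptions: the addresses have already been extracted into one list per document, the function names `duplicated` and `in_both` and the sample lists are hypothetical, and addresses are compared case-insensitively (a choice made here, not something the Notepad++ steps do).

```python
from collections import Counter

def duplicated(addresses):
    """Addresses that occur more than once, sorted (case-insensitive)."""
    counts = Counter(a.lower() for a in addresses)
    return sorted(a for a, n in counts.items() if n > 1)

def in_both(first, second):
    """Addresses that appear in both lists, sorted (case-insensitive)."""
    return sorted({a.lower() for a in first} & {a.lower() for a in second})

# Made-up lists standing in for the addresses extracted from each document.
doc_a = ["alice@example.com", "bob@example.org", "alice@example.com"]
doc_b = ["carol@example.net", "Bob@example.org"]

print(duplicated(doc_a + doc_b))  # duplicates across the combined list
print(in_both(doc_a, doc_b))      # only addresses shared by the two documents
```

`duplicated` mirrors the diff of the sorted list against the unique list, while `in_both` answers the narrower question of which addresses occur in both documents, ignoring repeats inside a single document.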

Credit to https://www.kniko.net/how-to-extract-email-addresses-from-a-text-file-using-notepad-with-no-coding-at-all/ for images and email regex


You could use an online service:

  • Select all the text in the document
  • Use the browser to navigate to Email Extractor For Web Pages and Text
  • Paste your text
  • Under "Step 3: Extract Emails", click Extract
  • The emails will be displayed.