Find IDs in one file that are not in another

Solution 1:

If your goal is to find common or uncommon lines, comm would be my go-to command here.

It compares two files and shows, in three columns, the lines unique to file 1, the lines unique to file 2, and the lines that appear in both files, respectively. You can pass it flags to suppress any of these columns. E.g. comm -1 file1 file2 will suppress the first column (the lines unique to file1), and comm -12 file1 file2 will show only the lines common to both files.
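
For a quick illustration with two tiny (already sorted) made-up inputs, where column 2 is indented by one tab and column 3 by two:

$ comm <(printf 'a\nb\nc\n') <(printf 'b\nc\nd\n')
a
		b
		c
	d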

There's one big caveat: the input must be sorted. We can work around this.

This will show you everything in abc which isn't in mno:

comm -23 <(sort abc.txt) <(sort mno.txt)

And you can pipe that into wc -l to get a count.
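
For example:

comm -23 <(sort abc.txt) <(sort mno.txt) | wc -l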


The reason I go with comm is that once the files are sorted, the comparison is a single linear pass down both files side by side, which is computationally very cheap. If you're dealing with millions of lines, that will make a difference.

This can be demonstrated with a couple of mock files. I have a fairly fast computer, so to show the difference between approaches I need quite a mammoth sample set; I've gone with 10 million 10-character strings per file.

$ cat /dev/urandom | tr -dc '0-9' | fold -w 10 | head -10000000 > abc.txt
$ cat /dev/urandom | tr -dc '0-9' | fold -w 10 | head -10000000 > mno.txt

$ time comm -23 <(sort abc.txt) <(sort mno.txt) | wc -l
... 0m10.653s

$ time grep -Fcxv -f abc.txt mno.txt
... 0m23.920s

$ time grep -Fcwv -f abc.txt mno.txt
... 0m40.313s

$ time awk 'NR==FNR{a[$0]++};NR!=FNR && a[$0]' abc.txt  mno.txt | wc -l
... 0m12.161s

The sorting is what takes most of the time in mine. If we pretend that abc.txt is static, we can pre-sort it and that makes future comparisons much faster:

$ sort abc.txt > abc-sorted.txt
$ time comm -23 abc-sorted.txt <(sort mno.txt) | wc -l
... 0m7.426s

You might look at these numbers and consider a few seconds irrelevant, but I have to highlight that these are running on a high-end machine. If you wanted to do this on, say, a Raspberry Pi 3, you'd be looking at much slower turnarounds, and the difference would grow to a point where it actually matters.

Solution 2:

To get a list:

grep -Fwf abc.txt mno.txt

This gives you something similar to:

abcd
abcd
zef

If you want just a unique list, then use:

grep -Fwf abc.txt mno.txt | sort | uniq
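
The sort | uniq pair can also be written as sort -u, which does the same thing in a single command:

grep -Fwf abc.txt mno.txt | sort -u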

And to get the count:

grep -Fcwv -f abc.txt mno.txt

  • -F means: interpret the patterns as a list of fixed strings instead of regular expressions.
  • -f FILE: obtain the patterns from FILE, which here is abc.txt.
  • We search mno.txt for those patterns.
  • -c: count the matching lines instead of printing them.
  • -w: only match "whole words": the matching substring must either be at the beginning of the line or preceded by a non-word-constituent character, and likewise must either be at the end of the line or followed by a non-word-constituent character. Word-constituent characters are letters, digits, and the underscore. (See also the note after this list.)
  • -v: invert the search, selecting the non-matching lines.
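
One note on -w: if, as here, each ID occupies an entire line, -x (match whole lines only) is a stricter alternative that avoids matching an ID embedded inside a longer line:

grep -Fcxv -f abc.txt mno.txt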

Solution 3:

We can use awk to do the job by passing two files: first the pattern file, then the file we want to check. While reading the first file, NR==FNR holds, and during that phase we load the lines into an array. When NR!=FNR, i.e. while reading the second file, we check whether the array entry for the current line is set.

$ cat abc.txt                                                      
abcd
xyz
pqrs
$ cat mno.txt                                                      
zzon
xyz
mkno
abcd
$ awk 'NR==FNR{a[$0]++};NR!=FNR && a[$0]' abc.txt  mno.txt         
xyz
abcd

Conversely, we can negate the condition to print the lines that aren't in abc.txt:

$ awk 'NR==FNR{a[$0]++};NR!=FNR && ! a[$0]' abc.txt  mno.txt       
zzon
mkno

And if we want to print the count of those, we can employ sort and wc:

$ awk 'NR==FNR{a[$0]++};NR!=FNR && ! a[$0]' abc.txt  mno.txt | sort -u | wc -l         
2
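
As a side note, a slightly more idiomatic sketch of the same idea uses next to skip the second condition while the first file is being read, and the in operator, which tests membership without creating empty array entries:

$ awk 'NR==FNR { a[$0]; next } !($0 in a)' abc.txt mno.txt
zzon
mkno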

Solution 4:

If either of the word lists is unsorted, it would be faster to use an efficient set data structure to remember the common words.

Python

#!/usr/bin/env python3
import sys

with open(sys.argv[1]) as minuend_file:
    minuend = frozenset(map(str.rstrip, minuend_file))
with open(sys.argv[2]) as subtrahend_file:
    subtrahend = frozenset(map(str.rstrip, subtrahend_file))

difference = minuend - subtrahend
#print(*difference, sep='\n') # This prints the content of the set difference
print(len(difference)) # This prints the magnitude of the set difference

Usage:

python3 set-difference.py abc.txt mno.txt

Python (more efficient)

If you want to save a little memory for intermediate storage and run time, you can use this slightly harder-to-follow program:

#!/usr/bin/env python3
import sys

with open(sys.argv[1]) as minuend_file:
    minuend = set(map(str.rstrip, minuend_file))
with open(sys.argv[2]) as subtrahend_file:
    subtrahend = map(str.rstrip, subtrahend_file)
    minuend.difference_update(subtrahend)
    difference = minuend
    del minuend

#print(*difference, sep='\n') # This prints the content of the set difference
print(len(difference)) # This prints the magnitude of the set difference
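
Usage is the same as for the first script (assuming you save it under the same name):

python3 set-difference.py abc.txt mno.txt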

Performance

Given abc.txt and mno.txt with 10 million unsorted lines of 10 random ASCII digit characters each (see Oli's answer for the set-up):

$ time python3 set-difference.py abc.txt mno.txt
user    0m10.453s

vs.

$ export LC_COLLATE=C
$ time sort abc.txt > abc_sorted.txt
user    0m10.652s
$ time sort mno.txt > mno_sorted.txt
user    0m10.767s
$ time comm -23 abc_sorted.txt mno_sorted.txt | wc -l
9989882
user    0m1.600s

total: 23 seconds