Find IDs in one file that are not in another
Solution 1:
If your goal is to find common or uncommon lines, comm would be my go-to command here.
It compares two files and shows, in three columns, the lines that are unique to file 1, the lines that are unique to file 2, and the lines that appear in both files, respectively. You can pass it flags to suppress any of these columns. E.g. comm -1 file1 file2 will suppress the first column, the lines unique to file1, and comm -12 file1 file2 would only show the lines that appear in both files.
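For instance, given two tiny sorted files (the names here are just for illustration):
$ printf 'a\nb\nc\n' > f1.txt
$ printf 'b\nc\nd\n' > f2.txt
$ comm f1.txt f2.txt
a
		b
		c
	d
a is only in f1.txt (column 1), d is only in f2.txt (column 2, indented one tab), and b and c are in both (column 3, indented two tabs).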
There's one big caveat: the input must be sorted. We can work around this.
This will show you everything in abc.txt that isn't in mno.txt:
comm -23 <(sort abc.txt) <(sort mno.txt)
And you can pipe that into wc -l to get a count:
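$ comm -23 <(sort abc.txt) <(sort mno.txt) | wc -l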
The reason I go with comm is that once the files are sorted, the side-by-side comparison is computationally really simple. If you're dealing with millions of lines, that will make a difference.
This can be demonstrated with a couple of mock files. I have a fairly fast computer so to show the difference between approaches, I need quite a mammoth sample set. I've gone to 10 million 10-char strings per file.
$ cat /dev/urandom | tr -dc '0-9' | fold -w 10 | head -10000000 > abc.txt
$ cat /dev/urandom | tr -dc '0-9' | fold -w 10 | head -10000000 > mno.txt
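(For the curious: /dev/urandom supplies random bytes, tr -dc '0-9' deletes everything that isn't a digit, fold -w 10 wraps the remaining stream into 10-character lines, and head keeps the first 10 million of them.)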
$ time comm -23 <(sort abc.txt) <(sort mno.txt) | wc -l
... 0m10.653s
$ time grep -Fcxv -f abc.txt mno.txt
... 0m23.920s
$ time grep -Fcwv -f abc.txt mno.txt
... 0m40.313s
$ time awk 'NR==FNR{a[$0]++};NR!=FNR && a[$0]' abc.txt mno.txt | wc -l
... 0m12.161s
The sorting is what takes most of the time in mine. If we pretend that abc.txt is static, we can pre-sort it and that makes future comparisons much faster:
$ sort abc.txt > abc-sorted.txt
$ time comm -23 abc-sorted.txt <(sort mno.txt) | wc -l
... 0m7.426s
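As an aside, join performs the same kind of merge pass over sorted input; assuming the IDs are unique within each file, join -v 1 gives a comparable result, treating each whole line as the key:
$ join -v 1 abc-sorted.txt <(sort mno.txt) | wc -l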
You might look at these and consider a few seconds irrelevant, but I have to highlight that these are running on a high-end machine. If you wanted to do this on, e.g., a Raspberry Pi 3, you'd be looking at much slower turnarounds, and the difference would grow to a point where it actually matters.
Solution 2:
To get a list:
grep -Fwf abc.txt mno.txt
It gives you something similar to:
abcd
abcd
zef
If you just want a unique list, then use it like:
grep -Fwf abc.txt mno.txt | sort | uniq
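sort -u collapses the sort | uniq pair into one step:
grep -Fwf abc.txt mno.txt | sort -u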
And to get the count of lines in mno.txt that are not in abc.txt:
grep -Fcwv -f abc.txt mno.txt
The flags:
- -F: interpret the patterns as a list of fixed strings instead of regular expressions.
- -f FILE: obtain the patterns from FILE, which here is abc.txt; we look in mno.txt for those patterns.
- -c: count the number of matching lines.
- -w: only look for "whole words": the matching substring must either be at the beginning of the line or preceded by a non-word-constituent character, and similarly must either be at the end of the line or followed by a non-word-constituent character. Word-constituent characters are letters, digits, and the underscore.
- -v: invert the match, selecting the non-matching lines.
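One caveat: -w also matches an ID embedded in a longer line. If each line is exactly one ID, -x (whole-line matching, as used in the grep benchmark of Solution 1) is stricter. A tiny illustration, with a made-up one-pattern file:
$ printf 'abcd\n' > patterns.txt
$ printf 'abcd efgh\n' | grep -Fwf patterns.txt
abcd efgh
$ printf 'abcd efgh\n' | grep -Fxf patterns.txt
(no output: the whole line doesn't equal any pattern)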
Solution 3:
We could use awk to do the job by passing it two files: first the pattern file, then the file we want to check. While awk is reading the first file, NR==FNR holds, and at that point we can store its lines in an array. When NR!=FNR, we check whether the array entry for the current line is set.
$ cat abc.txt
abcd
xyz
pqrs
$ cat mno.txt
zzon
xyz
mkno
abcd
$ awk 'NR==FNR{a[$0]++};NR!=FNR && a[$0]' abc.txt mno.txt
xyz
abcd
Conversely, we can negate the condition to print the lines that aren't in abc.txt:
$ awk 'NR==FNR{a[$0]++};NR!=FNR && ! a[$0]' abc.txt mno.txt
zzon
mkno
And if we want to print the count of those, we can employ sort and wc:
$ awk 'NR==FNR{a[$0]++};NR!=FNR && ! a[$0]' abc.txt mno.txt | sort -u | wc -l
2
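If you'd rather not spawn sort and wc at all, the unique-count bookkeeping can live inside awk as well; a rough equivalent of the pipeline above:
$ awk 'NR==FNR{a[$0]++; next} !($0 in a) && !seen[$0]++ {n++} END{print n+0}' abc.txt mno.txt
2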
Solution 4:
If either of the word lists is unsorted, it would be faster to use an efficient set data structure to remember the common words.
Python
#!/usr/bin/env python3
import sys
with open(sys.argv[1]) as minuend_file:
    minuend = frozenset(map(str.rstrip, minuend_file))
with open(sys.argv[2]) as subtrahend_file:
    subtrahend = frozenset(map(str.rstrip, subtrahend_file))
difference = minuend - subtrahend
#print(*difference, sep='\n') # This prints the content of the set difference
print(len(difference)) # This prints the magnitude of the set difference
Usage:
python3 set-difference.py abc.txt mno.txt
Python (more efficient)
If you want to save a little memory for intermediary storage and some run time, you can use this slightly more difficult to understand program; it never builds a second set, but streams the subtrahend file straight into set.difference_update:
#!/usr/bin/env python3
import sys
with open(sys.argv[1]) as minuend_file:
    minuend = set(map(str.rstrip, minuend_file))
with open(sys.argv[2]) as subtrahend_file:
    subtrahend = map(str.rstrip, subtrahend_file)
    minuend.difference_update(subtrahend)
difference = minuend
del minuend
#print(*difference, sep='\n') # This prints the content of the set difference
print(len(difference)) # This prints the magnitude of the set difference
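Usage is the same as before:
python3 set-difference.py abc.txt mno.txt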
Performance
Given abc.txt and mno.txt with 10 million unsorted lines of 10 random ASCII digit characters each (see Oli's answer for the set-up):
$ time python3 set-difference.py abc.txt mno.txt
user 0m10.453s
vs.
$ export LC_COLLATE=C
$ time sort abc.txt > abc_sorted.txt
user 0m10.652s
$ time sort mno.txt > mno_sorted.txt
user 0m10.767s
$ time comm -23 abc_sorted.txt mno_sorted.txt | wc -l
9989882
user 0m1.600s
total: 23 seconds