bash, Linux: Set difference between two text files
I have two files, A (nodes_to_delete) and B (nodes_to_keep). Each file has many lines with numeric ids.
I want the list of numeric ids that are in nodes_to_delete but NOT in nodes_to_keep.
Doing it within a PostgreSQL database is unreasonably slow. Any neat way to do it in bash using Linux CLI tools?
UPDATE: This would seem to be a Pythonic job, but the files are really, really large. I have solved some similar problems using uniq, sort and some set theory techniques. This was about two or three orders of magnitude faster than the database equivalents.
The comm command does that.
Somebody showed me how to do exactly this in sh a couple months ago, and then I couldn't find it for a while... and while looking I stumbled onto your question. Here it is :
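# Lines that appear in either file, with duplicates collapsed.
set_union () {
    sort "$1" "$2" | uniq
}
# Lines in $1 but not in $2. Assumes $1 itself has no duplicate lines.
set_difference () {
    sort "$1" "$2" "$2" | uniq -u
}
# Lines in exactly one file. Assumes neither file has duplicate lines.
set_symmetric_difference () {
    sort "$1" "$2" | uniq -u
}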
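As a quick sanity check, the set_difference function above can be exercised on tiny inputs. This is only a minimal sketch: the temporary files and ids are made up for illustration, and it assumes the first file contains no duplicate lines, since uniq -u would otherwise drop the repeated entries.

```shell
# Hypothetical sample files; the ids are invented for this demo.
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 4 > "$tmp/a"
printf '%s\n' 3 4 5   > "$tmp/b"

# Same technique as above: $2 is listed twice so each of its lines always
# repeats, which leaves only the lines found in $1 alone for `uniq -u`.
set_difference () {
    sort "$1" "$2" "$2" | uniq -u
}

diff_out=$(set_difference "$tmp/a" "$tmp/b")
printf '%s\n' "$diff_out"   # prints 1 and 2, one per line
rm -rf "$tmp"
```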
Use comm: it compares two sorted files line by line.
The short answer to your question
This command will return lines unique to deleteNodes, and not in keepNodes.
comm -1 -3 <(sort keepNodes) <(sort deleteNodes)
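For very large id files, one option is to sort each file once into a temporary file and then run comm on the sorted copies. The sketch below uses the question's file names nodes_to_delete and nodes_to_keep; the sample ids, the temporary directory, and the *.sorted names are made up. LC_ALL=C is set so that sort and comm agree on a byte-wise ordering. Note that the ids are sorted as text rather than with sort -n, because comm expects collation order, not numeric order.

```shell
export LC_ALL=C   # make sort and comm use the same byte-wise ordering

# Hypothetical sample data standing in for the real id files.
work=$(mktemp -d)
printf '%s\n' 10 3 7 42 > "$work/nodes_to_delete"
printf '%s\n' 7 99 10   > "$work/nodes_to_keep"

# Sort once (dropping duplicate ids) so the results can be reused.
sort -u -o "$work/delete.sorted" "$work/nodes_to_delete"
sort -u -o "$work/keep.sorted"   "$work/nodes_to_keep"

# -2 hides lines unique to the second file, -3 hides lines common to both,
# leaving only the ids that appear in nodes_to_delete alone.
only_delete=$(comm -23 "$work/delete.sorted" "$work/keep.sorted")
printf '%s\n' "$only_delete"   # prints 3 and 42, one per line
rm -rf "$work"
```

Here nodes_to_delete is passed as the first file, so -2 -3 is used; with the files in the opposite order, as in the command above, the equivalent is -1 -3.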
Example setup
Let's create the files named keepNodes and deleteNodes, and use them as unsorted input for the comm command.
$ cat > keepNodes <(echo bob; echo amber;)
$ cat > deleteNodes <(echo bob; echo ann;)
By default, running comm without options prints 3 columns with this layout:
lines_unique_to_FILE1
        lines_unique_to_FILE2
                lines_which_appear_in_both
Using our example files above, run comm without options. Note the three columns.
$ comm <(sort keepNodes) <(sort deleteNodes)
amber
        ann
                bob
Suppressing column output
Suppress column 1, 2 or 3 with -1, -2 or -3 respectively; note that when a column is hidden, the remaining columns shift left to fill the gap.
$ comm -1 <(sort keepNodes) <(sort deleteNodes)
ann
        bob
$ comm -2 <(sort keepNodes) <(sort deleteNodes)
amber
        bob
$ comm -3 <(sort keepNodes) <(sort deleteNodes)
amber
        ann
$ comm -1 -3 <(sort keepNodes) <(sort deleteNodes)
ann
$ comm -2 -3 <(sort keepNodes) <(sort deleteNodes)
amber
$ comm -1 -2 <(sort keepNodes) <(sort deleteNodes)
bob
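The column options can also be written together (-13 is the same as -1 -3), which gives a compact spelling for each set operation. The following is a small sketch using the same example names; the temporary directory is an assumption, and the files are written already sorted so that no sort step is needed.

```shell
# Recreate the example files in a scratch directory, already sorted.
tmp=$(mktemp -d)
printf '%s\n' amber bob > "$tmp/keepNodes"
printf '%s\n' ann bob   > "$tmp/deleteNodes"

# Hide two columns at a time to keep exactly one.
only_delete=$(comm -13 "$tmp/keepNodes" "$tmp/deleteNodes")   # deleteNodes minus keepNodes
only_keep=$(comm -23 "$tmp/keepNodes" "$tmp/deleteNodes")     # keepNodes minus deleteNodes
in_both=$(comm -12 "$tmp/keepNodes" "$tmp/deleteNodes")       # intersection

printf 'delete only: %s\nkeep only: %s\nboth: %s\n' \
    "$only_delete" "$only_keep" "$in_both"
rm -rf "$tmp"
```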
Sorting is important!
If you run comm on input that has not been sorted first, the output cannot be trusted. GNU comm prints a message naming the file that is out of order:
comm: file 1 is not in sorted order
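One way to catch this before running comm is to test each input with sort -c, which exits non-zero when a file is not in sorted order. This is a minimal sketch: is_sorted is a hypothetical helper name and the sample files are made up. The check should run under the same locale settings as the later comm call, since both depend on the collation order.

```shell
tmp=$(mktemp -d)
printf '%s\n' bob amber > "$tmp/unsorted"
printf '%s\n' amber bob > "$tmp/sorted"

# Hypothetical helper: succeeds only if the file is already in sorted order.
# sort -c checks without producing output; its diagnostic is discarded here.
is_sorted () {
    sort -c "$1" 2>/dev/null
}

if is_sorted "$tmp/unsorted"; then unsorted_status=sorted; else unsorted_status=unsorted; fi
if is_sorted "$tmp/sorted";   then sorted_status=sorted;   else sorted_status=unsorted;   fi

printf '%s %s\n' "$unsorted_status" "$sorted_status"   # prints: unsorted sorted
rm -rf "$tmp"
```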