Deleting lines from one file which are in another file
I have a file f1:
line1
line2
line3
line4
..
..
I want to delete all the lines which are in another file f2:
line2
line8
..
..
I tried something with cat and sed, which wasn't even close to what I intended. How can I do this?
Solution 1:
grep -v -x -f f2 f1
should do the trick.
Explanation:
- -v to select non-matching lines
- -x to match whole lines only
- -f f2 to get patterns from f2
One can instead use grep -F or fgrep to match fixed strings from f2 rather than patterns (in case you want to remove the lines in a "what you see is what you get" manner rather than treating the lines in f2 as regex patterns).
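To make the difference between pattern and fixed-string matching concrete, here is a minimal sketch. It assumes a POSIX-like shell with mktemp available, and the file contents are made up for illustration:

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d)
printf '%s\n' 'a.c' 'abc' 'line3' > "$tmp/f1"
printf '%s\n' 'a.c' > "$tmp/f2"

# Regex matching: the "." in "a.c" matches any character, so "abc" is removed too.
regex_out=$(grep -v -x -f "$tmp/f2" "$tmp/f1")

# Fixed-string matching: only the literal line "a.c" is removed.
fixed_out=$(grep -v -x -F -f "$tmp/f2" "$tmp/f1")

printf 'regex:\n%s\n' "$regex_out"
printf 'fixed:\n%s\n' "$fixed_out"
rm -rf "$tmp"
```

If f2 may contain characters such as ".", "*" or "[", the -F form is probably the one you want.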
Solution 2:
Try comm instead (assuming f1 and f2 are already sorted):
comm -2 -3 f1 f2
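If the files are not sorted yet, one way to handle it is to sort copies first. This is only a sketch with made-up file contents, assuming mktemp is available. Note that the result comes out in sorted order, not in the original order of f1:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'line3' 'line1' 'line2' > "$tmp/f1"
printf '%s\n' 'line8' 'line2' > "$tmp/f2"

# comm needs both inputs sorted in the same collation order; LC_ALL=C pins that order.
LC_ALL=C sort "$tmp/f1" > "$tmp/f1.sorted"
LC_ALL=C sort "$tmp/f2" > "$tmp/f2.sorted"

# -2 drops lines unique to f2, -3 drops lines common to both,
# so only the lines unique to f1 remain.
comm_out=$(LC_ALL=C comm -2 -3 "$tmp/f1.sorted" "$tmp/f2.sorted")
printf '%s\n' "$comm_out"
rm -rf "$tmp"
```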
Solution 3:
For exclude files that aren't too huge, you can use AWK's associative arrays.
awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' exclude-these.txt from-this.txt
The output will be in the same order as the "from-this.txt" file. The tolower() function makes it case-insensitive, if you need that.
The algorithmic complexity will probably be O(n) (exclude-these.txt size) + O(m) (from-this.txt size), assuming array lookups take constant time on average.
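A small run of the command above, with made-up file contents and assuming mktemp is available, shows both the case-insensitive matching and the preserved order:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Line2' 'line8' > "$tmp/exclude-these.txt"
printf '%s\n' 'line4' 'LINE2' 'line1' > "$tmp/from-this.txt"

# First file (NR == FNR): remember each lower-cased line as an array key.
# Second file: print only lines whose lower-cased form was not remembered.
awk_out=$(awk 'NR == FNR { list[tolower($0)]=1; next }
               { if (! list[tolower($0)]) print }' \
    "$tmp/exclude-these.txt" "$tmp/from-this.txt")
printf '%s\n' "$awk_out"
rm -rf "$tmp"
```

"LINE2" is dropped because it matches "Line2" once both are lower-cased, and "line4" still comes before "line1". Drop both tolower() calls if you want exact matching.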
Solution 4:
Similar to Dennis Williamson's answer (mostly syntactic changes, e.g. setting the file number explicitly instead of the NR == FNR trick):
awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 exclude-these.txt f=2 from-this.txt
Accessing r[$0] creates the entry for that line, so there is no need to set a value.
Assuming awk uses a hash table with constant lookup and (on average) constant update time, the time complexity of this will be O(n + m), where n and m are the lengths of the files. In my case, n was ~25 million and m ~14000. The awk solution was much faster than sort, and I also preferred keeping the original order.
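One practical difference between the explicit file number and the NR == FNR trick shows up when the exclude file is empty. The following is a sketch with made-up contents, assuming mktemp is available; the awk programs are condensed rewrites for illustration, not the exact commands from the answers:

```shell
tmp=$(mktemp -d)
: > "$tmp/exclude-these.txt"                      # empty exclude list
printf '%s\n' 'line1' 'line2' > "$tmp/from-this.txt"

# Explicit file number: f is set by the command-line assignments
# just before each file is read.
explicit_out=$(awk 'f == 1 { r[$0]; next } !($0 in r)' \
    f=1 "$tmp/exclude-these.txt" f=2 "$tmp/from-this.txt")

# NR == FNR: with an empty first file, the condition is also true for
# every line of the second file, so nothing gets printed.
nrfnr_out=$(awk 'NR == FNR { r[$0]; next } !($0 in r)' \
    "$tmp/exclude-these.txt" "$tmp/from-this.txt")

printf 'explicit:\n%s\n' "$explicit_out"
printf 'NR == FNR:\n%s\n' "$nrfnr_out"
rm -rf "$tmp"
```

With the explicit f=1 / f=2 assignments both lines are printed, while the NR == FNR version prints nothing in this edge case.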