grep a large list against a large file
Solution 1:
Try:

grep -f the_ids.txt huge.csv

Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
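A quick illustration of why -F matters for fixed-string IDs (file names and contents here are made up for the demo): without -F, a pattern like 1.0 is treated as a regex, where . matches any character.

```shell
# Hypothetical sample files for demonstration only
printf '%s\n' '1.0' > ids.txt
printf '%s\n' '1.0' '1x0' > data.txt

# Regex mode: '.' is a wildcard, so both lines match
grep -f ids.txt data.txt

# Fixed-string mode: only the literal '1.0' matches
grep -F -f ids.txt data.txt
```

Beyond correctness, -F lets grep skip regex compilation entirely, which is where much of the speedup on large pattern lists comes from.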
Solution 2:
Use grep -f for this:

grep -f the_ids.txt huge.csv > output_file

From man grep:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing. (-f is specified by POSIX.)
If you provide some sample input, maybe we can improve the grep condition a little more.
Test
$ cat ids
11
23
55
$ cat huge.csv
hello this is 11 but
nothing else here
and here 23
bye
$ grep -f ids huge.csv
hello this is 11 but
and here 23
Solution 3:
grep -f filter.txt data.txt becomes unwieldy when filter.txt grows beyond a couple of thousand lines, so it isn't the best choice in that situation. Even while using grep -f, we need to keep a few things in mind:
- use the -x option if you need to match the entire line in the second file
- use -F if the first file contains fixed strings rather than patterns
- use -w to prevent partial matches when not using the -x option
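The -w point is easy to trip over: with ID-like keys, one ID is often a substring of another. A small sketch (file names invented for the demo):

```shell
# Hypothetical sample files: ID 11 is a substring of 113
printf '%s\n' 11 > ids.txt
printf '%s\n' 'id 11 ok' 'id 113 no' > data.txt

# Substring matching: 11 is found inside 113, so both lines match
grep -F -f ids.txt data.txt

# Word matching: 113 no longer matches, only the exact token 11 does
grep -Fw -f ids.txt data.txt
```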
This post has a great discussion on this topic (grep -f on large files):

- Fastest way to find lines of a file from another larger file in Bash

And this post talks about grep -vf:

- grep -vf too slow with large files
In summary, the best way to handle grep -f on large files is:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt
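To see the whole-line variant in action (sample files are invented): FNR==NR is true only while reading the first file, so its lines become hash keys; lines of the second file are then printed only on an exact, full-line match — note that 113 does not match 11, unlike substring grep.

```shell
# Hypothetical sample files for demonstration only
printf '%s\n' 11 23 > filter.txt
printf '%s\n' 11 113 23 > data.txt

# First pass loads filter.txt into the hash; second pass prints exact-line hits
awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt
```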
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt
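And a quick check of the field-matching variant, again with made-up sample data — the keys live in filter.txt, and only rows of the CSV whose second field is a key survive:

```shell
# Hypothetical sample files: keys in filter.txt, CSV rows in data.csv
printf '%s\n' 11 23 > filter.txt
printf '%s\n' 'a,11,x' 'b,12,y' 'c,23,z' > data.csv

# Load keys from field 1 of filter.txt, then print rows where field 2 is a key
awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.csv
```

This runs in a single pass over each file with O(1) hash lookups, which is why it scales so much better than grep -f on large filters.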
and for grep -vf:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > not_matching.txt
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt
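The inverted field match behaves the same way, just with the condition negated — rows whose second field is NOT a key are kept. A sketch with invented sample files:

```shell
# Hypothetical sample files: keys in filter.txt, CSV rows in data.csv
printf '%s\n' 11 23 > filter.txt
printf '%s\n' 'a,11,x' 'b,12,y' 'c,23,z' > data.csv

# Print rows of data.csv whose second field is absent from the key hash
awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' filter.txt data.csv
```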