Filter file when a column is in whitelist
There are many ways of doing this. Here are a few, but use the Perl one: it is orders of magnitude faster. I include the others for the sake of completeness:
- Perl and hashes, ridiculously fast (an expanded, commented version is sketched after this list):
perl -e 'open(A,"fileB"); while(<A>){chomp; $k{$_}++} while(<>){@a=split(/,/); print if defined $k{$a[0]}}' fileA
- gawk and associative arrays, much slower (an equivalent NR==FNR idiom is sketched after this list):
gawk '{if(FILENAME==ARGV[1]){array[$1]=1} else{if($1 in array){print}}}' fileA fileB
- grep, ridiculously slow. You will need to modify your fileB slightly so that the patterns match only at the beginning of the line:
sed 's/\(.\)/^\1/' fileB > fileC
grep -f fileC fileA
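For readability, here is the Perl approach written out as a standalone script with comments. This is only a sketch of the same hash-lookup idea as the one-liner above; the file names fileA/fileB and the comma separator are assumptions carried over from it:

#!/usr/bin/perl
use strict;
use warnings;

# Build a lookup hash from the whitelist file (fileB, assumed name).
# Hash lookups are constant time, which is what makes this approach fast.
my %whitelist;
open(my $wl, '<', 'fileB') or die "Cannot open fileB: $!";
while (my $line = <$wl>) {
    chomp $line;
    $whitelist{$line} = 1;
}
close($wl);

# Scan fileA, take the first comma-separated field of each line,
# and print the line only when that field is a key in the whitelist.
open(my $in, '<', 'fileA') or die "Cannot open fileA: $!";
while (my $line = <$in>) {
    my $key = $line;
    chomp $key;
    ($key) = split /,/, $key;
    print $line if exists $whitelist{$key};
}
close($in);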
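The gawk version can also be expressed with the common NR==FNR idiom, which avoids comparing FILENAME on every line. This is just an equivalent sketch of the same associative-array technique, keeping the same file roles as above (whitelist keys from fileA's first field, lines filtered from fileB) and the default whitespace field splitting:

gawk 'NR==FNR { seen[$1] = 1; next } $1 in seen' fileA fileB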
I created a couple of test files and it turns out that the Perl solution is much faster than the grep one:
$ head -2 fileA
GO:0032513_GO:0050129
GO:0050129_GO:0051712
$ head -2 fileB
GO:0032513_GO:0050129
GO:0050129_GO:0051712
$ wc -l fileA fileB
1500000 fileA
20000000 fileB
$ time perl -e 'open(A,"fileB"); while(<A>){chomp; $k{$_}++}
while(<>){@a=split(/,/); print if defined $k{$a[0]}}' fileA > /dev/null
real 0m41.354s
user 0m37.370s
sys 0m3.960s
$ time gawk '{if(FILENAME==ARGV[1]){array[$1]=1}
else{if($1 in array){print}}}' fileA fileB
real 2m30.963s
user 1m23.857s
sys 0m9.385s
$ time (join -t, <(sort -n fileA) <(sort -n fileB) >/dev/null)
real 8m29.532s
user 13m52.576s
sys 1m22.029s
So, the Perl scriptlet can go through a 20 million line file looking for 1.5 million patterns and finish in ~40 seconds. Not bad. The other approaches are much slower: gawk took 2.5 minutes, join took about 8.5, and the grep one has been running for more than 15 minutes. Perl wins hands down.
This should do the trick:
join -t, <(sort A) <(sort B)
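One thing to keep in mind with this approach: join requires both inputs to be sorted on the join field, which is what the process substitutions above do inline. A more explicit sketch, assuming comma-separated files joined on their first field (join's default), would be:

$ sort -t, -k1,1 A > A.sorted
$ sort -t, -k1,1 B > B.sorted
$ join -t, A.sorted B.sorted

Note that join's output is the join field followed by the remaining fields from both files, so this is a true join rather than a pure filter of A.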