Grep to filter gigantic CSV file
I'm dealing with a CSV file that contains over 2 million lines. Quite a big one.
I need to use grep (or any better way) to retrieve the entire row if the value of the second column matches 'jpn' or 'por'.
I tried using grep '\<jpn\>' || grep '\<por\>' file1.csv > file2.csv
but had no success so far. The major issue is that 'por' is a common occurrence in the third column, which generates over a million undesired rows.
Would any more experienced user be so kind as to lend me some help?
Thanks in advance! :D
What Linux OS are you on? Try using egrep if you want to use regular expressions on systems whose grep version is outdated (e.g. Solaris).
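For instance, with egrep (or grep -E) you can anchor the match to the second column directly. A rough sketch, assuming plain comma-separated fields with no quoting and at least one more column after the second:

grep -E '^[^,]*,(jpn|por),' file1.csv > file2.csv

Here ^[^,]*, skips the first column, and the trailing comma keeps jpn/por from matching inside a longer value.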
Anyway, here is an awk solution:
awk -F, '$2 ~ /jpn|por/ {print}' file1.csv > file2.csv
Explanations:
- awk for column-based operations
- -F, to define the column separator/tokenizer, in this case a comma
- $2 ~ /jpn|por/ tests column #2 against the expression /jpn|por/
  - $2 is column #2
  - /jpn|por/ is a regular expression that matches jpn or por (an exact-match variant is shown after this list)
- {print} specifies what awk should output when it finds a matching line
  - print prints the whole input line (alternatively, print $3 would print just column #3)
- ... file1.csv specifies to read from an input file instead of stdin
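Note that /jpn|por/ is an unanchored match. If column #2 must equal jpn or por exactly, a stricter variant (same comma-separator assumption) compares the whole field instead:

awk -F, '$2 == "jpn" || $2 == "por"' file1.csv > file2.csv

A pattern without an action prints the matching line by default, and string equality avoids accidental substring hits such as a column value like sport, which contains por.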
grep ';jpn;\|;por;' /path/to/file1.csv > file2.csv
Assuming that the semicolon (;) is the field separator.
I think grep is a limited solution for this problem, because it doesn't account for different separators or for separators escaped inside quoted fields.
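If you want to stay on the command line for that case, a CSV-aware sketch, assuming the csvkit package is installed:

csvgrep -c 2 -r '^(jpn|por)$' file1.csv > file2.csv

-c 2 selects the second column and -r applies the regular expression to that field only; add --no-header-row if the file has no header line.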
I'd suggest you check out TextQ (disclaimer - I'm its developer). It can import a big CSV file and allows you to manage its schema/structure.
You can perform queries via a UI Query Builder or via SQL (select, join, group by, etc).
Here is a video demo and a more detailed tutorial.
Finally, you can export any query to a CSV file, which can be imported into MS Excel or other tools. You can get it from the Mac App Store or the Microsoft Store (soon).