Grep to filter gigantic CSV file

I'm dealing with a CSV file that contains over 2 million lines. Quite a big one.

I need to use grep (or any better way) to retrieve the entire row if the value of the second column matches 'jpn' or 'por'.

I tried using grep '<\jpn\>' || grep '<\por\>' file1.csv > file2.csv but had no success so far. The major issue is that 'por' is a common occurrence in the third column, which generates over a million undesired rows.

Would a more experienced user be so kind as to lend me some help?

Thanks in advance! :D


Which OS are you on? Try egrep if you want extended regular expressions on a system whose grep is outdated (e.g. Solaris).

Anyway, here is an awk solution:

awk -F, '$2 ~ /jpn|por/ {print}' file1.csv > file2.csv

Explanations:

  • awk for column-based operations
  • -F, to define the column separator/tokenizer, in this case I use a comma
  • $2 ~ /jpn|por/ tests column #2 with expression /jpn|por/
    • $2 is column #2
    • /jpn|por/ is a regular expression to match jpn or por
  • {print} specifies what awk should output when it finds a matching line
    • print to print the whole input line (alternatively, print $3 will just print column #3)
  • ... file1.csv specifies to read from an input file instead of stdin
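
Note that /jpn|por/ matches the substring anywhere inside column #2. If the column must equal exactly jpn or por, a stricter variant (a sketch, assuming plain unquoted fields) is:

awk -F, '$2 == "jpn" || $2 == "por"' file1.csv > file2.csv

A pattern with no action prints the whole matching line, so the explicit {print} can be dropped here.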

grep ';jpn;\|;por;' /path/to/file1.csv > file2.csv

Assuming that semicolon (;) is the separator.
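
If the file is actually comma-separated with plain unquoted fields (an assumption, since the question doesn't show a sample row), an anchored variant that restricts the match to the second column would be:

grep -E '^[^,]*,(jpn|por),' file1.csv > file2.csv

Anchoring at the start of the line keeps the match on the second column, which avoids the original problem of 'por' also appearing in the third column.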


I think grep is a limited solution for this problem, because it doesn't account for different separators or for separators that are escaped inside quoted fields.
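
For example, a hypothetical row with a quoted field containing an embedded comma (not taken from the question's data) gets split at the wrong place by a plain comma delimiter:

echo 'id1,"por, escaped",jpn' | awk -F, '{print $2}'
# prints "por  -- the quoted field is broken at the embedded comma

Both grep and awk treat every comma as a separator, regardless of quoting.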

I'd suggest you check out TextQ (disclaimer - I'm its developer). It can import a big CSV file and allows you to manage its schema/structure.

You can perform queries via a UI Query Builder or via SQL (select, join, group by, etc).

Here is a video demo and a more detailed tutorial.

Finally, you can export any query to a CSV file, which can be imported into MS Excel or other tools. You can get it from the Mac App Store or Microsoft Store (soon).