Grepping a huge file (80GB): any way to speed it up?

 grep -i -A 5 -B 5 'db_pd.Clients'  eightygigsfile.sql

This has been running for an hour on a fairly powerful Linux server which is otherwise not overloaded. Is there any alternative to grep? Is there anything about my syntax that can be improved (are egrep or fgrep better)?

The file is actually in a directory that is shared via a mount with another server, but the actual disk space is local, so that shouldn't make any difference?

The grep is using up to 93% CPU.


Solution 1:

Here are a few options:

1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.

2) Use fgrep because you're searching for a fixed string, not a regular expression.

3) Remove the -i option if you don't need it.

So your command becomes:

LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql

It will also be faster if you copy your file to a RAM disk.
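
If you have enough free RAM to hold the file (a big assumption for 80 GB), a tmpfs mount works as a RAM disk; the /mnt/ramdisk path and 90G size below are arbitrary choices, not requirements:

# Sketch only: needs root and roughly 90 GB of free RAM; adjust size and path.
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=90G tmpfs /mnt/ramdisk
cp eightygigsfile.sql /mnt/ramdisk/
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' /mnt/ramdisk/eightygigsfile.sql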

Solution 2:

If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:

< eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'

Depending on your disks and CPUs it may be faster to read larger blocks:

< eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'

It's not entirely clear from your question, but other options for grep include (combined in the sketch below):

  • Dropping the -i flag.
  • Using the -F flag for a fixed string.
  • Disabling NLS with LANG=C.
  • Setting a maximum number of matches with the -m flag.
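
Putting those options together with the parallel approach above (a sketch; LC_ALL=C takes precedence over LANG, and the -m 1000 limit is an arbitrary example that applies per 10M block when run under parallel, not to the whole file):

< eightygigsfile.sql LC_ALL=C parallel --pipe --block 10M grep -F -m 1000 -C 5 'db_pd.Clients'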

Solution 3:

Some trivial improvements (combined in the example below):

  • Remove the -i option if you can; case-insensitive matching is quite slow.

  • Replace the . with \.

    A bare . is the regex metacharacter that matches any character, which also makes matching slower.
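
Putting both suggestions together (assuming you can drop -i):

grep -A 5 -B 5 'db_pd\.Clients' eightygigsfile.sql

Alternatively, -F treats the whole pattern as a literal string, so the dot no longer needs escaping:

grep -F -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql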

Solution 4:

Two lines of attack:

  • Are you sure you need the -i, or do you have a way to get rid of it?
  • Do you have more cores to play with? grep is single-threaded, so you might want to start several greps at different offsets (see the sketch below).
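
A rough sketch of the multi-grep idea, assuming GNU split and enough spare disk space for the chunk files (the chunk_ prefix and .out suffix are arbitrary names):

# Split on line boundaries into 4 pieces, grep each in the background, then merge.
# Caveat: -A/-B context that spans a chunk boundary may be lost.
split -n l/4 eightygigsfile.sql chunk_
for f in chunk_*; do
    LC_ALL=C grep -F -A 5 -B 5 'db_pd.Clients' "$f" > "$f.out" &
done
wait
cat chunk_*.out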