Anything faster than grep? [closed]
I am looking for a tool that will be faster than grep, maybe a multi-threaded grep, or something similar... I have been looking at a bunch of indexers, but I am not sold that I need an index...
I have about 100 million text files that I need to grep for exact string matches. Upon finding a string match, I need the filename where the match was found.
ie: grep -r 'exact match' > filepaths.log
It's about 4TB of data. I started my first search 6 days ago, and grep is still running. I have another dozen searches to go and I can't wait 2 months to retrieve all these filenames =]
I've reviewed the following, however, I don't think I need all the bells and whistles these indexers come with, I just need the filename where the match occurred...
- dtSearch
- Terrier
- Lucene
- Xapian
- Recoil
- Sphinx
and after spending hours reading about all those engines, my head is spinning, and I wish I just had a multi-threaded grep lol, any ideas, and/or suggestions are greatly appreciated!
PS: I am running CentOS 6.5
EDIT: Searching for multi-threaded grep returns several items, My question is, is a multi-threaded grep the best option for what I am doing?
EDIT2: After some tweaking, this is what I have come up with, and it is going much faster than the regular grep. I still wish it was faster though... I am watching my disk I/O wait, and it's not building up yet. I may do some more tweaking, and I'm definitely still interested in any suggestions =]
find . -type f -print0 | xargs -0 -n10 -P4 grep -m 1 -H -l 'search string'
grep is I/O bound, meaning its speed is dominated by how fast it can read the files it is searching. Multiple searches in parallel can compete with each other for disk I/O, so you may not see much speedup.
If you just need matching filenames, and not the actual matches found in the files, then you should run grep with the -l flag. This flag causes grep to just print filenames that match, and not print the matching lines. The value here is that it permits grep to stop searching a file once it has found a match, so it could reduce the amount of work that grep has to do.
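To see what -l does, here is a small sketch. The temporary directory and the file names are made up for the demo and are not part of the original setup:

```shell
# Hypothetical demo directory and files, for illustration only.
demo=$(mktemp -d)
printf 'needle\nneedle again\n' > "$demo/a.txt"
printf 'nothing here\n' > "$demo/b.txt"

# -r recurses into the directory; -l prints only the names of files
# that contain at least one match, not the matching lines themselves.
matches=$(grep -rl 'needle' "$demo")
echo "$matches"
```

Even though a.txt contains two matching lines, its name is printed once.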
If you're searching for fixed strings rather than regular expressions, then you could try using fgrep rather than grep. Fgrep is a variant of grep that searches for fixed strings, and searching for fixed strings is faster than running a regular expression search. You may or may not see any improvement from this, because modern versions of grep are probably smart enough to optimize fixed-string searches anyway.
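A quick sketch of the difference, using grep -F, which is the equivalent spelling of fgrep (recent GNU grep releases print an obsolescence warning for fgrep). The demo directory, file names and pattern are invented for illustration:

```shell
# Hypothetical demo directory and files, for illustration only.
demo=$(mktemp -d)
printf 'price: a.b\n' > "$demo/literal.txt"
printf 'price: axb\n' > "$demo/regex.txt"

# As a regular expression, "." matches any character, so both files match.
regex_hits=$(grep -rl 'a.b' "$demo" | sort)

# With -F the pattern is taken as a fixed string, so only the file that
# contains the literal text "a.b" matches.
fixed_hits=$(grep -rlF 'a.b' "$demo")

echo "regex: $regex_hits"
echo "fixed: $fixed_hits"
```

Besides any speed difference, -F also protects you from search strings that happen to contain regex metacharacters.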
If you want to try running multiple searches in parallel, you could do it using shell utilities. One way would be to build a list of filenames, split it into parts, and run grep separately for each list:
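# List the files, then split the list into chunks of ten million names each.
# The "-" tells split to read standard input; "list." is the output prefix.
find /path/to/files -type f -print | split -l 10000000 - list.
for file in list.*; do
    # Feed the names in this chunk to grep as arguments (-d is GNU xargs).
    xargs -d '\n' grep -l 'some text' < "${file}" > "${file}.out" &
done
wait
cat list.*.out > filepaths.log
rm list.*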
This uses find to find the files, splits the list of filenames into groups of ten million, and runs grep in parallel for each group. The outputs of the greps are all joined together at the end. This ought to work for files with typical names, but it would fail for files that have newlines in their names, for example.
Another approach uses xargs. First, you'd have to write a simple shell script that runs grep in the background:
#!/bin/bash
grep -l 'search text' "$@" >> grep.$$.out &
This will run grep on the list of files specified as arguments to the script, writing the result to a file named after the process's PID. The grep process runs in the background.
Then you'd run the script like this:
find /path/to/files -type f -print0 | xargs -0 -r /my/grep/script
[ wait for those to finish ]
cat grep.*.out > filepaths.log
rm grep.*.out
In this case, xargs will bundle the filenames into groups and run the script once for each group. The script will run an instance of grep once for each group. Once all of the grep instances have finished, you can combine their outputs. Unfortunately, I couldn't think of a clever way to automatically wait for the grep instances to finish here, so you might have to do that manually.
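One possible way around the manual wait, as a sketch rather than a tested recipe for the 4TB case: let xargs manage the parallelism itself with -P (a GNU/BSD extension) and keep grep in the foreground, since xargs only returns once every batch has exited. The demo directory, the batch size of 1000 and the job count of 4 are assumptions to tune for your disks:

```shell
# Hypothetical demo data and output directory, for illustration only.
demo=$(mktemp -d)
printf 'search text\n' > "$demo/hit.txt"
printf 'other\n' > "$demo/miss.txt"
out=$(mktemp -d)

# -P 4 runs up to four batches at once; -n 1000 caps the files per batch.
# Each batch runs in its own sh: $0 is the output prefix passed after the
# command string, "$@" is the batch of filenames, and $$ is that shell's PID.
# xargs itself waits for all batches, so no separate wait step is needed.
find "$demo" -type f -print0 |
  xargs -0 -r -n 1000 -P 4 sh -c 'grep -lF -- "search text" "$@" >> "$0.$$.out"' "$out/grep"

# Combine the per-batch results.
cat "$out"/grep.*.out > "$out/filepaths.log"
cat "$out/filepaths.log"
```

Note that xargs exits with a non-zero status if any batch had no matches (grep returns 1 in that case), so a non-zero exit here does not necessarily mean something went wrong.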