Efficiently search sorted file
I have a large file containing one string on each line. I would like to be able to quickly determine if a string is in the file. Ideally, this would be done using a binary chop type algorithm.
Some Googling revealed the look command with the -b flag, which promises to locate and output all lines beginning with a given prefix using a binary search. Unfortunately, it doesn't seem to work correctly and returns no results for strings that I know are in the file (they are found by an equivalent grep search).
Does anyone know of another utility or strategy to search this file efficiently?
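For reference, this is the kind of binary chop I have in mind: a rough Python sketch, assuming the file is sorted in plain byte order (e.g. sorted with LC_ALL=C) and that I want an exact, full-line match.

```python
import os

def line_in_sorted_file(path, target):
    """Binary chop over a byte-sorted file: repeatedly probe the middle
    of the remaining byte range and compare the first full line there."""
    target = target.encode()

    def probe(f, pos):
        # Return (line, start) for the first complete line that starts at
        # or after byte offset `pos`, or (None, start) if we hit EOF.
        if pos == 0:
            f.seek(0)
        else:
            f.seek(pos - 1)
            f.readline()            # skip forward to the next line start
        start = f.tell()
        raw = f.readline()
        return (raw.rstrip(b"\n") if raw else None), start

    with open(path, "rb") as f:
        lo, hi = 0, os.fstat(f.fileno()).st_size
        # Find the smallest offset whose following line is >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            line, _ = probe(f, mid)
            if line is None or line >= target:
                hi = mid
            else:
                lo = mid + 1
        line, _ = probe(f, lo)
        return line == target

# Hypothetical usage; the file name and search string are made up:
# print(line_in_sorted_file("strings.txt", "somestring"))
```

The comparison is done on raw bytes, so the file has to be sorted in the same byte order; a file sorted under a locale-aware collation could make the chop probe the wrong half and miss lines.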
There's an essential difference between grep and look:

Unless explicitly stated otherwise, grep will find the pattern anywhere within a line, not just at the start. For look, the manpage states:
look — display lines beginning with a given string
I don't use look very often, but it worked fine on a trivial example I just tried.
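To make that concrete, here is a tiny Python sketch of the two matching semantics, substring anywhere in the line versus prefix at the start of the line; the sample lines are invented:

```python
lines = ["apple", "banana", "cherry"]   # pretend these are lines from the sorted file
needle = "ana"

# grep-style: the pattern may occur anywhere within the line
print([l for l in lines if needle in l])           # ['banana']

# look-style: the string has to appear at the very beginning of the line
print([l for l in lines if l.startswith(needle)])  # []
```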
This may be a bit of a late answer, but sgrep will help you.
Sgrep (sorted grep) searches sorted input files for lines that match a search key and outputs the matching lines. When searching large files sgrep is much faster than traditional Unix grep, but with significant restrictions.
- All input files must be sorted regular files.
- The sort key must start at the beginning of the line.
- The search key matches only at the beginning of the line.
- No regular expression support.
You can download the source here: https://sourceforge.net/projects/sgrep/?source=typ_redirect
and the documentation here: http://sgrep.sourceforge.net/
Another way:
I don't know how large the file is. Maybe you should try parallel:
https://stackoverflow.com/questions/9066609/fastest-possible-grep
I always grep files larger than 100 GB this way, and it works well.
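If GNU parallel isn't available, the same divide-the-file-and-scan idea can be sketched in Python with multiprocessing. This is only an illustration, not a drop-in replacement: it does a plain linear scan of each byte range (it doesn't exploit the sorting at all), and the function and parameter names are invented for the example.

```python
import os
from multiprocessing import Pool

def _scan_chunk(args):
    """Check the lines that start inside the byte range [start, end)
    of the file for an exact, full-line match."""
    path, start, end, target = args
    with open(path, "rb") as f:
        if start == 0:
            f.seek(0)
        else:
            f.seek(start - 1)
            f.readline()            # move to the first line starting at or after `start`
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            if line.rstrip(b"\n") == target:
                return True
    return False

def parallel_contains(path, target, jobs=None):
    """Split the file into roughly equal byte ranges and scan them in parallel."""
    jobs = jobs or os.cpu_count() or 4
    size = os.path.getsize(path)
    if size == 0:
        return False
    step = -(-size // jobs)         # ceiling division so the ranges cover the whole file
    chunks = [(path, off, min(off + step, size), target.encode())
              for off in range(0, size, step)]
    with Pool(len(chunks)) as pool:
        return any(pool.map(_scan_chunk, chunks))

# Hypothetical usage:
# print(parallel_contains("bigfile.txt", "somestring"))
```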