Check if all of multiple strings or regexes exist in a file

I want to check if all of my strings exist in a text file. They could exist on the same line or on different lines. And partial matches should be OK. Like this:

...
string1
...
string2
...
string3
...
string1 string2
...
string1 string2 string3
...
string3 string1 string2
...
string2 string3
... and so on

In the above example, we could have regexes in place of strings.

For example, the following code checks if any of my strings exists in the file:

if grep -EFq "string1|string2|string3" file; then
  # there is at least one match
fi

How to check if all of them exist? Since we are just interested in the presence of all matches, we should stop reading the file as soon all strings are matched.

Is it possible to do it without having to invoke grep multiple times (which won't scale when input file is large or if we have a large number of strings to match) or use a tool like awk or python?

Also, is there a solution for strings that can easily be extended for regexes?

Awk is the tool that the guys who invented grep, shell, etc. invented to do general text manipulation jobs like this so not sure why you'd want to try to avoid it.

In case brevity is what you're looking for, here's the GNU awk one-liner to do just what you asked for:

awk 'NR==FNR{a[$0];next} {for(s in a) if(!index($0,s)) exit 1}' strings RS='^$' file

And here's a bunch of other information and options:

Assuming you're really looking for strings, it'd be:

awk -v strings='string1 string2 string3' '
BEGIN {
    numStrings = split(strings,tmp)
    for (i in tmp) strs[tmp[i]]
}
numStrings == 0 { exit }
{
    for (str in strs) {
        if ( index($0,str) ) {
            delete strs[str]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file

the above will stop reading the file as soon as all strings have matched.

If you were looking for regexps instead of strings then with GNU awk for multi-char RS and retention of $0 in the END section you could do:

awk -v RS='^$' 'END{exit !(/regexp1/ && /regexp2/ && /regexp3/)}' file

Actually, even if it were strings you could do:

awk -v RS='^$' 'END{exit !(index($0,"string1") && index($0,"string2") && index($0,"string3"))}' file

The main issue with the above 2 GNU awk solutions is that, like @anubhava's GNU grep -P solution, the whole file has to be read into memory at one time whereas with the first awk script above, it'll work in any awk in any shell on any UNIX box and only stores one line of input at a time.

I see you've added a comment under your question to say you could have several thousand "patterns". Assuming you mean "strings" then instead of passing them as arguments to the script you could read them from a file, e.g. with GNU awk for multi-char RS and a file with one search string per line:

awk '
NR==FNR { strings[$0]; next }
{
    for (string in strings)
        if ( !index($0,string) )
            exit 1
}
' file_of_strings RS='^$' file_to_be_searched

and for regexps it'd be:

awk '
NR==FNR { regexps[$0]; next }
{
    for (regexp in regexps)
        if ( $0 !~ regexp )
            exit 1
}
' file_of_regexps RS='^$' file_to_be_searched

If you don't have GNU awk and your input file does not contain NUL characters then you can get the same effect as above by using RS='\0' instead of RS='^$' or by appending to variable one line at a time as it's read and then processing that variable in the END section.

If your file_to_be_searched is too large to fit in memory then it'd be this for strings:

awk '
NR==FNR { strings[$0]; numStrings=NR; next }
numStrings == 0 { exit }
{
    for (string in strings) {
        if ( index($0,string) ) {
            delete strings[string]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file_of_strings file_to_be_searched

and the equivalent for regexps:

awk '
NR==FNR { regexps[$0]; numRegexps=NR; next }
numRegexps == 0 { exit }
{
    for (regexp in regexps) {
        if ( $0 ~ regexp ) {
            delete regexps[regexp]
            numRegexps--
        }
    }
}
END { exit (numRegexps ? 1 : 0) }
' file_of_regexps file_to_be_searched

`git grep`

Here is the syntax using git grep with multiple patterns:

git grep --all-match --no-index -l -e string1 -e string2 -e string3 file

You may also combine patterns with Boolean expressions such as --and, --or and --not.

Check man git-grep for help.

--all-match When giving multiple pattern expressions, this flag is specified to limit the match to files that have lines to match all of them.

--no-index Search files in the current directory that is not managed by Git.

-l/--files-with-matches/--name-only Show only the names of files.

-e The next parameter is the pattern. Default is to use basic regexp.

Other params to consider:

--threads Number of grep worker threads to use.

-q/--quiet/--silent Do not output matched lines; exit with status 0 when there is a match.

To change the pattern type, you may also use -G/--basic-regexp (default), -F/--fixed-strings, -E/--extended-regexp, -P/--perl-regexp, -f file, and other.

This gnu-awk script may work:

cat fileSearch.awk
re == "" {
   exit
}
{
   split($0, null, "\\<(" re "\\>)", b)
   for (i=1; i<=length(b); i++)
      gsub("\\<" b[i] "([|]|$)", "", re)
}
END {
   exit (re != "")
}

Then use it as:

if awk -v re='string1|string2|string3' -f fileSearch.awk file; then
   echo "all strings were found"
else
   echo "all strings were not found"
fi

Alternatively, you can use this gnu grep solution with PCRE option:

grep -qzP '(?s)(?=.*\bstring1\b)(?=.*\bstring2\b)(?=.*\bstring3\b)' file

Using -z we make grep read complete file into a single string.
We are using multiple lookahead assertions to assert that all the strings are present in the file.
Regex must use (?s) or DOTALL mod to make .* match across the lines.

As per man grep:

-z, --null-data
   Treat  input  and  output  data as sequences of lines, each terminated by a 
   zero byte (the ASCII NUL character) instead of a newline.

First, you probably want to use awk. Since you eliminated that option in the question statement, yes, it is possible to do and this provides a way to do it. It is likely MUCH slower than using awk, but if you want to do it anyway...

This is based on the following assumptions:G

Invoking AWK is unacceptable
Invoking grep multiple times is unacceptable
The use of any other external tools are unacceptable
Invoking grep less than once is acceptable
It must return success if everything is found, failure when not
Using bash instead of external tools is acceptable
bash version is >= 3 for the regular expression version

This might meet all of your requirements: (regex version miss some comments, look at string version instead)

#!/bin/bash

multimatch() {
    filename="$1" # Filename is first parameter
    shift # move it out of the way that "$@" is useful
    strings=( "$@" ) # search strings into an array

    declare -a matches # Array to keep track which strings already match

    # Initiate array tracking what we have matches for
    for ((i=0;i<${#strings[@]};i++)); do
        matches[$i]=0
    done

    while IFS= read -r line; do # Read file linewise
        foundmatch=0 # Flag to indicate whether this line matched anything
        for ((i=0;i<${#strings[@]};i++)); do # Loop through strings indexes
            if [ "${matches[$i]}" -eq 0 ]; then # If no previous line matched this string yet
                string="${strings[$i]}" # fetch the string
                if [[ $line = *$string* ]]; then # check if it matches
                    matches[$i]=1   # mark that we have found this
                    foundmatch=1    # set the flag, we need to check whether we have something left
                fi
            fi
        done
        # If we found something, we need to check whether we
        # can stop looking
        if [ "$foundmatch" -eq 1 ]; then
            somethingleft=0 # Flag to see if we still have unmatched strings
            for ((i=0;i<${#matches[@]};i++)); do
                if [ "${matches[$i]}" -eq 0 ]; then
                    somethingleft=1 # Something is still outstanding
                    break # no need check whether more strings are outstanding
                fi
            done
            # If we didn't find anything unmatched, we have everything
            if [ "$somethingleft" -eq 0 ]; then return 0; fi
        fi
    done < "$filename"

    # If we get here, we didn't have everything in the file
    return 1
}

multimatch_regex() {
    filename="$1" # Filename is first parameter
    shift # move it out of the way that "$@" is useful
    regexes=( "$@" ) # Regexes into an array

    declare -a matches # Array to keep track which regexes already match

    # Initiate array tracking what we have matches for
    for ((i=0;i<${#regexes[@]};i++)); do
        matches[$i]=0
    done

    while IFS= read -r line; do # Read file linewise
        foundmatch=0 # Flag to indicate whether this line matched anything
        for ((i=0;i<${#strings[@]};i++)); do # Loop through strings indexes
            if [ "${matches[$i]}" -eq 0 ]; then # If no previous line matched this string yet
                regex="${regexes[$i]}" # Get regex from array
                if [[ $line =~ $regex ]]; then # We use the bash regex operator here
                    matches[$i]=1   # mark that we have found this
                    foundmatch=1    # set the flag, we need to check whether we have something left
                fi
            fi
        done
        # If we found something, we need to check whether we
        # can stop looking
        if [ "$foundmatch" -eq 1 ]; then
            somethingleft=0 # Flag to see if we still have unmatched strings
            for ((i=0;i<${#matches[@]};i++)); do
                if [ "${matches[$i]}" -eq 0 ]; then
                    somethingleft=1 # Something is still outstanding
                    break # no need check whether more strings are outstanding
                fi
            done
            # If we didn't find anything unmatched, we have everything
            if [ "$somethingleft" -eq 0 ]; then return 0; fi
        fi
    done < "$filename"

    # If we get here, we didn't have everything in the file
    return 1
}

if multimatch "filename" string1 string2 string3; then
    echo "file has all strings"
else
    echo "file miss one or more strings"
fi

if multimatch_regex "filename" "regex1" "regex2" "regex3"; then
    echo "file match all regular expressions"
else
    echo "file does not match all regular expressions"
fi

Benchmarks

I did some benchmarking searching .c,.h and .sh in arch/arm/ from Linux 4.16.2 for the strings "void", "function", and "#define". (Shell wrappers were added/ the code tuned that all can be called as testname <filename> <searchstring> [...] and that an if can be used to check the result)

Results: (measured with time, real time rounded to closest half second)

multimatch: 49s
multimatch_regex: 55s
matchall: 10.5s
fileMatchesAllNames: 4s
awk (first version): 4s
agrep: 4.5s
Perl re (-r): 10.5s
Perl non-re: 9.5s
Perl non-re optimised: 5s (Removed Getopt::Std and regex support for faster startup)
Perl re optimised: 7s (Removed Getopt::Std and non-regex support for faster startup)
git grep: 3.5s
C version (no regex): 1.5s

(Invoking grep multiple times, especially with the recursive method, did better than I expected)

Check if all of multiple strings or regexes exist in a file

`git grep`

Benchmarks

Related

Recent Posts