Fast algorithm for searching for substrings in a string
I'd like an efficient algorithm (or library) that I can use in Java to search for substrings in a string.
What I would like to do is:
Given an input string - INSTR:
"BCDEFGH"
And a set of candidate strings - CAND:
"AB", "CDE", "FG", "H", "IJ"
Find any CAND strings that match as substrings within INSTR
In this example I would match "CDE", "FG", and "H" (but not "AB" and "IJ")
There could be many thousand candidate strings (in CAND), but more importantly I will be doing this search many millions of times so I need it to be FAST.
I'd like to work with char arrays. Also, I'm not intested in architectural solutions, like distributing the search - just the most efficient function/algorithm for doing it locally.
Additionally, all the strings in CAND and INSTR will all be relatively small (< 50 chars) - i.e. the target string INSTR is NOT long relative to the candidate strings.
Update I should have mentioned, the set of CAND strings is invariant across all values of INSTR.
Update I only need to know that there was a match - and i don't need to know what the match was.
Final Update I opted to try AhoCorsick and Rabin-Karp, due to simplicity of implementation. Because I have variable length patterns I used a modified Rabin-Karp that hashes the first n characters of each pattern, where n is the length of the smallest pattern, N was then the length of my rolling substring search window. For the Aho Corsick I used this
In my test i searched for 1000 patterns in two documents news paper articles, averaged across 1000 iterations etc... Normalised times to complete were:
AhoCorsick: 1
RabinKarp: 1.8
Naive Search (check each pattern & use string.contains): 50
*Some resources describing the algos mentioned in the answers below:
http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module5/module5.html
http://www.cs.princeton.edu/courses/archive/spr09/cos226/lectures/18SubstringSearch-2x2.pdf
http://www-igm.univ-mlv.fr/~lecroq/string/index.html*
Solution 1:
Read up on the Aho-Corasick algorithm and the Rabin-Karp algorithm.
If the input is not too large, you don't want to repeat the search many times and you do not have many patterns, it might be a good idea to use a single pattern algorithm several times. The Wikipedia article on search algorithms gives many algorithms with running and preprocessing times.
Implementations:
- https://hkn.eecs.berkeley.edu/~dyoo/java/index.html
- http://algs4.cs.princeton.edu/53substring/RabinKarp.java.html
- https://github.com/hankcs/AhoCorasickDoubleArrayTrie
- https://github.com/RokLenarcic/AhoCorasick
- https://github.com/robert-bor/aho-corasick
Presentations:
- http://www.slideshare.net/taka111/ahocorasick-string-matching-algorithm-15078438
Solution 2:
Convert the set of candidate strings into a deterministic finite state automaton and then run through the input string in linear time. Converting a single string into a DFS is well-covered in the standard books. You can convert a set of strings by first constructing a non-deterministic automaton and then determinizing it. That can create exponential blow-up in the worst case in the size of the automaton but the search afterwards is fast; especially if the target string is long and the candidates short that's going to work well.
Solution 3:
This is what regular expressions are for. As noted above, finite state automata are what you need, but that is exactly how a standard regexp-matcher is implemented.
In java you could write something like:
StringBuilder sb = new StringBuilder();
bool first = true;
for (String subStr : substrings) {
if (first)
first = false;
else
sb.append('|');
sb.append(escape(subStr));
}
Pattern p = Pattern.compile(sb.toString());
the method escape
should escape any characters which have special meanings in a regexp.
Solution 4:
Rabin-Karp multiple pattern search appears to be the fastest.