Java Replacing multiple different substring in a string at once (or in the most efficient way)

I need to replace many different sub-string in a string in the most efficient way. is there another way other then the brute force way of replacing each field using string.replace ?


Solution 1:

If the string you are operating on is very long, or you are operating on many strings, then it could be worthwhile using a java.util.regex.Matcher (this requires time up-front to compile, so it won't be efficient if your input is very small or your search pattern changes frequently).

Below is a full example, based on a list of tokens taken from a map. (Uses StringUtils from Apache Commons Lang).

Map<String,String> tokens = new HashMap<String,String>();
tokens.put("cat", "Garfield");
tokens.put("beverage", "coffee");

String template = "%cat% really needs some %beverage%.";

// Create pattern of the format "%(cat|beverage)%"
String patternString = "%(" + StringUtils.join(tokens.keySet(), "|") + ")%";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(template);

StringBuffer sb = new StringBuffer();
while(matcher.find()) {
    matcher.appendReplacement(sb, tokens.get(matcher.group(1)));
}
matcher.appendTail(sb);

System.out.println(sb.toString());

Once the regular expression is compiled, scanning the input string is generally very quick (although if your regular expression is complex or involves backtracking then you would still need to benchmark in order to confirm this!)

Solution 2:

Algorithm

One of the most efficient ways to replace matching strings (without regular expressions) is to use the Aho-Corasick algorithm with a performant Trie (pronounced "try"), fast hashing algorithm, and efficient collections implementation.

Simple Code

A simple solution leverages Apache's StringUtils.replaceEach as follows:

  private String testStringUtils(
    final String text, final Map<String, String> definitions ) {
    final String[] keys = keys( definitions );
    final String[] values = values( definitions );

    return StringUtils.replaceEach( text, keys, values );
  }

This slows down on large texts.

Fast Code

Bor's implementation of the Aho-Corasick algorithm introduces a bit more complexity that becomes an implementation detail by using a façade with the same method signature:

  private String testBorAhoCorasick(
    final String text, final Map<String, String> definitions ) {
    // Create a buffer sufficiently large that re-allocations are minimized.
    final StringBuilder sb = new StringBuilder( text.length() << 1 );

    final TrieBuilder builder = Trie.builder();
    builder.onlyWholeWords();
    builder.removeOverlaps();

    final String[] keys = keys( definitions );

    for( final String key : keys ) {
      builder.addKeyword( key );
    }

    final Trie trie = builder.build();
    final Collection<Emit> emits = trie.parseText( text );

    int prevIndex = 0;

    for( final Emit emit : emits ) {
      final int matchIndex = emit.getStart();

      sb.append( text.substring( prevIndex, matchIndex ) );
      sb.append( definitions.get( emit.getKeyword() ) );
      prevIndex = emit.getEnd() + 1;
    }

    // Add the remainder of the string (contains no more matches).
    sb.append( text.substring( prevIndex ) );

    return sb.toString();
  }

Benchmarks

For the benchmarks, the buffer was created using randomNumeric as follows:

  private final static int TEXT_SIZE = 1000;
  private final static int MATCHES_DIVISOR = 10;

  private final static StringBuilder SOURCE
    = new StringBuilder( randomNumeric( TEXT_SIZE ) );

Where MATCHES_DIVISOR dictates the number of variables to inject:

  private void injectVariables( final Map<String, String> definitions ) {
    for( int i = (SOURCE.length() / MATCHES_DIVISOR) + 1; i > 0; i-- ) {
      final int r = current().nextInt( 1, SOURCE.length() );
      SOURCE.insert( r, randomKey( definitions ) );
    }
  }

The benchmark code itself (JMH seemed overkill):

long duration = System.nanoTime();
final String result = testBorAhoCorasick( text, definitions );
duration = System.nanoTime() - duration;
System.out.println( elapsed( duration ) );

1,000,000 : 1,000

A simple micro-benchmark with 1,000,000 characters and 1,000 randomly-placed strings to replace.

  • testStringUtils: 25 seconds, 25533 millis
  • testBorAhoCorasick: 0 seconds, 68 millis

No contest.

10,000 : 1,000

Using 10,000 characters and 1,000 matching strings to replace:

  • testStringUtils: 1 seconds, 1402 millis
  • testBorAhoCorasick: 0 seconds, 37 millis

The divide closes.

1,000 : 10

Using 1,000 characters and 10 matching strings to replace:

  • testStringUtils: 0 seconds, 7 millis
  • testBorAhoCorasick: 0 seconds, 19 millis

For short strings, the overhead of setting up Aho-Corasick eclipses the brute-force approach by StringUtils.replaceEach.

A hybrid approach based on text length is possible, to get the best of both implementations.

Implementations

Consider comparing other implementations for text longer than 1 MB, including:

  • https://github.com/RokLenarcic/AhoCorasick
  • https://github.com/hankcs/AhoCorasickDoubleArrayTrie
  • https://github.com/raymanrt/aho-corasick
  • https://github.com/ssundaresan/Aho-Corasick
  • https://github.com/jmhsieh/aho-corasick
  • https://github.com/quest-oss/Mensa

Papers

Papers and information relating to the algorithm:

  • http://www.cs.uku.fi/research/publications/reports/A-2005-2.pdf
  • https://pdfs.semanticscholar.org/3547/ac839d02f6efe3f6f76a8289738a22528442.pdf
  • http://www.ece.ncsu.edu/asic/ece792A/2009/ECE792A/Readings_files/00989753.pdf
  • http://blog.ivank.net/aho-corasick-algorithm-in-as3.html