Split string to equal length substrings in Java

How can I split the string "Thequickbrownfoxjumps" into substrings of equal size in Java? For example, splitting "Thequickbrownfoxjumps" into chunks of size 4 should give the output:

["Theq","uick","brow","nfox","jump","s"]

Similar Question:

Split string into equal-length substrings in Scala


Solution 1:

Here's the regex one-liner version:

System.out.println(Arrays.toString(
    "Thequickbrownfoxjumps".split("(?<=\\G.{4})")
));

\G is a zero-width assertion that matches the position where the previous match ended. If there was no previous match, it matches the beginning of the input, the same as \A. The enclosing lookbehind matches the position that's four characters along from the end of the last match.
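
To make that explanation concrete, here is a small sketch that lists the positions where the lookbehind matches. The class and method names are made up for illustration; only the pattern comes from the one-liner above. On a standard JVM it should report 4, 8, 12, 16 and 20 for the sample string, which are exactly the points where split() cuts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitPositionsDemo {
    // Hypothetical helper: collects every index at which the zero-width
    // pattern (?<=\G.{4}) matches in the given text.
    static List<Integer> matchPositions(String text) {
        Matcher m = Pattern.compile("(?<=\\G.{4})").matcher(text);
        List<Integer> positions = new ArrayList<>();
        while (m.find()) {
            // The match is zero-width, so start() and end() are the same index.
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(matchPositions("Thequickbrownfoxjumps"));
    }
}
```

Each match is anchored four characters past the previous one because \G moves forward to wherever the last match ended.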

Both lookbehind and \G are advanced regex features, not supported by all flavors. Furthermore, \G is not implemented consistently across the flavors that do support it. This trick will work (for example) in Java, Perl, .NET and JGSoft, but not in PHP (PCRE), Ruby 1.9+ or TextMate (both Oniguruma). JavaScript's /y (sticky flag) isn't as flexible as \G, and couldn't be used this way even in engines that support lookbehind (added to the language in ES2018).

I should mention that I don't necessarily recommend this solution if you have other options. The non-regex solutions in the other answers may be longer, but they're also self-documenting; this one's just about the opposite of that. ;)

Also, this doesn't work on Android, whose regex implementation doesn't support the use of \G in lookbehinds.

Solution 2:

Well, it's fairly easy to do this with simple arithmetic and string operations:

import java.util.ArrayList;
import java.util.List;

public static List<String> splitEqually(String text, int size) {
    // Give the list the right capacity to start with: the expression below is
    // text.length() / size rounded up. You could use an array instead if you
    // wanted.
    List<String> ret = new ArrayList<String>((text.length() + size - 1) / size);

    // Step through the text in strides of `size`; the last piece may be shorter.
    for (int start = 0; start < text.length(); start += size) {
        ret.add(text.substring(start, Math.min(text.length(), start + size)));
    }
    return ret;
}
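
For what it's worth, the array variant hinted at in the comment could look roughly like this. It's only a sketch: the class name, the method name and the explicit size check are my own additions, not part of the code above.

```java
class ArraySplitter {
    // Hypothetical array-based variant of splitEqually.
    static String[] splitEquallyToArray(String text, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        // Number of pieces: text.length() / size, rounded up.
        int count = (text.length() + size - 1) / size;
        String[] parts = new String[count];
        for (int i = 0; i < count; i++) {
            int start = i * size;
            // The final piece may be shorter than size.
            parts[i] = text.substring(start, Math.min(text.length(), start + size));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(
            java.util.Arrays.toString(splitEquallyToArray("Thequickbrownfoxjumps", 4)));
    }
}
```

Because the number of pieces is known up front, the array can be allocated at exactly the right size and filled by index.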

Note: this assumes a 1:1 mapping of UTF-16 code unit (char, effectively) with "character". That assumption breaks down for characters outside the Basic Multilingual Plane, such as emoji, and (depending on how you want to count things) combining characters.
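
If surrogate pairs matter for your input, one possible approach is to count Unicode code points instead of chars. The sketch below does that; the class and method names are hypothetical, and it still treats each combining character as its own unit, so it doesn't address the grapheme-cluster case.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointSplitter {
    // Hypothetical helper: splits text into pieces of `size` code points,
    // so a surrogate pair (for example an emoji) is never cut in half.
    static List<String> splitByCodePoints(String text, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = start;
            // Advance by up to `size` code points, each of which is 1 or 2 chars.
            for (int i = 0; i < size && end < text.length(); i++) {
                end += Character.charCount(text.codePointAt(end));
            }
            parts.add(text.substring(start, end));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        // "a", an emoji, "b", the same emoji, "c": five code points, seven chars.
        System.out.println(splitByCodePoints("a\uD83D\uDE00b\uD83D\uDE00c", 2));
    }
}
```

With a size of 2, the sample in main splits into "a" plus the emoji, "b" plus the emoji, and "c", whereas the char-based version would cut through the second emoji.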

I don't think it's really worth using a regex for this.

EDIT: My reasoning for not using a regex:

  • This doesn't use any of the real pattern matching of regexes. It's just counting.
  • I suspect the above will be more efficient, although in most cases it won't matter.
  • If you need to use variable sizes in different places, you've either got repetition or a helper function to build the regex itself based on a parameter - ick.
  • The regex provided in another answer firstly didn't compile (invalid escaping), and then didn't work. My code worked first time. That's more a testament to the usability of regexes vs plain code, IMO.
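
To illustrate the third point, a helper that builds the regex from a size parameter might look something like the sketch below. The names are hypothetical, and it inherits the caveats of the regex answer: "." doesn't match line terminators by default, and \G inside a lookbehind isn't supported everywhere.

```java
import java.util.regex.Pattern;

class RegexChunker {
    // Hypothetical helper: builds the (?<=\G.{n}) pattern for a given size.
    static Pattern chunkPattern(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        return Pattern.compile("(?<=\\G.{" + size + "})");
    }

    // Splits text at every position that lies `size` characters past the
    // previous split point.
    static String[] split(String text, int size) {
        return chunkPattern(size).split(text);
    }

    public static void main(String[] args) {
        System.out.println(
            java.util.Arrays.toString(split("Thequickbrownfoxjumps", 4)));
    }
}
```

It works, but the string concatenation needed to build the pattern is a fair example of why the plain loop is easier to read.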