Is there a way to split strings with String.split() and include the delimiters? [duplicate]

I have a multiline string which is delimited by a set of different delimiters:

(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)

I can split this string into its parts, using String.split, but it seems that I can't get the actual string, which matched the delimiter regex.

In other words, this is what I get:

  • Text1
  • Text2
  • Text3
  • Text4

This is what I want

  • Text1
  • DelimiterA
  • Text2
  • DelimiterC
  • Text3
  • DelimiterB
  • Text4

Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?


You can use lookahead and lookbehind, which are features of regular expressions.

System.out.println(Arrays.toString("a;b;c;d".split("(?<=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("(?=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("((?<=;)|(?=;))")));

And you will get:

[a;, b;, c;, d]
[a, ;b, ;c, ;d]
[a, ;, b, ;, c, ;, d]

The last one is what you want.

((?<=;)|(?=;)) equals to select an empty character before ; or after ;.

EDIT: Fabian Steeg's comments on readability is valid. Readability is always a problem with regular expressions. One thing I do to make regular expressions more readable is to create a variable, the name of which represents what the regular expression does. You can even put placeholders (e.g. %1$s) and use Java's String.format to replace the placeholders with the actual string you need to use; for example:

static public final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

public void someMethod() {
    final String[] aEach = "a;b;c;d".split(String.format(WITH_DELIMITER, ";"));
    ...
}

You want to use lookarounds, and split on zero-width matches. Here are some examples:

public class SplitNDump {
    static void dump(String[] arr) {
        for (String s : arr) {
            System.out.format("[%s]", s);
        }
        System.out.println();
    }
    public static void main(String[] args) {
        dump("1,234,567,890".split(","));
        // "[1][234][567][890]"
        dump("1,234,567,890".split("(?=,)"));   
        // "[1][,234][,567][,890]"
        dump("1,234,567,890".split("(?<=,)"));  
        // "[1,][234,][567,][890]"
        dump("1,234,567,890".split("(?<=,)|(?=,)"));
        // "[1][,][234][,][567][,][890]"

        dump(":a:bb::c:".split("(?=:)|(?<=:)"));
        // "[][:][a][:][bb][:][:][c][:]"
        dump(":a:bb::c:".split("(?=(?!^):)|(?<=:)"));
        // "[:][a][:][bb][:][:][c][:]"
        dump(":::a::::b  b::c:".split("(?=(?!^):)(?<!:)|(?!:)(?<=:)"));
        // "[:::][a][::::][b  b][::][c][:]"
        dump("a,bb:::c  d..e".split("(?!^)\\b"));
        // "[a][,][bb][:::][c][  ][d][..][e]"

        dump("ArrayIndexOutOfBoundsException".split("(?<=[a-z])(?=[A-Z])"));
        // "[Array][Index][Out][Of][Bounds][Exception]"
        dump("1234567890".split("(?<=\\G.{4})"));   
        // "[1234][5678][90]"

        // Split at the end of each run of letter
        dump("Boooyaaaah! Yippieeee!!".split("(?<=(?=(.)\\1(?!\\1))..)"));
        // "[Booo][yaaaa][h! Yipp][ieeee][!!]"
    }
}

And yes, that is triply-nested assertion there in the last pattern.

Related questions

  • Java split is eating my characters.
  • Can you use zero-width matching regex in String split?
  • How do I convert CamelCase into human-readable names in Java?
  • Backreferences in lookbehind

See also

  • regular-expressions.info/Lookarounds

A very naive solution, that doesn't involve regex would be to perform a string replace on your delimiter along the lines of (assuming comma for delimiter):

string.replace(FullString, "," , "~,~")

Where you can replace tilda (~) with an appropriate unique delimiter.

Then if you do a split on your new delimiter then i believe you will get the desired result.