Java String.split() sometimes giving blank strings

Digging through the source code, I found the exact cause of this behaviour.

The String.split() method internally uses Pattern.split(). Before returning the resulting array, that method checks the index at which the last match ended. If that index is 0, either the pattern did not match at all or its only match was an empty string at the beginning of the input. In both cases the returned array is a single-element array containing the original input string.

Here's the source code:

public String[] split(CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        ArrayList<String> matchList = new ArrayList<String>();
        Matcher m = matcher(input);

        // Add segments before each match found
        while(m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                String match = input.subSequence(index, m.start()).toString();
                matchList.add(match);

                // Consider this assignment. For a single empty string match
                // m.end() will be 0, and hence index will also be 0
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                String match = input.subSequence(index,
                                                 input.length()).toString();
                matchList.add(match);
                index = m.end();
            }
        }

        // If no match was found, return this
        if (index == 0)
            return new String[] {input.toString()};

        // The rest of the method is not relevant here

If the last condition in the above code, index == 0, is true, the method returns a single-element array containing the input string.

Now, consider the cases when the index can be 0.

  1. When there is no match at all (as the comment above that condition already says).
  2. When a match is found at the beginning and the matched string has length 0. In that case the assignment in the if block (inside the while loop),

    index = m.end();

    sets index to 0. Only an empty match at position 0 can do this, which is exactly the case here. There must also be no further matches, otherwise index would be updated to a different value.

So, considering your cases:

  • For d%, there is just a single match for the pattern, before the first d, so index is 0. Since there are no further matches, index is never updated, the if condition is true, and the single-element array with the original string is returned.

  • For d20+2 there are two matches, one before d and one before +. index is therefore updated to a non-zero value, and the ArrayList built in the code above is returned. It contains the empty string produced by splitting on a delimiter that is the first character of the string, as already explained in @Stema's answer.
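
The two cases above can be checked by asking a Matcher where each match ends. This is only a minimal sketch: the class name MatchPositions and the helper matchEnds are made up for illustration. It inspects the matches rather than the output of split() because later Java releases (Java 8 onwards, per the Pattern.split documentation) no longer produce the leading empty string for a zero-width match at index 0, while the match positions themselves are the same on every version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the JDK.
class MatchPositions {

    // Collects m.end() for every match, i.e. the values that
    // Pattern.split() would successively assign to index.
    public static List<Integer> matchEnds(String regex, String input) {
        List<Integer> ends = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            ends.add(m.end());
        }
        return ends;
    }

    public static void main(String[] args) {
        String regex = "(?=[dk+-])";
        // Single zero-width match at the start: index stays 0.
        System.out.println(matchEnds(regex, "d%"));     // [0]
        // Second match before '+': index ends up as 3.
        System.out.println(matchEnds(regex, "d20+2"));  // [0, 3]
    }
}
```

For d% the last value is 0, which is the situation the index == 0 check catches. For d20+2 the last value is 3, so the check is skipped.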

So, to get the behaviour you want (that is, split on the delimiter only when it is not at the beginning), you can add a negative look-behind to your regex pattern:

"(?<!^)(?=[dk+-])"  // No need to escape + or a hyphen at the end of a character class

This splits on the empty string that is followed by a character from your character class but is not preceded by the beginning of the string.
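
As a quick check of the look-behind pattern, here is a small sketch. The class name LookbehindSplit and the method tokens are hypothetical, and the inputs are the strings discussed in this thread.

```java
import java.util.Arrays;

// Hypothetical demo class for the look-behind pattern.
class LookbehindSplit {

    // Splits before d, k, + or -, except at the very start of the string.
    public static String[] tokens(String message) {
        return message.split("(?<!^)(?=[dk+-])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("d20+2")));  // [d20, +2]
        System.out.println(Arrays.toString(tokens("d%")));     // [d%]
        System.out.println(Arrays.toString(tokens("d%+3")));   // [d%, +3]
    }
}
```

Because the pattern can no longer match at index 0, none of the results starts with an empty string.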


Consider splitting the string "ad%" on the regex pattern "a(?=[dk+-])". This gives you an array whose first element is an empty string. The only change from the earlier pattern is that the empty match at the start is replaced by a match on a:

"ad%".split("a(?=[dk+-])");  // Arrays.toString of the result gives [, d%]

Why? Because the length of the matched string is 1. The index value after the first match, m.end(), is therefore 1 rather than 0, so the single-element array is not returned.
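
The same reasoning can be sketched in code. The class name NonEmptyMatch and the helper firstMatchEnd are illustrative only. The helper reports where the first match ends, which is the value index takes after the first loop iteration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for a non-empty match at the start.
class NonEmptyMatch {

    // Returns m.end() of the first match, or -1 if there is no match.
    public static int firstMatchEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String regex = "a(?=[dk+-])";
        // The match consumes 'a', so it ends at index 1, not 0.
        System.out.println(firstMatchEnd(regex, "ad%"));            // 1
        // The text before the match is empty, hence the leading "".
        System.out.println(Arrays.toString("ad%".split(regex)));    // [, d%]
    }
}
```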


I was surprised that it does not happen for cases 2 and 3, so the real question here is:

Why is there NO empty string at the start for "d20" and "d%"?

As Rohit Jain explained in his detailed analysis, this happens when only one match is found, at the start of the string, and its end index is 0. (This can only happen when the match is zero-width, that is, when only a lookaround assertion is used to find it.)

The problem is that d%+3 starts with a character you are splitting on. So your regex matches before the first character, and you get an empty string at the start.

You can add a lookbehind to ensure that your expression does not match at the start of the string, so that it is not split there:

String[] tokens = message.split("(?<!^)(?=[dk\\+\\-])");

(?<!^) is a negative lookbehind assertion that is true when the current position is not the start of the string.