What's the difference between \z and \Z in a regular expression, and when and how do I use each?

Solution 1:

Even though \Z and $ only match at the end of the string (when the option for the caret and dollar to match at embedded line breaks is off), there is one exception. If the string ends with a line break, then \Z and $ will match at the position before that line break, rather than at the very end of the string.

This "enhancement" was introduced by Perl, and is copied by many regex flavors, including Java, .NET and PCRE. In Perl, when reading a line from a file, the resulting string will end with a line break. Reading a line from a file with the text "joe" results in the string joe\n. When applied to this string, both ^[a-z]+$ and \A[a-z]+\Z will match "joe".

If you only want a match at the absolute very end of the string, use \z (lower case z instead of upper case Z). \A[a-z]+\z does not match joe\n. \z matches after the line break, which is not matched by the character class.

http://www.regular-expressions.info/anchors.html
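
To make the quoted "joe" example concrete, here is a minimal Java sketch. The class name JoeAnchors and the helper found are made up for illustration and do not come from the article. It uses Matcher.find() so that the anchors alone decide the outcome.

```java
import java.util.regex.Pattern;

class JoeAnchors {
    // Hypothetical helper: true if the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String line = "joe\n"; // a line as read from a file, with its line break

        System.out.println(found("^[a-z]+$", line));      // true: $ matches before the final \n
        System.out.println(found("\\A[a-z]+\\Z", line));  // true: \Z matches before the final \n
        System.out.println(found("\\A[a-z]+\\z", line));  // false: \z only matches after the \n
        System.out.println(found("\\A[a-z]+\\z", "joe")); // true: no trailing line break
    }
}
```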

The way I read this, "StackOverflow\n".matches("StackOverflow\\z") should return false because your pattern does not include the newline.

"StackOverflow\n".matches("StackOverflow\\z\\n") => false
"StackOverflow\n".matches("StackOverflow\\Z\\n") => true

Solution 2:

Just checked it. It looks like when Matcher.matches() is invoked (as happens behind the scenes in your code), \Z behaves like \z. However, when Matcher.find() is invoked, they behave differently, as expected. The following returns true:

Pattern p = Pattern.compile("StackOverflow\\Z");
Matcher m = p.matcher("StackOverflow\n");
System.out.println(m.find());

and if you replace \Z with \z it returns false.

I find this a little surprising...

Solution 3:

I think the main problem here is the unexpected behavior of matches(): any match must consume the whole input string. Both of your examples fail because the regexes don't consume the linefeed at the end of the string. The anchors have nothing to do with it.

In most languages, a regex match can occur anywhere, consuming all, some, or none of the input string. And Java has a method, Matcher#find(), that performs this traditional kind of match. However, the results are the opposite of what you said you expected:

Pattern.compile("StackOverflow\\z").matcher("StackOverflow\n").find()  //false
Pattern.compile("StackOverflow\\Z").matcher("StackOverflow\n").find()  //true

In the first example, the \z needs to match the end of the string, but the trailing linefeed is in the way. In the second, the \Z matches before the linefeed, which is at the end of the string.
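
As a rough side-by-side of the two methods, the sketch below runs matches() and find() on the same input for each pattern from this thread. The class name MatchesVsFind and the helper describe are illustrative names only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFind {
    // Hypothetical helper: reports the result of matches() and find() for one regex.
    static String describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        boolean whole = m.matches(); // must consume the entire input
        m.reset();                   // start over before searching
        boolean partial = m.find();  // may match any part of the input
        return "matches=" + whole + " find=" + partial;
    }

    public static void main(String[] args) {
        String input = "StackOverflow\n";
        System.out.println(describe("StackOverflow\\Z", input));    // matches=false find=true
        System.out.println(describe("StackOverflow\\z", input));    // matches=false find=false
        System.out.println(describe("StackOverflow\\Z\\n", input)); // matches=true find=true
        System.out.println(describe("StackOverflow\\z\\n", input)); // matches=false find=false
    }
}
```

Only the third pattern passes matches(), because it is the only one that both satisfies its anchor and consumes the trailing linefeed.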

Solution 4:

I think Alan Moore provided the best answer, especially the crucial point that matches() only succeeds if the regex consumes the entire input, as if it were anchored at both ends with \A and \z.

I'd also like to add a few examples and a little more explanation.

\z matches only at the very end of the string.

\Z also matches at the very end of the string, but if there's a \n, it will match before it.
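
One way to see these two rules directly is to search for the bare anchors and record where they match. This is only a sketch; the class name AnchorPositions and the helper positions are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorPositions {
    // Hypothetical helper: collects the start index of every match of the regex.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // "abc\n" has length 4: index 3 is just before the \n, index 4 is the very end.
        System.out.println(positions("\\Z", "abc\n")); // [3, 4]
        System.out.println(positions("\\z", "abc\n")); // [4]
        System.out.println(positions("\\Z", "abc"));   // [3]
    }
}
```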

Consider this program:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(".+\\Z"); // some word before the end of the string
        String text = "one\ntwo\nthree\nfour\n";
        Matcher m = p.matcher(text);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}

It will find one match and print "four".

Change \Z to \z, and it will not match anything: \z only matches at the very end of the string, after the final \n, and the dot cannot consume that \n to get there.

However, the following version with \z will also print "four", because there's no \n at the end of the input:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(".+\\z");
        String text = "one\ntwo\nthree\nfour";
        Matcher m = p.matcher(text);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}