Backreferences Syntax in Replacement Strings (Why Dollar Sign?)
In Java, and it seems in a few other languages, backreferences in the pattern are preceded by a backslash (e.g. \1, \2, \3, etc.), but in a replacement string they are preceded by a dollar sign (e.g. $1, $2, $3, and also $0).
Here's a snippet to illustrate:
System.out.println(
    "left-right".replaceAll("(.*)-(.*)", "\\2-\\1") // WRONG!!!
); // prints "2-1"
System.out.println(
    "left-right".replaceAll("(.*)-(.*)", "$2-$1") // CORRECT!
); // prints "right-left"
System.out.println(
    "You want million dollar?!?".replaceAll("(\\w*) dollar", "US\\$ $1")
); // prints "You want US$ million?!?"
System.out.println(
    "You want million dollar?!?".replaceAll("(\\w*) dollar", "US$ \\1")
); // throws IllegalArgumentException: Illegal group reference
Questions:
- Is the use of $ for backreferences in replacement strings unique to Java? If not, what language started it? What flavors use it and what don't?
- Why is this a good idea? Why not stick to the same pattern syntax? Wouldn't that lead to a more cohesive and an easier to learn language?
- Wouldn't the syntax be more streamlined if statements 1 and 4 in the above were the "correct" ones instead of 2 and 3?
Solution 1:
Is the use of $ for backreferences in replacement strings unique to Java?
No. Perl uses it, and Perl certainly predates Java's Pattern class. Java's regex support is explicitly described in terms of Perl regexes.
For example: http://perldoc.perl.org/perlrequick.html#Search-and-replace
Why is this a good idea?
Well obviously you don't think it is a good idea! But one reason that it is a good idea is to make Java search/replace support (more) compatible with Perl's.
There is another possible reason why $ might have been viewed as a better choice than \. That is that \ has to be written as \\ in a Java String literal.
But all of this is pure speculation. None of us were in the room when the design decisions were made. And ultimately it doesn't really matter why they designed the replacement String syntax that way. The decisions have been made and set in concrete, and any further discussion is purely academic ... unless you just happen to be designing a new language or a new regex library for Java.
Solution 2:
After doing some research, I understand the issues now: Perl had to use different symbols for pattern backreferences and replacement backreferences, and while java.util.regex.* doesn't have to follow suit, it chooses to, for traditional rather than technical reasons.
On the Perl side
(Please keep in mind that all I know about Perl at this point comes from reading Wikipedia articles, so feel free to correct any mistakes I may have made)
The reason why it had to be done this way in Perl is the following:
- Perl uses $ as a sigil (i.e. a symbol attached to a variable name).
- Perl string literals are variable interpolated.
- Perl regex actually captures groups as variables $1, $2, etc.
Thus, because of the way Perl is interpreted and how its regex engine works, backreferences in the pattern must be preceded by a backslash (e.g. \1), because if the sigil $ were used instead (e.g. $1), it would cause unintended variable interpolation into the pattern.
The replacement string, due to how it works in Perl, is evaluated within the context of every match. It is most natural for Perl to use variable interpolation here, so the regex engine captures groups into variables $1, $2, etc., to make this work seamlessly with the rest of the language.
References
- Wikipedia/String literal - variable interpolation
- Wikipedia/Sigil (computer programming)
On the Java side
Java is a very different language from Perl, but most importantly here, it has no variable interpolation. Moreover, replaceAll is a method call, and as with all method calls in Java, arguments are evaluated once, before the method is invoked.
Thus, a variable interpolation feature by itself would not be enough, since in essence the replacement string must be re-evaluated on every match, and that's just not the semantics of method calls in Java. A variable-interpolated replacement string that is evaluated before replaceAll is even invoked is practically useless; the interpolation needs to happen during the method call, on every match.
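To make the "evaluated once" point concrete, here is a small sketch. The class and method names are made up for illustration. The first method passes an ordinary expression as the replacement, so it is computed a single time before replaceAll runs. The second uses Matcher.replaceAll(Function), which as far as I know only exists from Java 9 onwards, to get a callback that runs on every match:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

public class ReplacementEvaluationDemo {

    // The replacement argument is an ordinary expression: it is evaluated once,
    // before replaceAll runs, so every match receives the same string.
    public static String evaluatedOnce() {
        AtomicInteger counter = new AtomicInteger();
        return "a a a".replaceAll("a", String.valueOf(counter.getAndIncrement()));
    }

    // Java 9+ only: Matcher.replaceAll(Function<MatchResult, String>) calls the
    // function once per match, so the replacement can differ each time.
    public static String evaluatedPerMatch() {
        AtomicInteger counter = new AtomicInteger();
        return Pattern.compile("a")
                .matcher("a a a")
                .replaceAll(match -> String.valueOf(counter.getAndIncrement()));
    }

    public static void main(String[] args) {
        System.out.println(evaluatedOnce());     // prints "0 0 0"
        System.out.println(evaluatedPerMatch()); // prints "0 1 2"
    }
}
```

Note that the string returned by the function is still processed as a replacement string, so $ and \ keep their special meaning there.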
Since that is not the semantics of the Java language, replaceAll must do this "just-in-time" interpolation manually. As such, there is absolutely no technical reason why $ is the escape symbol for backreferences in replacement strings. It could very well have been \. Conversely, backreferences in the pattern could also have been escaped with $ instead of \, and it would still have worked just as well technically.
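As a rough illustration that the choice of symbol is arbitrary, here is a minimal sketch of a replace-all that interprets \1 to \9 in the replacement string by hand. The helper replaceAllBackslash is hypothetical and not part of java.util.regex; it only assumes the standard Pattern and Matcher classes, and it makes no claim about how the library is actually implemented:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackslashReplacementDemo {

    // Hypothetical helper: like String.replaceAll, but the replacement string
    // uses \0..\9 for groups, and \x for a literal x (so \\ is one backslash).
    public static String replaceAllBackslash(String input, String regex, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0; // end of the previous match
        while (m.find()) {
            out.append(input, last, m.start()); // text between matches
            // "Just-in-time" interpolation: re-read the replacement on every match.
            for (int i = 0; i < replacement.length(); i++) {
                char c = replacement.charAt(i);
                if (c == '\\' && i + 1 < replacement.length()) {
                    char next = replacement.charAt(++i);
                    if (next >= '0' && next <= '9') {
                        // Throws IndexOutOfBoundsException if no such group exists.
                        String group = m.group(next - '0');
                        if (group != null) {
                            out.append(group);
                        }
                    } else {
                        out.append(next); // escaped literal character
                    }
                } else {
                    out.append(c); // ordinary character, including $
                }
            }
            last = m.end();
        }
        out.append(input, last, input.length()); // tail after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            replaceAllBackslash("left-right", "(.*)-(.*)", "\\2-\\1")
        ); // prints "right-left"
        System.out.println(
            replaceAllBackslash("You want million dollar?!?", "(\\w*) dollar", "US$ \\1")
        ); // prints "You want US$ million?!?"
    }
}
```

With this helper, the replacement strings from statements 1 and 4 of the question behave the way the question wished they would, which suggests that the $ convention is a design choice rather than a technical necessity.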
The reason Java does regex the way it does is purely traditional: it's simply following the precedent set by Perl.