Why is "asdf".replace(/.*/g, "x") == "xx"?
As per the ECMA-262 standard, String.prototype.replace calls RegExp.prototype[@@replace], which says:
11. Repeat, while done is false
a. Let result be ? RegExpExec(rx, S).
b. If result is null, set done to true.
c. Else result is not null,
i. Append result to the end of results.
ii. If global is false, set done to true.
iii. Else,
1. Let matchStr be ? ToString(? Get(result, "0")).
2. If matchStr is the empty String, then
a. Let thisIndex be ? ToLength(? Get(rx, "lastIndex")).
b. Let nextIndex be AdvanceStringIndex(S, thisIndex, fullUnicode).
c. Perform ? Set(rx, "lastIndex", nextIndex, true).
where rx
is /.*/g
and S
is 'asdf'
.
See 11.c.iii.2.b:
b. Let nextIndex be AdvanceStringIndex(S, thisIndex, fullUnicode).
Therefore in 'asdf'.replace(/.*/g, 'x')
it is actually:
- result (undefined), results =
[]
, lastIndex =0
- result =
'asdf'
, results =[ 'asdf' ]
, lastIndex =4
- result =
''
, results =[ 'asdf', '' ]
, lastIndex =4
,AdvanceStringIndex
, set lastIndex to5
- result =
null
, results =[ 'asdf', '' ]
, return
Therefore there are 2 matches.
Together in an offline chat with yawkat, we found an intuitive way of seeing why "abcd".replace(/.*/g, "x")
exactly produces two matches. Note that we haven't checked whether it completely equals the semantics imposed by the ECMAScript standard, hence just take it as a rule of thumb.
Rules of Thumb
- Consider the matches as a list of tuples
(matchStr, matchIndex)
in chronological order that indicate which string parts and indices of the input string have already been eaten up. - This list is continuously built up starting from the left of the input string for the regex.
- Parts already eaten up cannot be matched anymore
- Replacement is done at indices given by
matchIndex
overwriting the substringmatchStr
at that position. IfmatchStr = ""
, then the "replacement" is effectively insertion.
Formally, the act of matching and replacement is described as a loop as seen in the other answer.
Easy Examples
-
"abcd".replace(/.*/g, "x")
outputs"xx"
:-
The match list is
[("abcd", 0), ("", 4)]
Notably, it does not include the following matches one could have thought of for the following reasons:
-
("a", 0)
,("ab", 0)
: the quantifier*
is greedy -
("b", 1)
,("bc", 1)
: due to the previous match("abcd", 0)
, the strings"b"
and"bc"
are already eaten up -
("", 4), ("", 4)
(i.e. twice): the index position 4 is already eaten up by the first apparent match
-
-
Hence, the replacement string
"x"
replaces the found match strings exactly at those positions: at position 0 it replaces the string"abcd"
and at position 4 it replaces""
.Here you can see that replacement can act as true replacement of a previous string or just as insertion of a new string.
-
-
"abcd".replace(/.*?/g, "x")
with a lazy quantifier*?
outputs"xaxbxcxdx"
-
The match list is
[("", 0), ("", 1), ("", 2), ("", 3), ("", 4)]
In contrast to the previous example, here
("a", 0)
,("ab", 0)
,("abc", 0)
, or even("abcd", 0)
are not included due to the quantifier's laziness that strictly limits it to find the shortest possible match. Since all match strings are empty, no actual replacement occurs, but instead insertions of
x
at positions 0, 1, 2, 3, and 4.
-
-
"abcd".replace(/.+?/g, "x")
with a lazy quantifier+?
outputs"xxxx"
- The match list is
[("a", 0), ("b", 1), ("c", 2), ("d", 3)]
- The match list is
-
"abcd".replace(/.{2,}?/g, "x")
with a lazy quantifier[2,}?
outputs"xx"
- The match list is
[("ab", 0), ("cd", 2)]
- The match list is
"abcd".replace(/.{0}/g, "x")
outputs"xaxbxcxdx"
by the same logic as in example 2.
Harder Examples
We can consistently exploit the idea of insertion instead of replacement if we just always match an empty string and control the position where such matches happen to our advantage. For example, we can create regular expressions matching the empty string at every even position to insert a character there:
-
"abcdefgh".replace(/(?<=^(..)*)/g, "_"))
with a positive lookbehind(?<=...)
outputs"_ab_cd_ef_gh_"
(only supported in Chrome so far)- The match list is
[("", 0), ("", 2), ("", 4), ("", 6), ("", 8)]
- The match list is
-
"abcdefgh".replace(/(?=(..)*$)/g, "_"))
with a positive lookahead(?=...)
outputs"_ab_cd_ef_gh_"
- The match list is
[("", 0), ("", 2), ("", 4), ("", 6), ("", 8)]
- The match list is
The first match is obviously "asdf"
(Position [0,4]). Because the global flag (g
) is set, it continues searching. At this point (Position 4), it finds a second match, an empty string (Position [4,4]).
Remember that *
matches zero or more elements.
simply, the first x
is for the replacement of matching asdf
.
second x
for the empty string after asdf
. Search terminates when empty.