Regex that can match an empty string is breaking the JavaScript regex engine
I wrote the following regex: /\D(?!.*\D)|^-?|\d+/g
I think it should work this way:
\D(?!.*\D) # match the last non-digit
| # or
^-? # match the start of the string with optional literal '-' character
| # or
\d+ # match digits
But, it doesn't:
var arrTest = '12,345,678.90'.match(/\D(?!.*\D)|^-?|\d+/g);
console.log(arrTest);
var test = arrTest.join('').replace(/[^\d-]/, '.');
console.log(test);
However, when I try it with the PCRE (PHP) flavour online at Regex101, it works as I described.
I don't know whether my understanding of how it should work is wrong, or whether some patterns are not allowed in the JavaScript regex flavour.
JS works differently from PCRE here. The JS regex engine does not retry at the same position after a zero-length match: the index is simply incremented by one, so the character right after a zero-length match is skipped. The ^-? alternative can match an empty string, and it does so at the start of 12,345,678.90, which causes the 1 to be skipped.
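To see the skipped character directly, here is a small check that uses the same input and pattern as the question (the variable names are only for illustration):

```javascript
// Same input and pattern as in the question.
var input = '12,345,678.90';
var matches = input.match(/\D(?!.*\D)|^-?|\d+/g);

// The first match is the empty string produced by ^-? at index 0,
// and the next match starts at index 1, so the leading "1" is lost.
console.log(matches);          // ["", "2", "345", "678", ".", "90"]
console.log(matches.join('')); // "2345678.90"
```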
If we have a look at the String#match documentation, we will see that each call to match with a global regex increases the regex object's lastIndex after a zero-length match is found:
8. Else, global is true
   a. Call the [[Put]] internal method of rx with arguments "lastIndex" and 0.
   b. Let A be a new array created as if by the expression new Array() where Array is the standard built-in constructor with that name.
   c. Let previousLastIndex be 0.
   d. Let n be 0.
   e. Let lastMatch be true.
   f. Repeat, while lastMatch is true
      i. Let result be the result of calling the [[Call]] internal method of exec with rx as the this value and argument list containing S.
      ii. If result is null, then set lastMatch to false.
      iii. Else, result is not null
         1. Let thisIndex be the result of calling the [[Get]] internal method of rx with argument "lastIndex".
         2. If thisIndex = previousLastIndex then
            a. Call the [[Put]] internal method of rx with arguments "lastIndex" and thisIndex+1.
            b. Set previousLastIndex to thisIndex+1.
So, the matching process goes through steps 8a to 8e, initializing the auxiliary structures, and then enters the loop at 8f, which repeats while lastMatch is true. The internal exec call matches the empty string at the start of the input (8f.i, then 8f.iii). As the result is not null, thisIndex is set to the regex's current lastIndex, and because the match was zero-length, lastIndex has not moved (thisIndex = previousLastIndex). Both lastIndex and previousLastIndex are therefore set to thisIndex+1, so the next attempt starts one character further on. This is what skips the character at the current position after a successful zero-length match.
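To make the quoted steps concrete, here is a minimal sketch that replays them with exec in plain JavaScript. matchGlobal is a hypothetical helper name, not a built-in. The else branch and the collection of results come from the part of the algorithm that is not quoted above, and current engines follow a later edition of the specification with different wording, so treat this as an illustration of the excerpt rather than of any engine's actual code.

```javascript
// Hypothetical helper that replays the quoted steps; rx must have the g flag.
function matchGlobal(rx, s) {
  rx.lastIndex = 0;                 // step a
  var results = [];                 // step b
  var previousLastIndex = 0;        // step c
  var result;
  while ((result = rx.exec(s)) !== null) {  // steps f.i and f.ii
    var thisIndex = rx.lastIndex;           // step f.iii.1
    if (thisIndex === previousLastIndex) {  // step f.iii.2: lastIndex did not move
      rx.lastIndex = thisIndex + 1;         // force the next attempt one char ahead
      previousLastIndex = thisIndex + 1;
    } else {
      previousLastIndex = thisIndex;        // branch not shown in the excerpt
    }
    results.push(result[0]);                // collect the matched text
  }
  return results.length ? results : null;
}

var replayed = matchGlobal(/\D(?!.*\D)|^-?|\d+/g, '12,345,678.90');
console.log(replayed); // ["", "2", "345", "678", ".", "90"]
```

For this input the helper returns the same array as the native match call, with the empty match at index 0 followed by a match that starts at index 1.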
You may actually use a simpler regex inside a replace method and use a callback to supply the appropriate replacements:
var res = '-12,345,678.90'.replace(/(\D)(?!.*\D)|^-|\D/g, function($0,$1) {
return $1 ? "." : "";
});
console.log(res);
Pattern details:
- (\D)(?!.*\D) - a non-digit (captured into Group 1) that is not followed by 0+ chars other than a newline and then another non-digit
- | - or
- ^- - a hyphen at the string start
- | - or
- \D - a non-digit
Note that here you do not even have to make the hyphen at the start optional.
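One thing to watch with the callback as written: Group 1 is not set when the ^- alternative matches, so the leading hyphen is replaced with an empty string just like the other separators, and '-12,345,678.90' becomes 12345678.90. If the sign should survive, a variant along these lines might work. It is only a sketch: normalizeNumber is a hypothetical name, and the hyphen gets its own capture group and is tried first so that the other alternatives cannot consume it.

```javascript
// Hypothetical helper: keeps a leading "-", turns the last non-digit into ".",
// and removes every other non-digit.
function normalizeNumber(s) {
  return s.replace(/^(-)|(\D)(?!.*\D)|\D/g, function (match, sign, last) {
    if (sign) return "-";      // leading hyphen: keep it
    return last ? "." : "";    // last non-digit -> ".", any other -> removed
  });
}

console.log(normalizeNumber('-12,345,678.90')); // "-12345678.90"
console.log(normalizeNumber('12,345,678.90'));  // "12345678.90"
console.log(normalizeNumber('1.234,56'));       // "1234.56"
```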
You can reorder your alternation patterns and use this in JS to make it work:
var arrTest = '12,345,678.90'.match(/\D(?!.*\D)|\d+|^-?/g);
console.log(arrTest);
var test = arrTest.join('').replace(/\D/, '.');
console.log(test);
//=> 12345678.90
This comes down to a difference between JavaScript and PHP (PCRE) regex behavior.
In JavaScript:
'12345'.match(/^|.+/gm)
//=> ["", "2345"]
In PHP:
preg_match_all('/^|.+/m', '12345', $m);
print_r($m);
Array
(
[0] => Array
(
[0] =>
[1] => 12345
)
)
So when ^ matches in JavaScript, the regex engine moves one position ahead, and anything after the alternation | matches from the 2nd position onwards in the input.