Why does the order of alternatives matter in regex?
Code
using System;
using System.Text.RegularExpressions;

namespace RegexNoMatch {
    class Program {
        static void Main () {
            string input = "a foobar& b";
            string regex1 = "(foobar|foo)&?";
            string regex2 = "(foo|foobar)&?";
            string replace = "$1";
            Console.WriteLine(Regex.Replace(input, regex1, replace));
            Console.WriteLine(Regex.Replace(input, regex2, replace));
            Console.ReadKey();
        }
    }
}
Expected output
a foobar b
a foobar b
Actual output
a foobar b
a foobar& b
Question
Why does the replacement stop working when the order of "foo" and "foobar" in the regex pattern is changed? How can I fix this?
Solution 1:
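The regular expression engine tries the alternatives in the order in which they are written and takes the first one that lets the overall match succeed. With the pattern (foo|foobar)&? it matches foo at the start of foobar, and &? then matches the empty string, because the next character is b rather than &. The overall match is therefore just foo, which is replaced by itself. The search then resumes at bar& b, where neither alternative can match, so the & is left in place.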
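To make the behaviour described above visible, here is a minimal sketch that lists where each pattern matches and what text it consumes. The class name AlternationTrace and the helper Describe are hypothetical names introduced for this illustration; only Regex.Matches and the Match.Index and Match.Value properties come from .NET itself.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class AlternationTrace
{
    // Hypothetical helper: returns every match as "index:value", comma-separated.
    public static string Describe(string input, string pattern)
    {
        var parts = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            parts.Add(m.Index + ":" + m.Value);
        }
        return string.Join(",", parts);
    }

    static void Main()
    {
        string input = "a foobar& b";

        // Longer alternative first: the whole "foobar&" is consumed.
        Console.WriteLine(Describe(input, "(foobar|foo)&?")); // 2:foobar&

        // Shorter alternative first: only "foo" is consumed, "bar&" is never matched.
        Console.WriteLine(Describe(input, "(foo|foobar)&?")); // 2:foo
    }
}
```

The second pattern does find a match, just a shorter one than intended, which is why the & survives the replacement.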
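In other words, because foo is a prefix of foobar and the trailing &? is optional, (foo|foobar)&? never gets as far as trying foobar: the foo alternative always lets the match succeed first. The engine would only backtrack into foobar if whatever follows the group failed after foo. To fix it, list the longer alternative first, as in (foobar|foo)&?.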
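A few ways to get the expected output can be sketched as follows. The class name AlternationFixes and the helper Strip are hypothetical; the patterns other than (foobar|foo)&? are not from the original question and are shown only as possible variants. The \b variant in particular changes what the pattern accepts (it no longer matches foo inside a longer word such as foobaz), so treat it as an assumption about what you want.

```csharp
using System;
using System.Text.RegularExpressions;

static class AlternationFixes
{
    // Hypothetical helper: replaces each match with the content of group 1.
    public static string Strip(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "$1");
    }

    static void Main()
    {
        string input = "a foobar& b";

        // 1. Put the longer alternative first.
        Console.WriteLine(Strip(input, "(foobar|foo)&?"));     // a foobar b

        // 2. Express the shared prefix once and make the suffix optional.
        Console.WriteLine(Strip(input, "(foo(?:bar)?)&?"));    // a foobar b

        // 3. Require a word boundary after the group. After "foo" the boundary
        //    check fails (the next character is 'b'), so the engine backtracks
        //    and tries the "foobar" alternative.
        Console.WriteLine(Strip(input, @"(foo|foobar)\b&?"));  // a foobar b
    }
}
```

The third variant also illustrates the backtracking rule: the second alternative is only tried because something after the group failed once foo had been matched.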
Occasionally, this can actually be a very useful trick. The pattern (o|a|(\w)) lets you treat a and o differently from every other \w character, because the inner group 2 only captures when neither literal alternative matched:
Regex.Replace("a foobar& b", "(o|a|(\\w))", "$2") // " fbr& b" (the space after the removed leading "a" remains)
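The same pattern can also be used with a match evaluator, which makes the "which alternative matched" decision explicit. This is only a sketch: the class name GroupInspection and the method UppercaseOthers are hypothetical, and uppercasing is an arbitrary choice for illustration. It relies on Group.Success, which is false for group 2 whenever the o or a alternative was taken.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupInspection
{
    // Hypothetical helper: leaves "o" and "a" untouched and uppercases every
    // other word character, i.e. every match where group 2 participated.
    public static string UppercaseOthers(string input)
    {
        return Regex.Replace(
            input,
            @"(o|a|(\w))",
            m => m.Groups[2].Success ? m.Value.ToUpperInvariant() : m.Value);
    }

    static void Main()
    {
        Console.WriteLine(UppercaseOthers("a foobar& b")); // a FooBaR& B
    }
}
```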