Why does the order of alternatives matter in regex?


using System;
using System.Text.RegularExpressions;

namespace RegexNoMatch {
    class Program {
        static void Main () {
            string input = "a foobar& b";
            string regex1 = "(foobar|foo)&?";
            string regex2 = "(foo|foobar)&?";
            string replace = "$1";
            Console.WriteLine(Regex.Replace(input, regex1, replace));
            Console.WriteLine(Regex.Replace(input, regex2, replace));
        }
    }
}

Expected output

a foobar b
a foobar b

Actual output

a foobar b
a foobar& b


Why does the replacement not work when the order of "foo" and "foobar" in the regex pattern is changed? How can I fix this?

Solution 1:

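The regular expression engine tries the alternatives in the order in which they are specified, from left to right. So when the pattern is (foo|foobar)&?, it matches foo at the first opportunity. The optional & then matches nothing, because the next character is b, and the match is complete: only foo is replaced (by itself). The engine then continues searching in the rest of the input, bar& b, which contains no further match, so the & stays in the output.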
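To see this directly, one can list what each pattern actually consumes. This is only a minimal sketch; the class name MatchInspector and the helper ListMatches are made up for illustration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class MatchInspector {
    // Hypothetical helper: returns the text of every match of pattern in input.
    public static string[] ListMatches(string input, string pattern) {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main() {
        string input = "a foobar& b";

        // Longer alternative first: the whole "foobar&" is consumed.
        Console.WriteLine(string.Join(",", ListMatches(input, "(foobar|foo)&?"))); // foobar&

        // Shorter alternative first: only "foo" is consumed, "bar&" is left in place.
        Console.WriteLine(string.Join(",", ListMatches(input, "(foo|foobar)&?"))); // foo
    }
}
```

With the second pattern the match ends right after foo, so the replacement never touches the &.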

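In other words, because foo is a prefix of foobar and the rest of the pattern (&?) can match the empty string, (foo|foobar)&? will never match foobar: the foo alternative always succeeds first, and the engine has no reason to backtrack and try the second one. To fix it, put the longer alternative first, as in (foobar|foo)&?.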
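A few ways to avoid the problem can be sketched as follows, assuming the goal is to strip an optional trailing & after either word. The class name AlternationFixes and the helper Strip are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

static class AlternationFixes {
    // Hypothetical helper: replaces every match with the contents of group 1.
    public static string Strip(string input, string pattern) {
        return Regex.Replace(input, pattern, "$1");
    }

    public static void Main() {
        string input = "a foobar& b";

        // 1. Put the longer alternative first.
        Console.WriteLine(Strip(input, "(foobar|foo)&?"));    // a foobar b

        // 2. Factor out the common prefix so the order no longer matters.
        Console.WriteLine(Strip(input, "(foo(?:bar)?)&?"));   // a foobar b

        // 3. Require something after the group that "foo" alone cannot satisfy
        //    (here a word boundary), which forces the engine to backtrack
        //    and try the "foobar" alternative.
        Console.WriteLine(Strip(input, @"(foo|foobar)\b&?")); // a foobar b
    }
}
```

The third variant shows that the engine can reach the second alternative, but only when the first one leads to a failure later in the pattern.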

Occasionally, this can be a very useful trick, actually. The pattern (o|a|(\w)) will allow you to capture \w and a or o differently:

Regex.Replace("a foobar& b", "(o|a|(\\w))", "$2") // " fbr& b" (note the leading space)