Why does a simple .*? non-greedy regex greedily include additional characters before a match?

I have a very simple regex similar to this:

HOHO.*?_HO_

With this test string...

fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev

  • I expect it to match just _HOHO___HO_ (shortest match, non-greedy)
  • Instead it matches _HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_ (longest match, looks greedy).

Why? How can I make it match the shortest match?

Adding and removing the ? gives the same result.

Edit - better test string that shows why [^HOHO] doesn't work: fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye


All I can think of is that maybe it is matching multiple times - but there's only one match for _HO_, so I don't understand why it isn't taking the shortest match that ends at the _HO_, discarding the rest.

I've browsed all the questions I can find with titles like "Non-greedy regex acts greedy", but they all seem to have some other problem.


Solution 1:

I figured out a solution with some help from Regex lazy vs greedy confusion.

In regex engines like the one used by Javascript (NFA engines I believe), non-greedy only gives you the match that is shortest going left to right - from the first left-hand match that fits to the nearest right-hand match.

Where there are many left-hand matches for one right-hand match, it will always go from the first it reaches (which will actually give the longest match).

Essentially, it goes through the string one character at a time asking "Are there matches from this character? If so, match the shortest and finish. If no, move to next character, repeat". I expected it to be "Are there matches anywhere in this string? If so, match the shortest of all of them".


You can approximate a regex that is non-greedy in both directions by replacing the . with a negation meaning "not the left-side match". To negate a string like this requires negative lookaheads and non-capturing groups, but it's as simple as dropping the string into (?:(?!).). For example, (?:(?!HOHO).)

For example, the equivalent of HOHO.*?_HO_ which is non-greedy on the left and right would be:

HOHO(?:(?!HOHO).)*?_HO_

So the regex engine is essentially going through each character like this:

  • HOHO - Does this match the left side?
  • (?:(?!HOHO).)* - If so, can I reach the right-hand side without any repeats of the left side?
  • _HO_ - If so, grab everything until the right-hand match
  • ? modifier on * or + - If there are multiple right-hand matches, choose the nearest one

Solution 2:

Why it matches the whole string?

This is because regular-expression pattern matching is done by finding the first position in the string at which a match is possible. Since a match is possible starting at the first character of the string, shorter matches starting at subsequent characters are never even considered.

Example:
Let's consider a regular expression /a+?b/ and test string "aaaaaaaaab". When applied to the string it matches the whole string. Not just last a & b. This is because the first position in the string where a match is possible is at the first a.

So, if you want to match ab in aaaaaaaaab, use a negated character class based regex rather than a lazy dot:

a[^ab]*b

See the regex demo.

Source: Javascript: The Definitive Guide, Sixth Edition, Page Number: 255

Solution 3:

The result is non-greedy, because it's the shortest match from the first occurrence of HOHO until _HO_ is reached; the engine traverses the string from left to right and because it doesn't have to backtrack, it won't attempt to shorten anything.

To make it work in the way that's expected here, you need to have a greedy prefix in your expression:

/.*(HOHO.*?_HO_)/

The first memory capture contains the string that you're after; the greedy prefix will try to skip as many characters as possible, so it will match the last occurrence of HOHO first.