Question marks in regular expressions
I'm reading the regular expressions reference and I'm wondering about the ? and ?? characters. Could you explain their usefulness with some examples? I don't understand them well enough.
Thank you.
Solution 1:
This is an excellent question, and it took me a while to see the point of the lazy ?? quantifier myself.
? - Optional (greedy) quantifier
The usefulness of ? is easy enough to understand. If you wanted to find both http and https, you could use a pattern like this:
https?
This pattern will match both inputs, because it makes the s optional.
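As a quick check of the optional s, here is a minimal sketch in Python (my choice of language, not the answer's; the sample URL is just an illustration):

```python
import re

# "s?" makes the trailing s optional, so both schemes are accepted.
pattern = re.compile(r"https?")

for text in ["http", "https"]:
    match = pattern.fullmatch(text)
    print(text, "->", match.group() if match else None)

# ? is greedy by default: when an s is available, it is consumed.
print(pattern.search("https://example.com").group())  # https
```

Both inputs pass, and in the longer string the greedy ? takes the s because it can.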
?? - Optional (lazy) quantifier
?? is more subtle. It usually does the same thing ? does. It doesn't change the true/false result when you ask: "Does this input satisfy this regex?" Instead, it's relevant to the question: "Which part of this input matches this regex, and which parts belong in which groups?" If an input could satisfy the pattern in more than one way, the engine will decide how to group it based on ? vs. ?? (or * vs. *?, or + vs. +?).
Say you have a set of inputs that you want to validate and parse. Here's an (admittedly silly) example:
Input:
http123
https456
httpsomething
Expected result:
| Pass/Fail | Group 1 | Group 2 |
|---|---|---|
| Pass | http | 123 |
| Pass | https | 456 |
| Pass | http | something |
You try the first thing that comes to mind, which is this:
^(http)([a-z\d]+)$
| Pass/Fail | Group 1 | Group 2 | Grouped correctly? |
|---|---|---|---|
| Pass | http | 123 | Yes |
| Pass | http | s456 | No |
| Pass | http | something | Yes |
They all pass, but you can't use the second set of results because you only wanted 456 in Group 2.
Fine, let's try again. Let's say Group 2 can be letters or numbers, but not both:
(https?)([a-z]+|\d+)
| Pass/Fail | Group 1 | Group 2 | Grouped correctly? |
|---|---|---|---|
| Pass | http | 123 | Yes |
| Pass | https | 456 | Yes |
| Pass | https | omething | No |
Now the second input is fine, but the third one is grouped wrong because ? is greedy by default (the + is too, but the ? came first). When deciding whether the s is part of https? or [a-z]+|\d+, if the result is a pass either way, the regex engine will always pick the one on the left. So Group 2 loses s because Group 1 sucked it up.
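To see the greedy grouping for yourself, here is a small sketch in Python (my choice of language; split_groups is a hypothetical helper written for this illustration):

```python
import re

# Second attempt from above: greedy "s?" in Group 1.
greedy = re.compile(r"(https?)([a-z]+|\d+)")

def split_groups(pattern, text):
    """Return (group 1, group 2) for the first match, or None if no match."""
    m = pattern.search(text)
    return m.groups() if m else None

for text in ["http123", "https456", "httpsomething"]:
    print(text, "->", split_groups(greedy, text))
# http123       -> ('http', '123')
# https456      -> ('https', '456')
# httpsomething -> ('https', 'omething')
```

The third input shows the problem: the greedy ? claims the s before Group 2 gets a chance.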
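To fix this, you make the ? lazy. You also need the $ anchor at the end: without it, the engine could stop at http + s on the second input instead of backtracking to https + 456: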
(https??)([a-z]+|\d+)$
| Pass/Fail | Group 1 | Group 2 | Grouped correctly? |
|---|---|---|---|
| Pass | http | 123 | Yes |
| Pass | https | 456 | Yes |
| Pass | http | something | Yes |
Essentially, this means: "Match https if you have to, but see if this still passes when Group 1 is just http." The engine realizes that the s could work as part of [a-z]+|\d+, so it prefers to put it into Group 2.
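Here is a minimal Python sketch of the lazy version (Python is my choice, not something the patterns depend on). It also runs the same pattern without the trailing $, as an extra experiment of mine, to show why "if you have to" depends on what comes after the group:

```python
import re

lazy = re.compile(r"(https??)([a-z]+|\d+)$")
lazy_unanchored = re.compile(r"(https??)([a-z]+|\d+)")  # same, minus the $

for text in ["http123", "https456", "httpsomething"]:
    print(text, "->", lazy.search(text).groups())
# http123       -> ('http', '123')
# https456      -> ('https', '456')
# httpsomething -> ('http', 'something')

# Without $, nothing forces the engine to reach the end of the input,
# so the lazy ?? skips the s and Group 2 stops after a single letter.
print(lazy_unanchored.search("https456").groups())  # ('http', 's')
```

With the anchor in place, the engine first tries Group 1 as http, fails to reach the end of https456 that way, and only then backtracks to https.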
Solution 2:
The key difference between ? and ?? concerns their laziness. ?? is lazy, ? is not.
Let's say you want to search for the word "car" in a body of text, but you don't want to be restricted to just the singular "car"; you also want to match against the plural "cars".
Here's an example sentence:
I own three cars.
Now, if I wanted to match the word "car" and I only wanted to get the string "car" in return, I would use the lazy ?? like so:
cars??
This says, "look for the word car or cars; if you find either, return car and nothing more".
Now, if I wanted to match against the same words ("car" or "cars") and I wanted to get the whole match in return, I'd use the non-lazy ? like so:
cars?
This says, "look for the word car or cars, and return either car or cars, whatever you find".
In the world of computer programming, lazy generally means "evaluating only as much as is needed". So the lazy ?? only returns as much as is needed to make a match; since the "s" in "cars" is optional, don't return it. On the flip side, non-lazy (sometimes called greedy) operations evaluate as much as possible, hence the ? returns all of the match, including the optional "s".
Personally, I find myself using ? as a way of making other regular expression operators lazy (like the * and + operators) more often than I use it for simple character optionality, but YMMV.
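To illustrate that last point, here is a rough sketch of ? making + lazy. It is written in Python rather than Clojure purely for illustration, and the HTML-like sample string is made up:

```python
import re

text = "<b>bold</b> and <i>italic</i>"  # hypothetical sample input

# Greedy +: runs from the first "<" all the way to the last ">".
print(re.findall(r"<.+>", text))
# ['<b>bold</b> and <i>italic</i>']

# Lazy +?: stops at the first ">" that completes a match.
print(re.findall(r"<.+?>", text))
# ['<b>', '</b>', '<i>', '</i>']
```

The same idea applies to *? versus *.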
See it in Code
Here's the above implemented in Clojure as an example:
(re-find #"cars??" "I own three cars.")
;=> "car"
(re-find #"cars?" "I own three cars.")
;=> "cars"
re-find is a function that takes a regular expression as its first argument (here #"cars??") and returns the first match it finds in its second argument (here "I own three cars.").
Solution 3:
Some Other Uses of Question marks in regular expressions
Apart from what's explained in other answers, there are still 3 more uses of Question Marks in regular expressions.
- Negative Lookahead
Negative lookaheads are used if you want to match something not followed by something else. The negative lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point:
x(?!x2)
Example:
- Consider the word There.
- By default, the regex e finds the third letter, which is the first e in There.
- However, if you don't want an e that is immediately followed by r, you can use the regex e(?!r). Now the result is the last e, the fifth letter of There.

- Positive Lookahead
Positive lookahead works just the same. q(?=u) matches a q that is immediately followed by a u, without making the u part of the match. The positive lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign.
Example:
- Consider the word getting.
- By default, the regex t finds the third letter, which is the first t in getting.
- However, if you want the t that is immediately followed by i, you can use the regex t(?=i). Now the result is the second t, the fourth letter of getting.

- Non-Capturing Groups
Whenever you place part of a regular expression in parentheses (), it creates a numbered capturing group, which stores the part of the string matched by the part of the regular expression inside the parentheses. If you do not need the group to capture its match, you can turn it into a non-capturing group:
(?:Value)
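The three constructs can be tried out with a short sketch like the one below. It uses Python as an illustrative choice, and the "Value=42" input is a made-up example; indexes are zero-based, so index 2 is the third letter.

```python
import re

# Negative lookahead: an e that is NOT immediately followed by r.
print(re.search(r"e", "There").start())       # 2 (third letter)
print(re.search(r"e(?!r)", "There").start())  # 4 (fifth letter)

# Positive lookahead: a t that IS immediately followed by i.
print(re.search(r"t", "getting").start())     # 2 (third letter)
m = re.search(r"t(?=i)", "getting")
print(m.start(), m.group())                   # 3 t  (the i is not consumed)

# Capturing vs. non-capturing group on a hypothetical input.
print(re.search(r"(Value)=(\d+)", "Value=42").groups())    # ('Value', '42')
print(re.search(r"(?:Value)=(\d+)", "Value=42").groups())  # ('42',)
```

Note that the lookahead only inspects the next character: the match itself is still just the single e or t.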
Solution 4:
? simply makes the previous item (character, character class, group) optional:
colou?r matches "color" and "colour"
(swimming )?pool matches "pool" and "swimming pool" (for example in "a pool" and "the swimming pool")
?? is the same, but it's also lazy, so the item will be excluded if at all possible. As those docs note, ?? is rare in practice. I have never used it.
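A small sketch of these two patterns, in Python as an illustrative choice. The last line is an extra observation of mine: when searching, the leftmost starting position wins first, so "if at all possible" is decided per starting position.

```python
import re

# Optional single character.
print(bool(re.fullmatch(r"colou?r", "color")))   # True
print(bool(re.fullmatch(r"colou?r", "colour")))  # True

# Optional group: what the pattern actually matches inside each sentence.
print(re.search(r"(swimming )?pool", "a pool").group())             # pool
print(re.search(r"(swimming )?pool", "the swimming pool").group())  # swimming pool

# Lazy version: at the "s" of "swimming", skipping the group fails
# (the text there is not "pool"), so the group is included after all.
print(re.search(r"(swimming )??pool", "the swimming pool").group())  # swimming pool
```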
Solution 5:
Running the test harness from the Oracle documentation with the reluctant "once or not at all" quantifier X?? shows that, when it is the entire pattern, it always produces an empty match.
$ java RegexTestHarness
Enter your regex: x?
Enter input string to search: xx
I found the text "x" starting at index 0 and ending at index 1.
I found the text "x" starting at index 1 and ending at index 2.
I found the text "" starting at index 2 and ending at index 2.
Enter your regex: x??
Enter input string to search: xx
I found the text "" starting at index 0 and ending at index 0.
I found the text "" starting at index 1 and ending at index 1.
I found the text "" starting at index 2 and ending at index 2.
https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
On its own, it seems identical to the empty pattern.
Enter your regex:
Enter input string to search: xx
I found the text "" starting at index 0 and ending at index 0.
I found the text "" starting at index 1 and ending at index 1.
I found the text "" starting at index 2 and ending at index 2.
Enter your regex:
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.
Enter your regex: x??
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.
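The always-empty behaviour only holds while x?? is the whole pattern. The sketch below, in Python rather than the Java harness, probes each index directly with Pattern.match(string, pos) instead of iterating over all matches, since engines may differ in how they resume after an empty match. The x??y pattern is my own addition to show the contrast.

```python
import re

greedy = re.compile(r"x?")
lazy = re.compile(r"x??")

# Try a match starting at each index of "xx" (0, 1, and the end, 2).
print([greedy.match("xx", i).group() for i in range(3)])  # ['x', 'x', '']
print([lazy.match("xx", i).group() for i in range(3)])    # ['', '', '']

# As soon as something must follow, the lazy quantifier gives in.
print(re.search(r"x??y", "xy").group())    # xy
print(re.fullmatch(r"x??", "x").group())   # x
```

So ?? prefers the empty match, but it will still consume the x when that is the only way for the overall pattern to succeed.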