Using explicitly numbered repetition instead of question mark, star and plus
I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:
Explicit            Shorthand
(something){0,1}    (something)?
(something){1}      (something)
(something){0,}     (something)*
(something){1,}     (something)+
The questions are:
- Are these two forms identical? What if you add possessive/reluctant modifiers?
- If they are identical, which one is more idiomatic? More readable? Simply "better"?
Solution 1:
To my knowledge they are identical. I think there may be a few engines out there that don't support the numbered syntax, but I'm not sure which. I vaguely recall a question on SO a few days ago where explicit notation wouldn't work in Notepad++.
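As a quick sanity check of the "identical" claim, here is a minimal sketch using Python's re module. It only speaks for that one engine, and the sample strings and the helper names (first_span, forms_agree) are made up for illustration. It covers the reluctant variants but skips possessive ones, since support for those varies by engine and version.

```python
import re

# (explicit, shorthand) pairs from the table in the question,
# plus the reluctant (lazy) variants formed by appending "?".
PAIRS = [
    (r"(ab){0,1}", r"(ab)?"),
    (r"(ab){1}", r"(ab)"),
    (r"(ab){0,}", r"(ab)*"),
    (r"(ab){1,}", r"(ab)+"),
    (r"(ab){0,1}?", r"(ab)??"),
    (r"(ab){0,}?", r"(ab)*?"),
    (r"(ab){1,}?", r"(ab)+?"),
]

# Arbitrary sample inputs chosen for this sketch.
SAMPLES = ["", "ab", "abab", "ababab", "xab", "abx"]


def first_span(pattern, text):
    """Return the (start, end) span of the first match, or None."""
    m = re.search(pattern, text)
    return m.span() if m else None


def forms_agree():
    """True if every explicit form matches the same span as its shorthand."""
    return all(
        first_span(explicit, s) == first_span(shorthand, s)
        for explicit, shorthand in PAIRS
        for s in SAMPLES
    )


print(forms_agree())
```

Agreement on a handful of samples is evidence, not proof, but it is consistent with the two notations being interchangeable in this engine.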
The only time I would use explicitly numbered repetition is when the repetition is greater than 1:
- Exactly two: {2}
- Two or more: {2,}
- Two to four: {2,4}
I tend to prefer these, especially when the repeated pattern is more than a few characters. If you have to match 3 digits, some people like to write \d\d\d, but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, down the road if that number ever needs to change, I only need to change {3} to {n} and not re-parse the regex in my head or worry about messing it up; it requires less mental effort.
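To illustrate the "only the count changes" point, here is a small Python sketch. The digit_run helper is hypothetical, something I made up for this example; it just builds \d{n} for a given n.

```python
import re


def digit_run(n):
    """Build a compiled pattern for exactly n digits (hypothetical helper)."""
    # Doubled braces are literal braces in an f-string, so n=3 yields \d{3}.
    return re.compile(rf"\d{{{n}}}")


# Both spellings accept the same three-digit input.
same = bool(re.fullmatch(r"\d\d\d", "123")) and bool(digit_run(3).fullmatch("123"))
print(same)

# Changing the requirement to four digits is a one-character edit.
print(digit_run(4).pattern)
```

With the \d\d\d spelling, the same change means counting atoms by eye, which is exactly the mental effort the numbered form avoids.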
If that criterion isn't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic) and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to code review their patterns, and that's when I would suggest changing those occurrences to the shorthand notation to save space and, IMO, improve readability.
Solution 2:
I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:
/^
abc{2,5}
xyz{0,1}
foo{3,12}
bar{1,}
$/x
But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use ?, * or + instead.
And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.
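Here is a rough sketch of that backreference case in Python's re module. Other engines resolve \10 differently, so treat this as one engine's behavior only. The compiles helper is made up for the example, and the two alternatives shown (a non-capturing group and a one-character class) are just the "other ways" I happen to know work in Python.

```python
import re


def compiles(pattern):
    """Return True if Python's re accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# With a single group, \10 is read as a reference to group 10 and rejected.
print(compiles(r"(a)\10"))

# {1} ends the backreference, so the trailing 0 is a literal digit.
isolated = [
    r"(a)\1{1}0",   # the {1} trick
    r"(a)(?:\1)0",  # wrap the backreference in a non-capturing group
    r"(a)\1[0]",    # put the digit in a character class
]
print(all(re.fullmatch(p, "aa0") for p in isolated))
```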