Matching an optional substring in a regex
I'm developing an algorithm to parse a number out of a series of short-ish strings. These strings are somewhat regular, but there are a few different general forms and several exceptions. I'm trying to build a set of regexes that will handle the various forms and exceptions; I'll apply them one after another to see if I get a match.
One of these forms goes something like this:
X (Y) Z
Where:
- X is a number I want to capture.
- Z is static, pre-defined text. It's basically how I determine whether this particular form is applicable or not.
- Y is a string of unknown length and content, surrounded by parentheses.
Also: Y is optional; it doesn't always appear in a string with Z and X. So, I want to be able to extract the numbers from all of these strings:
10 Z
20 (foo) Z
30 (bar) Z
Right now, I have a regex that will capture the first one:
([0-9]+) +Z
My problem is that I don't know how to construct a regex that will match a series of characters if and only if they're enclosed in parentheses. Can this be done in a single regex?
(\d+)\s+(\(.*?\))?\s?Z
Note the escaped parentheses, and the ? (zero or one) quantifiers. Any of the groups you don't want to capture can be made non-capturing by opening them with (?: instead of a plain (.
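Here is a minimal sketch of how that pattern could be applied to the three sample strings from the question. The question doesn't name a language, so Python is an assumption, as are the names PATTERN and extract_number; the literal Z stands in for the real static text. The optional parenthesised part is written as a non-capturing group so that group 1 is always the number.

```python
import re

# "Z" is a placeholder for the static text; substitute the real literal.
# (?: ... )? makes the parenthesised part optional without capturing it.
PATTERN = re.compile(r"(\d+)\s+(?:\(.*?\))?\s?Z")

def extract_number(text):
    """Return the captured number as an int, or None if this form does not apply."""
    match = PATTERN.search(text)
    return int(match.group(1)) if match else None

samples = ["10 Z", "20 (foo) Z", "30 (bar) Z", "no number here"]
results = [extract_number(s) for s in samples]
print(results)  # -> [10, 20, 30, None]
```

Returning None on a failed match makes it easy to fall through to the next regex in the set when this form doesn't apply.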
I agree about the spaces. \s is a better option there. I also changed the quantifier to ensure there are digits at the beginning. As for newlines, that depends on context: if the file is parsed line by line it won't be a problem. Another option is to anchor the start and end of the line (add a ^ at the front and a $ at the end).
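To illustrate the anchoring idea, here is a rough sketch in Python (an assumed language; LOOSE, ANCHORED and the sample text are made up for the example). With re.MULTILINE, ^ and $ match at the start and end of each line, so a multi-line string can be scanned in one pass and lines that merely contain something Z-like are rejected.

```python
import re

# Unanchored: happy to match in the middle of a longer line.
LOOSE = re.compile(r"(\d+)\s+(?:\(.*?\))?\s?Z")
# Anchored: the whole line has to fit the form. re.MULTILINE makes
# ^ and $ apply to every line rather than only the whole string.
ANCHORED = re.compile(r"^(\d+)\s+(?:\(.*?\))?\s?Z$", re.MULTILINE)

text = "10 Z\n20 (foo) Z\nid 30 Zebra\n40 (bar) Z"

# The third line is not really of the form "X (Y) Z", but the loose pattern accepts it.
loose_hit = LOOSE.search("id 30 Zebra") is not None
print(loose_hit)  # -> True

# findall returns the contents of group 1 for every matching line.
numbers = [int(n) for n in ANCHORED.findall(text)]
print(numbers)  # -> [10, 20, 40]
```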
This ought to work:
^\d+\s?(\([^\)]+\)\s?)?Z$
I haven't tested it, but here's the breakdown, so if there are any bugs left they should be pretty straightforward to find:
First the beginning:
^ = beginning of string
\d+ = one or more decimal digits
\s? = one optional whitespace character
Then this part:
(\([^\)]+\)\s?)?
Is actually:
(.............)?
Which makes the following contents optional: they are matched only if they appear in full
\([^\)]+\)\s?
\( = an opening bracket
[^\)]+ = a series of at least one character that is not a closing bracket
\) = followed by a closing bracket
\s? = followed by one optional whitespace character
And the end is made up of
Z$
Where
Z = your constant string
$ = the end of the string
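A quick way to check the breakdown is to run the pattern against a few inputs. The sketch below uses Python (an assumption, since no language is given) and makes two changes that are mine, not part of the pattern as posted: \d+ is wrapped in a capturing group so the number can actually be extracted, and the optional part is made non-capturing. FORM and parse are hypothetical names.

```python
import re

# Variant of the pattern broken down above, with two changes for illustration:
#   (\d+)       captures the number
#   (?: ... )?  keeps the optional "(Y) " part out of the captured groups
# "Z" is a placeholder for the real static text.
FORM = re.compile(r"^(\d+)\s?(?:\([^)]+\)\s?)?Z$")

def parse(text):
    """Return the leading number as an int, or None if the string does not fit."""
    m = FORM.match(text)
    return int(m.group(1)) if m else None

print(parse("10 Z"))        # -> 10
print(parse("20 (foo) Z"))  # -> 20
print(parse("10Z"))         # -> 10   (\s? allows the space to be missing)
print(parse("30 () Z"))     # -> None ([^)]+ needs at least one character)
print(parse("40  Z"))       # -> None (\s? allows at most one whitespace character)
```

The last two cases show where this pattern is stricter than one using \s+ and .*?; whether that strictness is wanted depends on the data.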