Using regexes, how to efficiently match strings between double quotes with embedded double quotes?
Let us have a text in which we want to match all strings between double quotes; but within these double quotes, there can be quoted double quotes. Example:
"He said \"Hello\" to me for the first time"
Using regexes, how do you match this efficiently?
Solution 1:
A very efficient solution to match such inputs is to use the normal* (special normal*)* pattern; this name is quoted from the excellent book by Jeffrey Friedl, Mastering Regular Expressions.
It is a pattern useful in general to match inputs consisting of regular entries (the normal part) with separators in between (the special part).
Note that, like all things regex, it should be used only when there is no better choice; while one could use this pattern for parsing CSV data, for instance, if you use Java, you're better off using OpenCSV instead.
Also note that while the quantifiers in the pattern name are stars (i.e., zero or more), you can vary them to suit your needs.
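To make the shape of the pattern concrete, here is a minimal Python sketch that assembles it from its two parts. The helper name unrolled and its parameters first, inner and repeat are hypothetical, introduced only for illustration; the helper also uses a non-capturing group (?:...) for the repeated part, which is a choice of this sketch.

```python
import re

def unrolled(normal, special, first='*', inner='*', repeat='*'):
    """Build normal<first>(?:special normal<inner>)<repeat> as a regex string.

    Hypothetical helper: `normal` and `special` are regex fragments, and the
    three quantifier arguments default to the stars of the canonical form.
    """
    return f'{normal}{first}(?:{special}{normal}{inner}){repeat}'

# Canonical form: every quantifier is a star.
canonical = unrolled('[a-z]', '-')

# A variant with different quantifiers: at least one letter on each side.
variant = unrolled('[a-z]', '-', first='+', inner='+')

# The resulting strings are ordinary regexes.
matches_variant = re.fullmatch(variant, 'a-b-c') is not None
```

Keeping normal and special as separate fragments makes it easier to see which quantifier you are changing when you adapt the pattern.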
Strings with embedded double quotes
Let us take the above example again; and please consider that this text sample may be anywhere in your input:
"He said \"Hello\" to me for the first time"
No matter how hard you try, no amount of "dot plus greedy/lazy quantifiers" magic will help you solve it. Instead, categorize the input between quotes as normal and special:
- normal is anything but a backslash or a double quote: [^\\"]
- special is the sequence of a backslash followed by a double quote: \\"
Substituting this into the normal* (special normal*)* pattern gives the following regex:
[^\\"]*(\\"[^\\"]*)*
Adding the double quotes around to match the full text gives the final regex:
"[^\\"]*(\\"[^\\"]*)*"
You will note that this will also match empty quoted strings.
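As a quick check, here is a small Python sketch that runs the final regex against the sample and compares it with a naive "dot plus lazy quantifier" attempt. The surrounding sentence in text and the helper full_matches are made up for the demonstration.

```python
import re

# The final regex from above, written as a raw string so that every
# backslash reaches the regex engine unchanged.
QUOTED = re.compile(r'"[^\\"]*(\\"[^\\"]*)*"')

# A naive "dot plus lazy quantifier" attempt, for comparison.
NAIVE = re.compile(r'".*?"')

# Illustrative input: the sample string plus an empty quoted string.
text = r'She wrote: "He said \"Hello\" to me for the first time" and "" too.'

def full_matches(pattern, s):
    # finditer plus group(0) returns whole matches even though the
    # pattern contains a capturing group.
    return [m.group(0) for m in pattern.finditer(s)]

quoted = full_matches(QUOTED, text)
naive = full_matches(NAIVE, text)

# quoted holds the complete sample string and the empty string "".
# naive stops at the first escaped quote: its first match is "He said \"
```

Note that, as defined here, special only covers a backslash followed by a double quote, so a backslash followed by any other character inside the quotes makes the match fail.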
Words with dash separators
Here we will have to use a variant on the quantifiers, since:
- we don't want empty words,
- we don't want words starting with a dash,
- when a dash appears, at least one letter must follow it before another dash, if any.
For simplicity, we will also suppose that only lowercase, ASCII letters are allowed.
Sample input:
the-word-to-match
Let us decompose again into normal and special:
- normal: a lowercase, ASCII letter: [a-z]
- special: the dash: -
The canonical form of the pattern would be:
[a-z]*(-[a-z]*)*
But as we said:
- we don't want words starting with a dash: the first * should become +
- when a dash is found, there should be at least one letter after it: the second * should become +
We end up with:
[a-z]+(-[a-z]+)*
Adding word anchors around it to obtain the final result:
\b[a-z]+(-[a-z]+)*\b
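The following Python sketch checks the three requirements against the final regex. The function name is_dashed_word and the test inputs are illustrative only.

```python
import re

# The final regex from above: letters, then zero or more "dash + letters".
WORD = re.compile(r'\b[a-z]+(-[a-z]+)*\b')

def is_dashed_word(s):
    # fullmatch requires the whole string to be one dash-separated word.
    return WORD.fullmatch(s) is not None

accepted = [s for s in ('the-word-to-match', 'word') if is_dashed_word(s)]
rejected = [s for s in ('', '-word', 'word-', 'a--b') if not is_dashed_word(s)]

# Finding such words inside a longer text.
found = [m.group(0) for m in WORD.finditer('please find the-word-to-match now')]
```

Here fullmatch is used to validate a whole string, while finditer locates words inside running text, where the word anchors do the delimiting.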
Other operator variations
The examples above limit themselves to replacing * with +, but of course you can have as many variations as you wish. One classic example would be an IP address:
- normal is up to three digits: \d{1,3}
- special is the dot: \.
- the first normal appears only once, therefore no quantifier
- the normal inside the (special normal*) also appears only once, therefore no quantifier
- finally, the (special normal*) part appears exactly three times, therefore {3}
Which gives the expression (decorated with word anchors):
\b\d{1,3}(\.\d{1,3}){3}\b
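A short Python sketch of this regex in use follows; the name IPV4_SHAPE and the sample inputs are illustrative. Keep in mind that the expression only checks the shape of the address (four groups of one to three digits), not that each group is at most 255.

```python
import re

# The regex from above: one group of digits, then exactly three "dot + digits".
IPV4_SHAPE = re.compile(r'\b\d{1,3}(\.\d{1,3}){3}\b')

# Locating an address inside a longer line.
hit = IPV4_SHAPE.search('host 10.0.0.255 is up')
address = hit.group(0) if hit else None

# Too few groups: no match.
too_short = IPV4_SHAPE.fullmatch('1.2.3')

# Shape-only check: out-of-range numbers still match.
out_of_range = IPV4_SHAPE.fullmatch('999.999.999.999')
```

If the numeric range matters, it is usually simpler to validate the matched text afterwards than to encode the range in the regex.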
Conclusion
This pattern's flexibility makes it one of the most useful tools in your regex toolbox. Many problems should not be solved with regexes when a dedicated library exists, but in some situations you have to use regexes. And this will become one of your best friends once you have practiced with it a bit!
Tips
- It is more than likely that you don't need (or want) to capture the repeated part (the (special normal*) part); it is therefore recommended that you use a non-capturing group. For instance, use "[^\\"]*(?:\\"[^\\"]*)*" for quoted strings. In fact, even if you wanted to capture, doing so would almost never lead to the desired result in this case, because a repeated capturing group only ever gives you the last capture (all previous repetitions are overwritten), unless you are using this pattern in .NET. (thanks @ohaal)
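The "last capture only" behaviour can be seen in a small Python sketch using the dash-separated word pattern. The variable names are illustrative, and the findall behaviour shown is specific to Python's re module.

```python
import re

capturing = re.compile(r'[a-z]+(-[a-z]+)*')
non_capturing = re.compile(r'[a-z]+(?:-[a-z]+)*')

# The group matched "-b" and then "-c"; only the last repetition is kept.
last_capture = capturing.fullmatch('a-b-c').group(1)

# With a capturing group, findall returns the group contents instead of
# the whole matches.
with_group = capturing.findall('a-b-c x-y')

# With a non-capturing group, findall returns the whole matches.
without_group = non_capturing.findall('a-b-c x-y')
```

In Python this is a practical reason to prefer the non-capturing form: functions such as findall change what they return as soon as the pattern contains a capturing group.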