Regular expressions: Ensuring b doesn't come between a and c
Here's something I'm trying to do with regular expressions, and I can't figure out how. I have a big file, and the strings abc, 123 and xyz appear multiple times throughout the file.
I want a regular expression to match a substring of the big file that begins with abc, contains 123 somewhere in the middle, ends with xyz, and has no other instances of abc or xyz in the substring besides the start and the end.
Is this possible with regular expressions?
When your left- and right-hand delimiters are single characters, this is easily solved with negated character classes. So, if your match is between a and c and should not contain b (literally), you may use (demo):
a[^abc]*c
This is the same technique you use when you want to make sure there is a b in between the closest a and c (demo):
a[^abc]*b[^ac]*c
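As a quick check of the two single-character patterns above, here is a minimal Python sketch. The sample strings and the variable names no_b and with_b are made up for illustration only.

```python
import re

# "a ... c" with no a, b or c in between
no_b = re.compile(r'a[^abc]*c')
# "a ... b ... c" where a and c are the closest pair around the b
with_b = re.compile(r'a[^abc]*b[^ac]*c')

# "abc" is rejected by the first pattern because of the b
print(no_b.findall("a1c abc axc"))
# "aac" and "a-c" are rejected by the second pattern because they lack a b
print(with_b.findall("a1b2c aac abc a-c"))
```

The negated classes never need a lookahead because a single excluded character is enough to stop the repetition.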
When your left- and right-hand delimiters are multi-character strings, you need a tempered greedy token:
abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz
See the regex demo
To make sure it matches across lines, use the re.DOTALL flag when compiling the regex.
Note that to get better performance out of such a heavy pattern, you should consider unrolling it. This can be done with negated character classes and negative lookaheads.
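One possible unrolled form of the tempered greedy token pattern is sketched below. It is only an illustration of the idea: each (?:(?!...).)* is replaced by a negated character class that consumes the "safe" characters in bulk, plus an alternation that lets a, x or 1 through only when it does not start one of the forbidden strings. The names tempered and unrolled and the sample string are assumptions for the demo, and no timing claims are made here.

```python
import re

# Original tempered greedy token version, for comparison
tempered = re.compile(r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)

# Unrolled sketch: negated classes match newlines too, so re.DOTALL is not needed
unrolled = re.compile(
    r'abc'
    r'[^ax1]*(?:(?:a(?!bc)|x(?!yz)|1(?!23))[^ax1]*)*'  # no abc, xyz or 123 here
    r'123'
    r'[^ax]*(?:(?:a(?!bc)|x(?!yz))[^ax]*)*'            # no abc or xyz here
    r'xyz'
)

s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(unrolled.findall(s))
print(unrolled.findall(s) == tempered.findall(s))
```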
Pattern details (for the tempered greedy token pattern above):
- abc - match abc
- (?:(?!abc|xyz|123).)* - match any character that is not the starting point of an abc, xyz or 123 character sequence
- 123 - a literal string 123
- (?:(?!abc|xyz).)* - any character that is not the starting point of an abc or xyz character sequence
- xyz - a trailing substring xyz
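The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is a minimal sketch; the name pattern and the sample strings are only illustrative.

```python
import re

pattern = re.compile(r"""
    abc                      # the leading delimiter
    (?:(?!abc|xyz|123).)*    # any char that does not start abc, xyz or 123
    123                      # the required middle string
    (?:(?!abc|xyz).)*        # any char that does not start abc or xyz
    xyz                      # the trailing delimiter
""", re.VERBOSE | re.DOTALL)

m = pattern.search("abc one\n123 two xyz")
print(repr(m.group(0)))
# No 123 between the closest abc and xyz, so there is no match
print(pattern.search("abc abc xyz"))
```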
Note that if re.S (the short alias of re.DOTALL) is used, . matches any character, including a newline.
See the Python demo:
import re
p = re.compile(r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)
s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(p.findall(s))
# => ['abc 123 xyz', 'abc 123 xyz', 'abc text 123 xyz']
Using PCRE, a solution would be the following, which uses the m flag. If you want to match only whole lines, add ^ and $ at the beginning and end, respectively:
abc(?!.*(abc|xyz).*123).*123(?!.*(abc|xyz).*xyz).*xyz
Debuggex Demo
The comment by hvd is quite appropriate, and this just provides an example. In SQL, for instance, I think it would be clearer to do:
where val like 'abc%123%xyz' and
val not like 'abc%abc%' and
val not like '%xyz%xyz'
I imagine something quite similar is simple to do in other environments.
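For instance, a rough Python counterpart of the three LIKE conditions might look like the sketch below. The helper name is_wanted is hypothetical, and like the SQL it tests a whole value rather than searching for substrings inside a larger file.

```python
def is_wanted(val: str) -> bool:
    """Mirror the three LIKE conditions with plain string operations."""
    return (
        val.startswith("abc")
        and val.endswith("xyz")
        and "123" in val[3:-3]      # like 'abc%123%xyz'
        and "abc" not in val[3:]    # not like 'abc%abc%'
        and "xyz" not in val[:-3]   # not like '%xyz%xyz'
    )

for candidate in ["abc 123 xyz", "abc abc 123 xyz", "abc text xyz", "abc 123 xyz xyz"]:
    print(candidate, is_wanted(candidate))
```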
You could use lookaround.
/^abc(?!.*abc).*123.*(?<!xyz.*)xyz$/g
(I've not tested it.)
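Python's re module rejects the variable-width lookbehind (?<!xyz.*), so the pattern as written does not compile there. The sketch below is a lookahead-only variant, assuming the intent is "the whole string starts with abc, ends with xyz, contains 123, and has no other abc or xyz". The name whole and the sample inputs are illustrative, not part of the original suggestion.

```python
import re

# (?!.*abc)      : no further abc after the leading one
# (?!.*xyz.*xyz) : at most one xyz in the rest, i.e. only the trailing one
whole = re.compile(r'^abc(?!.*abc)(?!.*xyz.*xyz).*123.*xyz$')

for candidate in ["abc 123 xyz", "abc abc 123 xyz", "abc 123 xyz xyz", "abc xyz"]:
    print(candidate, bool(whole.search(candidate)))
```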