How to match "anything up until this sequence of characters" in a regular expression?
You didn't specify which flavor of regex you're using, but this will work in any of the most popular ones that can be considered "complete".
/.+?(?=abc)/
How it works
The .+? part is the un-greedy version of .+ (one or more of anything). When we use .+, the engine first matches as much as it can. Then, if something else follows in the regex, it backs off one step at a time, trying to match that following part. This is the greedy behavior: match as much as possible while still satisfying the pattern.
When using .+?, instead of matching everything at once and backing off for the rest of the pattern (if any), the engine matches one character at a time until the subsequent part of the regex matches (again, if any). This is the un-greedy behavior: match as few characters as possible while still satisfying the pattern.
/.+X/  ~ "abcXabcXabcX"        /.+/  ~ "abcXabcXabcX"
          ^^^^^^^^^^^^                  ^^^^^^^^^^^^
/.+?X/ ~ "abcXabcXabcX"        /.+?/ ~ "abcXabcXabcX"
          ^^^^                          ^
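The four cases in the diagram can be checked directly. Here is a minimal sketch using Python's re module (the choice of Python and the variable names are only for illustration; the behavior should be the same in any engine with lazy quantifiers):

```python
import re

text = "abcXabcXabcX"

# Greedy: .+ grabs everything, then backs off just enough for X to match.
greedy_x = re.search(r".+X", text).group()
# Greedy with nothing after it: the whole string is matched.
greedy = re.search(r".+", text).group()
# Lazy: .+? grows one character at a time until X can match.
lazy_x = re.search(r".+?X", text).group()
# Lazy with nothing after it: one character is enough to satisfy .+?
lazy = re.search(r".+?", text).group()

print(greedy_x)  # abcXabcXabcX
print(greedy)    # abcXabcXabcX
print(lazy_x)    # abcX
print(lazy)      # a
```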
Following that we have (?={contents}), a zero-width assertion known as a lookaround (here, a positive lookahead). This grouped construct tests whether its contents match at the current position, but it does not consume any characters (zero width). It only reports whether there is a match or not (assertion).
Thus, in other terms, the regex /.+?(?=abc)/ means:
Match as few characters as possible until an "abc" is found, without counting the "abc".
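As a rough illustration of the lookahead version, the sketch below uses Python's re module on a made-up sample string (both are assumptions for the example, not part of the question):

```python
import re

text = "I have this very nice senabctence"

# .+? expands lazily; (?=abc) only checks that "abc" comes next
# and does not become part of the match.
m = re.search(r".+?(?=abc)", text)
prefix = m.group()
print(prefix)  # I have this very nice sen

# The match stops right before the "abc" it looked ahead to.
rest = text[m.end():]
print(rest)  # abctence

# If "abc" never appears, the lookahead can never succeed.
no_match = re.search(r".+?(?=abc)", "no marker here")
print(no_match)  # None
```

Note that .+? needs at least one character, so a string that starts with "abc" has no match ending before that first occurrence; .*?(?=abc) would allow an empty match there.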
If you're looking to capture everything up to "abc":
/^(.*?)abc/
Explanation:
( ) captures the expression inside the parentheses for access using $1, $2, etc.
^ matches the start of the line
.* matches anything; ? makes it non-greedy (match the minimum number of characters required) - [1]
[1] The reason this is needed: regexes are greedy by default, meaning they match as much as possible. Take the following string:
whatever whatever something abc something abc
Here the group in /^(.*)abc/ would capture "whatever whatever something abc something ". Adding the non-greedy modifier ? makes the group capture only "whatever whatever something ".
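The difference between the greedy and non-greedy capture can be seen with a short sketch, here in Python's re module (an assumption for illustration; group(1) plays the role of $1):

```python
import re

text = "whatever whatever something abc something abc"

# Greedy group: runs to the LAST "abc" in the string.
greedy = re.match(r"^(.*)abc", text)
# Non-greedy group: stops at the FIRST "abc".
lazy = re.match(r"^(.*?)abc", text)

greedy_capture = greedy.group(1)
lazy_capture = lazy.group(1)

print(repr(greedy_capture))  # 'whatever whatever something abc something '
print(repr(lazy_capture))    # 'whatever whatever something '

# The overall match (group 0) still includes the trailing "abc".
print(lazy.group(0))  # whatever whatever something abc
```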
As @Jared Ng and @Issun pointed out, the key to solving this kind of regex problem, such as "matching everything up to a certain word or substring" or "matching everything after a certain word or substring", is "lookaround" zero-length assertions. Read more about them here.
In your particular case, it can be solved with a positive lookahead: .+?(?=abc)
A picture is worth a thousand words. See the detailed explanation in the screenshot.
What you need is a lookaround assertion like .+?(?=abc).
See: Lookahead and Lookbehind Zero-Length Assertions
Be aware that [abc] isn't the same as abc. Inside brackets it's not a string: each character is just one of the possibilities for a single position. Outside the brackets it becomes the string.
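A small sketch of the difference, using Python's re module and made-up sample strings (both chosen only for illustration):

```python
import re

# [abc] is a character class: it matches ONE character that is a, b, or c.
class_hits = re.findall(r"[abc]", "a cab")
print(class_hits)  # ['a', 'c', 'a', 'b']

# abc is a literal sequence: all three characters, in that order.
literal_hits_missing = re.findall(r"abc", "a cab")
literal_hits_present = re.findall(r"abc", "xabcx")
print(literal_hits_missing)  # []
print(literal_hits_present)  # ['abc']
```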
For regex in Java, and I believe also in most regex engines, if you want to include the last part this will work:
.+?(abc)
For example, in this line:
I have this very nice senabctence
selecting all characters up to "abc", and including abc, using our regex gives the result: I have this very nice senabc
Test this out: https://regex101.com/r/mX51ru/1
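The answer above is about Java; as a rough cross-check, the same pattern behaves the same way in Python's re module (used here only as an assumption for illustration):

```python
import re

text = "I have this very nice senabctence"

# Without a lookahead, "abc" is consumed and becomes part of the match.
m = re.search(r".+?(abc)", text)

whole = m.group(0)   # everything up to and including "abc"
marker = m.group(1)  # just the captured "abc"

print(whole)   # I have this very nice senabc
print(marker)  # abc
```

The parentheses are only needed if you want the "abc" as a separate group; .+?abc matches the same overall text.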