Can I improve performance of this regular expression further

I am trying to fetch thread names from the thread dumps file. The thread names are usually contained within "double quotes" in the first line of each thread dump. It may look as simple as follows:

"THREAD1" daemon prio=10 tid=0x00007ff6a8007000 nid=0xd4b6 runnable [0x00007ff7f8aa0000]

Or as big as follows:

"[STANDBY] ExecuteThread: '43' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x00007ff71803a000 nid=0xd3e7 in Object.wait() [0x00007ff7f8ae1000]

The regular expression I wrote is simple one: "(.*)". It captures everything inside double quotes as a group. However it causes heavy backtracking thus requiring a lot of steps, as can be seen here. Verbally we can explain this regex as "capture anything that is enclosed inside double quotes as a group"

So I came up with another regex which performs the same: "([^\"])". Verbally we can describe this regex as "capture any number of non-double quote characters that are enclosed inside double quotes". I did not found any fast regex than this. It does not perform any backtracking and hence it requires minimum steps as can be seen here.

I told this above to my colleague. He came up with yet another one: "(.*?)". I didnt get how it works. It performs considerable less backtracking than the first one but is a bit slower than the second one as can be seen here. However

  • I don't get why the backtracking stops early.
  • I understand ? is a quantifier which means once or not at all. However I dont understand how once or not at all is getting used here.
  • In fact I am not able to guess how can we describe this regex verbally.

My colleague tried explaining me but I am still not able to understand it completely. Can anyone explain?


Brief explanation and a solution

The "(.*)" regex involves a lot of backtracking because it finds the first " and then grabs the whole string and backtracks looking for the " that is closest to the end of string. Since you have a quoted substring closer to the start, there's more backtracking than with "(.*?)" as this lazy quantifier *? makes the regex engine look for the closest " after the first " found.

The negated character class solution "([^"]*)" is the best from the 3 because it does not have to grab everything, just all characters other than ". However, to stop any backtracking and make the expression ultimately efficient, you can use possessive quantifiers.

If you need to match strings like " + no quotes here + ", use

"([^"]*+)"

or even you do not need to match the trailing quote in this situation:

"([^"]*+)

See regex demo

In fact I am not able to guess how can we describe this regex verbally.

The latter "([^"]*+) regex can be described as

  • " - find the first " symbol from the left of the string
  • ([^"]*+) - match and capture into Group 1 zero or more symbols other than ", as many as possible, and once the engine finds a double quote, the match is returned immediately, without backtracking.

Quantifiers

More information on quantifiers from Rexegg.com:

A* Zero or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
A*? Zero or more As, as few as needed to allow the overall pattern to match (lazy)
A*+ Zero or more As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive)

As you see, ? is not a separate quantifier, it is a part of another quantifier.

I advise to read more about why Lazy Quantifiers are Expensive and that Negated Class Solution is really safe and fast to deal with your input string (where you just match a quote followed by non-quotes and then a final quote).

Difference between .*?, .* and [^"]*+ quantifiers

  • Greedy "(.*)" solution works like this: checks each symbol from left to right looking for ", and once found grabs the whole string up to the end and checks each symbol if it is equal to ". Thus, in your input string, it backtracks 160 times.

enter image description here

  • Lazy "(.*?)" solution works like this: the engine finds the first " and then advances in the pattern and tries the next token (which is ") against the T in THREAD1. This fails, so the engine backtracks and allows the .*? to expand its match by one item, so that it matches the T. Once again, the engine advances in the pattern. It now tries the " against the H in THREAD1. This fails, so the engine backtracks and allows the .*? to expand and match the H. The process then repeats itself—the engine advances, fails, backtracks, allows the lazy .*? to expand its match by one item, advances, fails and so on. For each character matched by the .*?, the engine has to backtrack. From a computing standpoint, this process of matching one item, advancing, failing, backtracking, expanding is "expensive".

Since the next " is not far, the number of backtrack steps is much fewer than with greedy matching.

enter image description here

  • possessive quantifier solution with a negated character class "([^"]*+)" works like this: the engine finds the leftmost ", and then grabs all characters that are not " up to the first ". The negated character class [^"]*+ greedily matches zero or more characters that are not a double quote. Therefore, we are guaranteed that the dot-star will never jump over the first encountered ". This is a more direct and efficient way of matching between some delimiters. Note that in this solution, we can fully trust the * that quantifies the [^"]. Even though it is greedy, there is no risk that [^"] will match too much as it is mutually exclusive with the ". This is the contrast principle from the regex style guide [see source].

Note that the possessive quantifier does not let the regex engine backtrack into the subexpression, once matched, the symbols between " become one hard block that cannot be "re-sorted" due to some "inconveniences" met by the regex engine, and it will be unable to shift any characters from and into this block of text.

For the current expression, it does not make a big difference though.

enter image description here