Java regex to match start/end tags causes stack overflow
Solution 1:
Some more details on the origin of the stack overflow issue:
Sometimes the regex Pattern class will throw a StackOverflowError. This is a manifestation of the known bug #5050507, which has been in the java.util.regex package since Java 1.4. The bug is here to stay because it has "won't fix" status. This error occurs because the Pattern class compiles a regular expression into a small program which is then executed to find a match. This program is used recursively, and sometimes when too many recursive calls are made this error occurs. See the description of the bug for more details. It seems it's triggered mostly by the use of alternations.
Your regex (that has alternations) is matching any 1+ characters between two tags.
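To see the failure mode in isolation, here is a minimal sketch. The pattern <Data>(?:.|\n)+</Data> is an assumption standing in for the regex from the question, and the class and method names are hypothetical. Whether the error appears, and at what input length, depends on the JVM version and the thread stack size, so the code reports the outcome instead of assuming it.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class AlternationOverflowDemo {
    // Assumed stand-in for the question's regex: an alternation repeated once per character.
    static final Pattern ALTERNATION = Pattern.compile("<Data>(?:.|\\n)+</Data>");

    // Builds "<Data>xxx...x</Data>" with the given payload length and reports the outcome.
    static String tryMatch(int payloadLength) {
        char[] payload = new char[payloadLength];
        Arrays.fill(payload, 'x');
        String input = "<Data>" + new String(payload) + "</Data>";
        try {
            return ALTERNATION.matcher(input).matches() ? "matched" : "no match";
        } catch (StackOverflowError e) {
            // The matcher ran out of stack while recursing through the repeated group.
            return "StackOverflowError";
        }
    }

    public static void main(String[] args) {
        System.out.println("10 chars: " + tryMatch(10));
        System.out.println("1,000,000 chars: " + tryMatch(1_000_000));
    }
}
```

A short payload matches normally. With a very long payload the second call may report StackOverflowError on a default-sized stack, which is the behavior described above.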
You may use a lazy dot matching pattern with the Pattern.DOTALL modifier (or the equivalent embedded flag (?s)) that will make the . match newline symbols as well:
(?s)<Data>(?<data>.+?)</Data>
See this regex demo
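In Java, the lazy pattern above could be used roughly as follows. This is only a sketch: the class name, the method name and the sample input are made up for illustration, and Pattern.DOTALL is passed as a flag in place of the embedded (?s).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyDotAllDemo {
    // Pattern.DOTALL here plays the same role as the embedded (?s) flag.
    static final Pattern LAZY = Pattern.compile("<Data>(?<data>.+?)</Data>", Pattern.DOTALL);

    // Returns the content of the first <Data>...</Data> block, or null if there is none.
    static String firstData(String input) {
        Matcher m = LAZY.matcher(input);
        return m.find() ? m.group("data") : null;
    }

    public static void main(String[] args) {
        // The captured text spans the line break and stops at the first closing tag.
        System.out.println(firstData("<Data>line 1\nline 2</Data><Data>other</Data>"));
    }
}
```

Because .+? requires at least one character, an empty <Data></Data> block is not matched by this pattern.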
However, lazy dot matching patterns still consume lots of memory in case of huge inputs. The best way out is to use an unroll-the-loop method:
<Data>(?<data>[^<]*(?:<(?!/?Data>)[^<]*)*)</Data>
See the regex demo
Details:
- <Data> - literal text <Data>
- (?<data> - start of the capturing group "data"
  - [^<]* - zero or more characters other than <
  - (?:<(?!/?Data>)[^<]*)* - 0 or more sequences of:
    - <(?!/?Data>) - a < that is not followed with Data> or /Data>
    - [^<]* - zero or more characters other than <
- ) - end of the "data" group
- </Data> - closing delimiter
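The breakdown above can be tried out with a small Java sketch. The class name, the method name and the sample input are hypothetical; only the pattern itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnrolledDataDemo {
    // The unrolled pattern from the answer; no DOTALL needed because [^<] already matches newlines.
    static final Pattern UNROLLED = Pattern.compile(
        "<Data>(?<data>[^<]*(?:<(?!/?Data>)[^<]*)*)</Data>");

    // Collects the "data" group of every <Data>...</Data> block found in the input.
    static List<String> allData(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = UNROLLED.matcher(input);
        while (m.find()) {
            result.add(m.group("data"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Inner tags such as <b> are allowed because their "<" is not followed by Data> or /Data>.
        String sample = "<Data>a <b>bold</b>\nline</Data> junk <Data>second</Data>";
        System.out.println(allData(sample));
    }
}
```

Two differences from the lazy version are worth knowing: every part of the "data" group is optional, so an empty <Data></Data> block is matched with an empty capture, and a match can never run across another <Data> opening tag.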