Regex to replace all \n in a String, but no those inside [code] [/code] tag

I need help to replace all \n (new line) caracters for
in a String, but not those \n inside [code][/code] tags. My brain is burning, I can't solve this by my own :(

Example:

test test test
test test test
test
test

[code]some
test
code
[/code]

more text

Should be:

test test test<br />
test test test<br />
test<br />
test<br />
<br />
[code]some
test
code
[/code]<br />
<br />
more text<br />

Thanks for your time. Best regards.

I would suggest a (simple) parser, and not a regular expression. Something like this (bad pseudocode):

stack elementStack;

foreach(char in string) {
    if(string-from-char == "[code]") {
        elementStack.push("code");
        string-from-char = "";
    }

    if(string-from-char == "[/code]") {
        elementStack.popTo("code");
        string-from-char = "";
    }

    if(char == "\n" && !elementStack.contains("code")) {
        char = "<br/>\n";
    }
}

You've tagged the question regex, but this may not be the best tool for the job.

You might be better using basic compiler building techniques (i.e. a lexer feeding a simple state machine parser).

Your lexer would identify five tokens: ("[code]", '\n', "[/code]", EOF, :all other strings:) and your state machine looks like:

state    token    action
------------------------
begin    :none:   --> out
out      [code]   OUTPUT(token), --> in
out      \n       OUTPUT(break), OUTPUT(token)
out      *        OUTPUT(token)
in       [/code]  OUTPUT(token), --> out
in       *        OUTPUT(token)
*        EOF      --> end

EDIT: I see other poster discussing the possible need for nesting the blocks. This state machine won't handle that. For nesting blocks, use a recursive decent parser (not quite so simple but still easy enough and extensible).

EDIT: Axeman notes that this design excludes the use of "[/code]" in the code. An escape mechanism can be used to beat this. Something like add '\' to your tokens and add:

state    token    action
------------------------
in       \        -->esc-in
esc-in   *        OUTPUT(token), -->in
out      \        -->esc-out
esc-out  *        OUTPUT(token), -->out

to the state machine.

The usual arguments in favor of machine generated lexers and parsers apply.

This seems to do it:

private final static String PATTERN = "\\*+";

public static void main(String args[]) {
    Pattern p = Pattern.compile("(.*?)(\\[/?code\\])", Pattern.DOTALL);
    String s = "test 1 ** [code]test 2**blah[/code] test3 ** blah [code] test * 4 [code] test 5 * [/code] * test 6[/code] asdf **";
    Matcher m = p.matcher(s);
    StringBuffer sb = new StringBuffer(); // note: it has to be a StringBuffer not a StringBuilder because of the Pattern API
    int codeDepth = 0;
    while (m.find()) {
        if (codeDepth == 0) {
            m.appendReplacement(sb, m.group(1).replaceAll(PATTERN, ""));
        } else {
            m.appendReplacement(sb, m.group(1));
        }
        if (m.group(2).equals("[code]")) {
            codeDepth++;
        } else {
            codeDepth--;
        }
        sb.append(m.group(2));
    }
    if (codeDepth == 0) {
        StringBuffer sb2 = new StringBuffer();
        m.appendTail(sb2);
        sb.append(sb2.toString().replaceAll(PATTERN, ""));
    } else {
        m.appendTail(sb);
    }
    System.out.printf("Original: %s%n", s);
    System.out.printf("Processed: %s%n", sb);
}

Its not a straightforward regex but I don't think you can do what you want with a straightforward regex. Not with handling nested elements and so forth.

As mentioned by other posters, regular expressions are not the best tool for the job because they are almost universally implemented as greedy algorithms. This means that even if you tried to match code blocks using something like:

(\[code\].*\[/code\])

Then the expression will match everything from the first [code] tag to the last [/code] tag, which is clearly not what you want. While there are ways to get around this, the resulting regular expressions are usually brittle, unintuitive, and downright ugly. Something like the following python code would work much better.

output = []
def add_brs(str):
    return str.replace('\n','<br/>\n')
# the first block will *not* have a matching [/code] tag
blocks = input.split('[code]')
output.push(add_brs(blocks[0]))
# for all the rest of the blocks, only add <br/> tags to
# the segment after the [/code] segment
for block in blocks[1:]:
    if len(block.split('[/code]'))!=1:
        raise ParseException('Too many or few [/code] tags')
    else:
        # the segment in the code block is pre, everything
        # after is post
        pre, post = block.split('[/code]')
        output.push(pre)
        output.push(add_brs(post))
# finally join all the processed segments together
output = "".join(output)

Note the above code was not tested, it's just a rough idea of what you'll need to do.

Regex to replace all \n in a String, but no those inside [code] [/code] tag

Related

Recent Posts