Replace "\n" depending on how the line starts

I am trying to replace line endings ('\n') with a multi-character string ("<br>\n") in a file, so tr is not an option, and only for certain lines, depending on how they start.

What I want:

Document in:

# Title

paragraph,
with text over multiple lines

- list item
- other list item
 - sublist item
 - sublist item 2

Output:

# Title

paragraph,<br>
with text over multiple lines<br>

- list item
- other list item
 - sublist item
 - sublist item 2

As you probably guessed, I am trying to force a line break on single newlines (in paragraphs) when I later convert my Markdown document to HTML.

What I tried/know

I have looked up regex syntax and the basics of the 'sed' command. My understanding is that I need a negative lookbehind to skip lines with a particular beginning, then maybe a non-capturing group or positive lookbehind for the content of the line, then the actual match on \n and the text I want to replace it with.

Consider an example where I only exclude lines starting with (one or more '#' followed by a space) or ([optional spaces] then [a dash] then [one space]). What I'm currently using is:

#DOESN'T WORK (and ~/Documents/test/foo contains exactly the example I put above)
sed -z 's/(?<!#+\s|\s*-\s)(?:[^\n]*)\n/<br>\n/g' ~/Documents/test/foo

What I understand of my command

Maybe my understanding of sed/regexes in a bash context is wrong, so here is how I understand what I wrote:

sed -z     # -z flag to treat the document as one big string, apparently good when
           # dealing with newlines replacement (https://linuxhint.com/newline_replace_sed/)
           # string with the command
'          #'
s/         # s(ubstitute) command in sed
(?<!#+\s|\s*-\s) # negative lookbehind ignoring '#+\s' (one or more '#' followed by a
                 # space) and '\s*-\s' (0 or more spaces, a dash then a space)
(?:[^\n]*) # non-capturing group matching the content of the line (literally anything
           # but newline, 0 or more times) because there is something between the
           # beginning I want to ignore and the ending I want to replace (hence group),
           # and I do not want this to be replaced (hence non-capturing).
\n         # the thing I'm matching on, and replacing
/          # separator to announce the replacement for the match
<br>\n     # replacement
/g         # g flag because I want to replace all matching occurrences
'          #'
           # end of the command string

 ~/Documents/test/foo # my input/source

I've tried using grep -o, which has a lighter syntax, to show the matches (and to correct my regex), but the '!' in the negative lookbehind keeps breaking things and the -F flag doesn't seem to fix it.

Any help is appreciated. As long as you can provide an example capable of ignoring either "# Title\n" or " - list item\n" when replacing line endings, I'll figure out how to expand it.

PS

  1. Yes, I know that leaving '<br>' on the last line of a paragraph will look bad, but I'll fix that later; let's not make regexes insanely long in this small example.
  2. While my example is run on the command line, this is for use in a bash script (hence the tag), so answers should be compatible with bash or explain why they're not (I'm not very familiar with the differences, but I read here that some standards aren't shared).
  3. My environment is Pop!_OS 21.10 (but that shouldn't matter, should it?)

Thanks in advance


Solution 1:

Your regexp uses PCRE constructs (e.g. lookbehinds and non-capturing groups), but sed doesn't support PCREs. It only supports the POSIX regexp standards: BREs by default, and EREs if called with -E in GNU or BSD sed.
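
To illustrate what is possible within those limits, here is a minimal sketch that expresses the exclusion as a negated sed address rather than a lookbehind. It assumes GNU or BSD sed for -E, and the temp file is only a stand-in for your input file:

```shell
# Sample input (a hypothetical temp file standing in for ~/Documents/test/foo).
input=$(mktemp)
cat > "$input" <<'EOF'
# Title

paragraph,
with text over multiple lines

- list item
- other list item
 - sublist item
 - sublist item 2
EOF

# The address matches lines that are blank or that start (after optional
# whitespace) with '-' or '#'. The '!' negates it, so only the remaining
# lines get <br> appended at their end. No -z is needed, because sed
# already works line by line.
result=$(sed -E '/^[[:space:]]*([-#]|$)/!s/$/<br>/' "$input")
printf '%s\n' "$result"
rm -f "$input"
```

Note that the newline itself is never matched here: appending to the end of each selected line has the same effect as replacing "\n" with "<br>\n".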

I'd use awk for this for simplicity and portability. The following will work using any POSIX awk:

$ awk '{print $0 (/^[[:space:]]*([-#]|$)/ ? "" : "<br>")}' file
# Title

paragraph,<br>
with text over multiple lines<br>

- list item
- other list item
 - sublist item
 - sublist item 2
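
If you later want to drop the <br> from the last line of each paragraph (your PS 1), one possible sketch is to hold each line back until the next one has been read. The function name add_br is made up for illustration, and the sketch assumes a line only needs <br> when both it and the following line are ordinary paragraph lines:

```shell
# add_br is a hypothetical helper name; it reads the files given as
# arguments, or stdin when called without arguments.
add_br() {
  awk '
    # A "plain" line is one that is not blank and does not start
    # (after optional whitespace) with "-" or "#".
    function plain(s) { return s !~ /^[[:space:]]*([-#]|$)/ }

    # Print the previous line once the current one is known, adding <br>
    # only when both are plain paragraph lines.
    NR > 1 { print prev ((plain(prev) && plain($0)) ? "<br>" : "") }
    { prev = $0 }

    # The final line has no successor, so it never gets <br>.
    END { if (NR) print prev }
  ' "$@"
}

# Example call on a small inline sample.
printf '%s\n' '# Title' '' 'paragraph,' 'with text over multiple lines' '' '- list item' | add_br
```

Because it is a plain shell function wrapping POSIX awk, it can be pasted into a bash script and called as add_br file.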

Original answer:

$ sed 's/^[^-#].*/&<br>/' file
# Title

paragraph,<br>
with text over multiple lines<br>

- list item
- other list item

If that's not all you need, then edit your question to provide a more representative example that includes cases where the above doesn't work.