Replace "\n" depending on how the line started
I am trying to replace line endings ('\n') with multiple characters ("<br>\n") (so no using tr) in a file, but only for certain lines, depending on how they start.
What I want :
Document in :
# Title
paragraph,
with text over multiple lines
- list item
- other list item
- sublist item
- sublist item 2
Output :
# Title
paragraph,<br>
with text over multiple lines<br>
- list item
- other list item
- sublist item
- sublist item 2
You probably guessed it: I am trying to force a line break on single newlines (in paragraphs) when I later convert my markdown document to HTML.
What I tried/know
I have looked up regex syntax and the basics of the 'sed' command, so my understanding is that I need a negative lookbehind to avoid matching a particular beginning, then maybe a non-capturing group or positive lookbehind for the content of the line, then the actual match on \n and the thing I want to replace it with.
If you consider an example where I'd only exclude lines starting with (1 or more '#' followed by a space) or ([maybe space] then [a dash] then [one space]), what I'm currently using is :
#DOESN'T WORK (and ~/Documents/test/foo contains exactly the example I put above)
sed -z 's/(?<!#+\s|\s*-\s)(?:[^\n]*)\n/<br>\n/g' ~/Documents/test/foo
What I understand of my command
Maybe my understanding of sed/regexes in a bash context is wrong so I'll explain how I understand what I wrote :
sed -z # -z flag to treat the document as one big string, apparently good when
# dealing with newlines replacement (https://linuxhint.com/newline_replace_sed/)
# string with the command
' #'
s/ # s(ubstitute) command in sed
(?<!#+\s|\s*-\s) # negative lookbehind ignoring '#+\s' (one or more '#' followed by a
# space) and '\s*-\s' (0 or more spaces, a dash then a space)
(?:[^\n]*) # non-capturing group matching the content of the line (literally anything
# but newline, 0 or more times) because there is something between the
# beginning I want to ignore and the ending I want to replace (hence group),
# and I do not want this to be replaced (hence non-capturing).
\n # the thing I'm matching on, and replacing
/ # separator to announce the replacement for the match
<br>\n # replacement
/g # g flag because I want to replace all matching occurrences
' #'
# end of the command string
~/Documents/test/foo # my input/source
I've tried using grep -o, which has a lighter syntax, to show the matches (and try to correct my regex), but the '!' in the negative lookbehind keeps screwing things up and the -F flag doesn't seem to fix it.
Any help is appreciated, and as long as you can provide an example capable of ignoring either "# Title\n" or " - list item\n" when replacing line ends, I'll figure out how to expand it.
PS
- Yes I know that leaving '<br>' on the last line of a paragraph will look bad but I'll fix that later, let's not make regexes insanely long in this small example.
- While it's true that my example is run on the command line, this is for use in a bash script (hence the tag), so answers should be compatible with bash or explain why they're not (I'm not very familiar with the differences, but I read here that some standards aren't shared)
- My environment is Pop!_OS 21.10 (but that shouldn't matter, should it?)
Thanks in advance
Solution 1:
Your regexp is using PCRE constructs (e.g. lookbehinds), but sed doesn't support PCREs, just the POSIX regexp standards: BREs by default, and EREs if called with -E in GNU or BSD sed.
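If you do want to stay with sed, the lookbehind can usually be replaced by a negated address: `/regex/!` runs the following command only on lines that do not match. The following is a minimal sketch of that idea using EREs, not a drop-in for every markdown edge case; the function name `add_br_sed` is made up for illustration, and `-E` assumes GNU or BSD sed.

```shell
# add_br_sed is a hypothetical helper name, used only for this illustration.
add_br_sed() {
  # Address: optional leading whitespace, then '-' or '#', or an empty line.
  # The '!' negates the address, so s/$/<br>/ only runs on the other lines.
  sed -E '/^[[:space:]]*([-#]|$)/!s/$/<br>/' "$@"
}

# Sample input piped on stdin; with no file arguments sed reads stdin.
printf '%s\n' '# Title' 'paragraph,' 'with text over multiple lines' \
  '- list item' '  - sublist item' | add_br_sed
```

Because sed already works line by line, there is no need for `-z` or for matching `\n` at all: appending `<br>` at `$` has the same effect.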
I'd use awk for this for simplicity and portability. The following will work using any POSIX awk:
$ awk '{print $0 (/^[[:space:]]*([-#]|$)/ ? "" : "<br>")}' file
# Title
paragraph,<br>
with text over multiple lines<br>
- list item
- other list item
- sublist item
- sublist item 2
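Since this is meant for a bash script, the awk call can be wrapped in a function. The sketch below also tightens the pattern to the rule described in the question (one or more '#' followed by a space, or optional whitespace, a dash and a space), so that lines such as `#tag` or `-5 degrees` are still treated as paragraph text. The function name `md_add_br` and the file names in the usage note are assumptions for illustration only.

```shell
#!/usr/bin/env bash

# md_add_br is a hypothetical name; it reads the given files, or stdin if none.
md_add_br() {
  # Leave headings ("#+ "), list items ("- " after optional whitespace) and
  # blank lines untouched; append <br> to every other line.
  awk '{ print $0 (/^(#+ |[[:space:]]*- |[[:space:]]*$)/ ? "" : "<br>") }' "$@"
}

# Sample input piped on stdin.
printf '%s\n' '## Sub' '#tag line' '-5 degrees' ' - item' 'text' | md_add_br
```

In a script this could be called as `md_add_br input.md > output.md` before running the markdown-to-HTML conversion.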
Original answer:
$ sed 's/^[^-#].*/&<br>/' file
# Title
paragraph,<br>
with text over multiple lines<br>
- list item
- other list item
If that's not all you need then edit your question to provide a more truly representative example that includes cases where the above doesn't work.