Is it possible to escape regex metacharacters reliably with sed
I'm wondering whether it is possible to write a 100% reliable sed command to escape any regex metacharacters in an input string so that it can be used in a subsequent sed command. Like this:
#!/bin/bash
# Trying to replace one regex by another in an input file with sed
search="/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3"
replace="/xyz\n\t[0-9]\+\([^ ]\)\{2,3\}\3"
# Sanitize input
search=$(sed 'script to escape' <<< "$search")
replace=$(sed 'script to escape' <<< "$replace")
# Use it in a sed command
sed "s/$search/$replace/" input
I know that there are better tools to work with fixed strings instead of patterns, for example awk, perl or python. I would just like to prove whether it is possible or not with sed. I would say let's concentrate on basic POSIX regexes to have even more fun! :)
I have tried a lot of things, but every time I could find an input that broke my attempt. I thought that keeping it abstract as 'script to escape' would not lead anybody in the wrong direction.
Btw, the discussion came up here. I thought this could be a good place to collect solutions and probably break and/or elaborate them.
Solution 1:
Note:
- If you're looking for prepackaged functionality based on the techniques discussed in this answer:
  - bash functions that enable robust escaping even in multi-line substitutions can be found at the bottom of this post (plus a perl solution that uses perl's built-in support for such escaping).
  - @EdMorton's answer contains a tool (bash script) that robustly performs single-line substitutions.
    - Ed's answer now has an improved version of the sed command used below, which is needed if you want to escape string literals for potential use with other regex-processing tools, such as awk and perl. In short: for cross-tool use, \ must be escaped as \\ rather than as [\], which means: instead of the sed 's/[^^]/[&]/g; s/\^/\\^/g' command used below, you must use sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g'
- All snippets assume bash as the shell (POSIX-compliant reformulations are possible):
SINGLE-line Solutions
Escaping a string literal for use as a regex in sed:
To give credit where credit is due: I found the regex used below in this answer.
Assuming that the search string is a single-line string:
search='abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3' # sample input containing metachars.
searchEscaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$search") # escape it.
sed -n "s/$searchEscaped/foo/p" <<<"$search" # if ok, echoes 'foo'
- Every character except ^ is placed in its own character set [...] expression to treat it as a literal.
  - Note that ^ is the one char. you cannot represent as [^], because it has special meaning in that location (negation).
- Then, ^ chars. are escaped as \^.
  - Note that you cannot just escape every char by putting a \ in front of it because that can turn a literal char into a metachar, e.g. \< and \b are word boundaries in some tools, \n is a newline, \{ is the start of a RE interval like \{1,3\}, etc.
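As a rough illustration of the two steps above, the following sketch escapes a short sample string and then uses the result as a regex. The sample value a.b^c and the variable names hit and miss are made up for this example; printf is used instead of <<< only to keep the snippet independent of the shell.

```bash
# Hypothetical sample input: '.' and '^' are regex metacharacters.
search='a.b^c'

# Step 1 brackets every char except '^'; step 2 backslash-escapes each '^'.
searchEscaped=$(printf '%s\n' "$search" | sed 's/[^^]/[&]/g; s/\^/\\^/g')
printf '%s\n' "$searchEscaped"   # prints: [a][.][b]\^[c]

# The escaped form matches the literal string ...
hit=$(printf '%s\n' "$search" | sed -n "s/$searchEscaped/foo/p")
# ... but not a string that would only match if '.' acted as a wildcard.
miss=$(printf '%s\n' 'axb^c' | sed -n "s/$searchEscaped/foo/p")
printf 'hit=%s miss=%s\n' "$hit" "$miss"   # prints: hit=foo miss=
```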
The approach is robust, but not efficient.
The robustness comes from not trying to anticipate all special regex characters, which vary across regex dialects, but instead relying on only 2 features shared by all regex dialects:
- the ability to specify literal characters inside a character set.
- the ability to escape a literal ^ as \^
Escaping a string literal for use as the replacement string in sed's s/// command:
The replacement string in a sed s/// command is not a regex, but it recognizes placeholders that refer to either the entire string matched by the regex (&) or specific capture-group results by index (\1, \2, ...), so these must be escaped, along with the (customary) regex delimiter, /.
Assuming that the replacement string is a single-line string:
replace='Laurel & Hardy; PS\2' # sample input containing metachars.
replaceEscaped=$(sed 's/[&/\]/\\&/g' <<<"$replace") # escape it
sed -n "s/\(.*\) \(.*\)/$replaceEscaped/p" <<<"foo bar" # if ok, outputs $replace as is
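To make the effect of the escaping visible, here is a small sketch that prints the escaped form of a replacement string before using it. The sample value (a path plus & and \1) and the variable name result are assumptions chosen for illustration only.

```bash
# Hypothetical replacement text containing all three special characters.
replace='/usr/local & \1'

# Prefix each '&', '/' and '\' with a backslash.
replaceEscaped=$(printf '%s\n' "$replace" | sed 's/[&/\]/\\&/g')
printf '%s\n' "$replaceEscaped"   # prints: \/usr\/local \& \\1

# Used as a replacement, the escaped form reproduces the original text.
result=$(printf '%s\n' 'foo bar' | sed "s/\(.*\) \(.*\)/$replaceEscaped/")
printf '%s\n' "$result"           # prints: /usr/local & \1
```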
MULTI-line Solutions
Escaping a MULTI-LINE string literal for use as a regex in sed:
Note: This only makes sense if multiple input lines (possibly ALL) have been read before attempting to match.
Since tools such as sed and awk operate on a single line at a time by default, extra steps are needed to make them read more than one line at a time.
# Define sample multi-line literal.
search='/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3
/def\n\t[A-Z]\+\([^ ]\)\{3,4\}\4'
# Escape it.
searchEscaped=$(sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$search" | tr -d '\n') #'
# Use in a Sed command that reads ALL input lines up front.
# If ok, echoes 'foo'
sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$searchEscaped/foo/p" <<<"$search"
- The newlines in multi-line input strings must be translated to '\n' strings, which is how newlines are encoded in a regex.
- $!a\'$'\n''\\n' appends string '\n' to every output line but the last (the last newline is ignored, because it was added by <<<)
- tr -d '\n' then removes all actual newlines from the string (sed adds one whenever it prints its pattern space), effectively replacing all newlines in the input with '\n' strings.
- -e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines in a loop, therefore leaving subsequent commands to operate on all input lines at once.
  - If you're using GNU sed (only), you can use its -z option to simplify reading all input lines at once: sed -z "s/$searchEscaped/foo/" <<<"$search"
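The following sketch shows what the escaped multi-line regex looks like for a small two-line sample. The sample text and the variable name result are hypothetical, and the a\ command is spelled with a literal line break inside the single quotes, which should be equivalent to the $'\n' spelling used above.

```bash
# Hypothetical two-line literal; '.' and '*' are regex metacharacters.
search='a.b
c*d'

# Escape each line, append the two characters \n after every line but
# the last, then delete the real newlines.
searchEscaped=$(printf '%s\n' "$search" |
  sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\
\\n' | tr -d '\n')
printf '%s\n' "$searchEscaped"   # prints: [a][.][b]\n[c][*][d]

# Read all input lines first, then match across the line break.
result=$(printf '%s\n' "$search" |
  sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$searchEscaped/foo/p")
printf '%s\n' "$result"          # prints: foo
```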
Escaping a MULTI-LINE string literal for use as the replacement string in sed's s/// command:
# Define sample multi-line literal.
replace='Laurel & Hardy; PS\2
Masters\1 & Johnson\2'
# Escape it for use as a Sed replacement string.
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$replace")
replaceEscaped=${REPLY%$'\n'}
# If ok, outputs $replace as is.
sed -n "s/\(.*\) \(.*\)/$replaceEscaped/p" <<<"foo bar"
- Newlines in the input string must be retained as actual newlines, but \-escaped.
- -e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines in a loop.
- 's/[&/\]/\\&/g escapes all &, \ and / instances, as in the single-line solution.
- s/\n/\\&/g' then \-prefixes all actual newlines.
- IFS= read -d '' -r is used to read the sed command's output as is (to avoid the automatic removal of trailing newlines that a command substitution ($(...)) would perform).
- ${REPLY%$'\n'} then removes a single trailing newline, which the <<< has implicitly appended to the input.
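The difference between a command substitution and read -d '' can be seen in isolation with a minimal bash sketch. The sample value and the variable names viaSubst and viaRead are made up for the demonstration.

```bash
#!/bin/bash
# Hypothetical value that ends in two newlines.
value=$'line1\n\n'

# Command substitution strips ALL trailing newlines.
viaSubst=$(printf '%s' "$value")

# read -d '' reads up to a NUL byte (here: end of input) and keeps them.
# read returns non-zero at end of input, hence the '|| true'.
IFS= read -d '' -r viaRead < <(printf '%s' "$value") || true

printf '%d %d\n' "${#viaSubst}" "${#viaRead}"   # prints: 5 7
```

The || true only matters if the surrounding script runs with set -e; the variable is filled either way.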
bash functions based on the above (for sed):
- quoteRe() quotes (escapes) for use in a regex
- quoteSubst() quotes for use in the substitution string of a s/// call.
- both handle multi-line input correctly
  - Note that because sed reads a single line at a time by default, use of quoteRe() with multi-line strings only makes sense in sed commands that explicitly read multiple (or all) lines at once.
  - Also, using command substitutions ($(...)) to call the functions won't work for strings that have trailing newlines; in that event, use something like IFS= read -d '' -r escapedValue < <(quoteSubst "$value")
# SYNOPSIS
# quoteRe <text>
quoteRe() { sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$1" | tr -d '\n'; }
# SYNOPSIS
# quoteSubst <text>
quoteSubst() {
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1")
printf %s "${REPLY%$'\n'}"
}
Example:
from=$'Cost\(*):\n$3.' # sample input containing metachars.
to='You & I'$'\n''eating A\1 sauce.' # sample replacement string with metachars.
# Should print the unmodified value of $to
sed -e ':a' -e '$!{N;ba' -e '}' -e "s/$(quoteRe "$from")/$(quoteSubst "$to")/" <<<"$from"
Note the use of -e ':a' -e '$!{N;ba' -e '}' to read all input at once, so that the multi-line substitution works.
perl solution:
Perl has built-in support for escaping arbitrary strings for literal use in a regex: the quotemeta() function or its equivalent \Q...\E quoting.
The approach is the same for both single- and multi-line strings; for example:
from=$'Cost\(*):\n$3.' # sample input containing metachars.
to='You owe me $1/$& for'$'\n''eating A\1 sauce.' # sample replacement string w/ metachars.
# Should print the unmodified value of $to.
# Note that the replacement value needs NO escaping.
perl -s -0777 -pe 's/\Q$from\E/$to/' -- -from="$from" -to="$to" <<<"$from"
- Note the use of -0777 to read all input at once, so that the multi-line substitution works.
- The -s option allows placing -<var>=<val>-style Perl variable definitions following -- after the script, before any filename operands.
Solution 2:
Building upon @mklement0's answer in this thread, the following tool will replace any single-line string (as opposed to regexp) with any other single-line string using sed and bash:
$ cat sedstr
#!/bin/bash
old="$1"
new="$2"
file="${3:--}"
escOld=$(sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g' <<< "$old")
escNew=$(sed 's/[&/\]/\\&/g' <<< "$new")
sed "s/$escOld/$escNew/g" "$file"
To illustrate the need for this tool, consider trying to replace a.*/b{2,}\nc with d&e\1f by calling sed directly:
$ cat file
a.*/b{2,}\nc
axx/bb\nc
$ sed 's/a.*/b{2,}\nc/d&e\1f/' file
sed: -e expression #1, char 16: unknown option to `s'
$ sed 's/a.*\/b{2,}\nc/d&e\1f/' file
sed: -e expression #1, char 23: invalid reference \1 on `s' command's RHS
$ sed 's/a.*\/b{2,}\nc/d&e\\1f/' file
a.*/b{2,}\nc
axx/bb\nc
# .... and so on, peeling the onion ad nauseam until:
$ sed 's/a\.\*\/b{2,}\\nc/d\&e\\1f/' file
d&e\1f
axx/bb\nc
or use the above tool:
$ sedstr 'a.*/b{2,}\nc' 'd&e\1f' file
d&e\1f
axx/bb\nc
The reason this is useful is that it can be easily augmented to use word-delimiters to replace words if necessary, e.g. in GNU sed syntax:
sed "s/\<$escOld\>/$escNew/g" "$file"
whereas the tools that actually operate on strings (e.g. awk's index()) cannot use word-delimiters.
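A minimal sketch of that word-delimited variant, assuming GNU sed (\< and \> are GNU extensions) and using made-up sample values:

```bash
# Hypothetical values; escaping as in the sedstr script above.
old='cat'
new='dog'
escOld=$(printf '%s\n' "$old" | sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g')
escNew=$(printf '%s\n' "$new" | sed 's/[&/\]/\\&/g')

# Only whole-word occurrences are replaced; 'concat' is left alone.
result=$(printf '%s\n' 'cat concat cat.' | sed "s/\<$escOld\>/$escNew/g")
printf '%s\n' "$result"   # prints: dog concat dog.
```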
NOTE: the reason to not wrap \ in a bracket expression is that if you were using a tool that accepts [\]] as a literal ] inside a bracket expression (e.g. perl and most awk implementations) to do the actual final substitution (i.e. instead of sed "s/$escOld/$escNew/g") then you couldn't use the approach of:
sed 's/[^^]/[&]/g; s/\^/\\^/g'
to escape \ by enclosing it in [] because then \x would become [\][x] which means \ or ] or [ or x. Instead you'd need:
sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g'
So while [\] is probably OK for all current sed implementations, we know that \\ will work for all sed, awk, perl, etc. implementations and so use that form of escaping.
Solution 3:
It should be noted that the regular expression used in some of the answers above (this one and that one):
's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g'
seems to be wrong:
- Doing first s/\^/\\^/g followed by s/\\/\\\\/g is an error, as any ^ escaped first to \^ will then have its \ escaped again. A better way seems to be: 's/[^\^]/[&]/g; s/[\^]/\\&/g;'.
- [^^\\] with sed (BRE/ERE) should be just [^\^] (or [^^\]). \ has no special meaning inside a bracket expression and does not need to be quoted.
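The double-escaping problem can be reproduced with a short sketch. The sample value a^b and the variable names are made up, and the matching behaviour shown was reasoned through for GNU sed with basic regular expressions.

```bash
old='a^b'   # hypothetical sample containing a literal ^

# Three-step command: the \ added in front of ^ is itself escaped again.
buggy=$(printf '%s\n' "$old" | sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g')
# Suggested form: ^ and \ are both backslash-escaped in a single pass.
fixed=$(printf '%s\n' "$old" | sed 's/[^\^]/[&]/g; s/[\^]/\\&/g')
printf '%s\n' "$buggy" "$fixed"
# prints:
#   [a]\\^[b]
#   [a]\^[b]

# Only the second form still matches the original string.
buggyHit=$(printf '%s\n' "$old" | sed -n "s/$buggy/HIT/p")
fixedHit=$(printf '%s\n' "$old" | sed -n "s/$fixed/HIT/p")
printf 'buggy=%s fixed=%s\n' "$buggyHit" "$fixedHit"   # prints: buggy= fixed=HIT
```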