How does this 'sed' substitution command with lots of @ signs work?

Solution 1:

In sed, substitute commands are usually written as s/pattern/replacement/options. However, it's not necessary to use / - you can use other characters if it is convenient, so it could be s@pattern@replacement@options or s:foo:bar:g. s@+@ @g is like s/+/ /g - replace all + with spaces. Similarly s@%@\\x@g replaces all % with \x (a single backslash is an escape character in sed, so you need two to get an actual backslash).

A string like foo+%2Fbar will then become foo \x2Fbar. printf "%b" will expand the backslash-escaped sequences like \x2F (the ASCII character whose hexadecimal value is 2F, which is /) to finally give you foo /bar.
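
The two steps can be tried separately. This is a minimal sketch, assuming bash (or another shell whose printf understands \xHH escapes in %b, which POSIX does not require); the variable names escaped and decoded are only for illustration:

```shell
# Step 1: sed turns every + into a space and every % into \x.
escaped=$(printf '%s\n' 'foo+%2Fbar' | sed 's@+@ @g;s@%@\\x@g')
printf '%s\n' "$escaped"    # foo \x2Fbar

# Step 2: %b expands the backslash escape \x2F into the character /.
decoded=$(printf '%b' "$escaped")
printf '%s\n' "$decoded"    # foo /bar
```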

Solution 2:

The command you're asking about, which decodes +es and % sequences from URLs, is not just a sed command: it's a pipeline that processes input with sed, then pipes the result to xargs for further processing. First, let's look at the sed command:

sed 's@+@ @g;s@%@\\x@g'

You may be more accustomed to seeing it with / rather than @ as the separator, which could easily have been done here without complication since / appears in neither of the search patterns nor either of the replacement texts. This command is equivalent:

sed 's/+/ /g;s/%/\\x/g'

Like /, @ is a perfectly good delimiter for sed's s command.
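
A quick way to convince yourself that the delimiter makes no difference is to run the same script with several of them and compare the results. The sample input and variable names below are made up for illustration:

```shell
input='a+b%20c'

# The same two substitutions, written with three different delimiters.
with_at=$(printf '%s\n' "$input" | sed 's@+@ @g;s@%@\\x@g')
with_slash=$(printf '%s\n' "$input" | sed 's/+/ /g;s/%/\\x/g')
with_colon=$(printf '%s\n' "$input" | sed 's:+: :g;s:%:\\x:g')

# All three lines print: a b\x20c
printf '%s\n' "$with_at" "$with_slash" "$with_colon"
```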

On each line of input:

  1. s@+@ @g (s/+/ /g) substitutes (s) occurrences of + with a space. This affects all +es on a line (g), not just the first one.

  2. ; ends the action ("command") and allows you to specify another one in the same "script."

  3. s@%@\\x@g (s/%/\\x/g) substitutes (s) occurrences of % with \x. As before, it acts on all rather than just the first of each line (g).

    In \\x, the \\ represents a single literal \. On its own, \ is special to sed: it is the escape character, which changes or removes the special meaning of the character that follows it. To get a literal backslash in the replacement, the backslash itself must be escaped as \\.
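
The three points above can be checked one at a time. This is a small sketch with made-up sample strings; it shows the effect of the g flag, and that joining two commands with ; behaves like passing them as separate -e scripts:

```shell
# Without g only the first + on the line is replaced; with g all of them are.
first_only=$(printf '%s\n' 'a+b+c' | sed 's@+@ @')
every=$(printf '%s\n' 'a+b+c' | sed 's@+@ @g')

# Two commands in one script (joined by ;) versus two -e scripts.
one_script=$(printf '%s\n' 'a+b%2Bc' | sed 's@+@ @g;s@%@\\x@g')
two_scripts=$(printf '%s\n' 'a+b%2Bc' | sed -e 's@+@ @g' -e 's@%@\\x@g')

printf '%s\n' "$first_only"   # a b+c
printf '%s\n' "$every"        # a b c
printf '%s\n' "$one_script"   # a b\x2Bc
printf '%s\n' "$two_scripts"  # a b\x2Bc
```

Because the + substitution runs before the % one, an encoded plus sign (%2B) is not turned into a space; it only becomes \x2B, which is decoded later.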


Now let's look at the xargs command, whose purpose is to run printf.

xargs constructs command lines. If you run xargs command..., where command... is one or more words, xargs runs command... with additional command-line arguments read from its input. In this case, the input to xargs is the output of sed, because of the pipe (|). Normally xargs interprets any whitespace in its input to mean that the text before and after it constitutes separate arguments, but the -0 option makes it split arguments at occurrences of the null character instead.
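
The difference -0 makes is easy to see with a format that brackets each argument. This sketch assumes an xargs that supports -0 (GNU and BSD do; it is not required by POSIX), and the sample input is made up:

```shell
# Default: whitespace separates arguments, so printf receives three of them
# and reuses its format once per argument.
split=$(printf 'a b c' | xargs printf '[%s]\n')

# With -0: only a null byte separates arguments, so the text arrives as one.
whole=$(printf 'a b c' | xargs -0 printf '[%s]\n')

printf '%s\n' "$split"   # [a], [b] and [c] on separate lines
printf '%s\n' "$whole"   # [a b c]
```

When the input comes from sed, its trailing newline is part of that single argument too, which is why the decoded output still ends with a newline.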

In the intended use of your command, a null character won't appear and xargs will run printf %b with just one additional command-line argument, the output of the sed command. Thus, while not equivalent in general, in this case the whole pipeline might instead have been written like this using command substitution instead of xargs:

printf '%b\n' "$(sed 's/+/ /g;s/%/\\x/g')"

As for what printf is intended to do here: as muru says, the %b format specifier consumes and prints an argument (like %s) but causes backslash escapes--of the sort the sed command on the left side of the pipe was written to generate--to be translated into the characters they represent.

Suppose I run that command and pass http://foldoc.org/debugging%20by%20printf as input. I get http://foldoc.org/debugging by printf as output, because the %20 sequences are translated into spaces.
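
Here is that example end to end. The full pipeline from the question is not quoted in this answer, so the xargs form below is reconstructed from the description above and should be read as an assumption. It also assumes a printf that understands \xHH in %b (bash's builtin and GNU printf do):

```shell
url='http://foldoc.org/debugging%20by%20printf'

# Pipeline form (assumed shape): sed rewrites the text, xargs -0 passes it
# to printf as a single argument.
via_xargs=$(printf '%s\n' "$url" | sed 's@+@ @g;s@%@\\x@g' | xargs -0 printf '%b')

# Command-substitution form, as written earlier in this answer.
via_subst=$(printf '%b\n' "$(printf '%s\n' "$url" | sed 's/+/ /g;s/%/\\x/g')")

# Both lines print: http://foldoc.org/debugging by printf
printf '%s\n' "$via_xargs" "$via_subst"
```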

Solution 3:

That’s the beauty of sed, it applies its paradigms to itself... After the command (such as s or y), the next character is taken as the separator.
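
As a small illustration (the sample word and variable names are made up), here a comma serves as the separator for both s and y, sed's transliteration command:

```shell
# s with , as the separator: replace every l with L.
substituted=$(printf '%s\n' 'hello' | sed 's,l,L,g')

# y with , as the separator: map e to i and l to p.
transliterated=$(printf '%s\n' 'hello' | sed 'y,el,ip,')

printf '%s\n' "$substituted"      # heLLo
printf '%s\n' "$transliterated"   # hippo
```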

You should choose wisely to avoid interference with the shell and the command itself, and keep the thing readable, but it’s perfectly valid to write something as horrid as:

echo 'arrival' | sed srarbrg

...and get brrivbl as a result, which is what you expect. You can have fun making it really cryptic, such as in:

echo 'arrival' | sed $'s\fa\fb\fg'   # bash $'...' quoting turns \f into a form feed, chr(12); unquoted, the shell would reduce \f to a plain f

The slash is the usual delimiter, but when your expression itself contains slashes, picking a different delimiter makes the intent easier to grasp. Your delimiter can be any single-byte character other than a backslash or a newline (multibyte delimiters such as £ provoke an error).
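
File paths are the typical case. In this sketch (the path is a made-up example) both commands do the same substitution, but only the first needs every slash escaped:

```shell
path='/usr/local/bin/tool'

# With / as the delimiter, each slash in the pattern and replacement
# has to be escaped.
escaped_slashes=$(printf '%s\n' "$path" | sed 's/\/usr\/local/\/opt/')

# With @ as the delimiter, the slashes are ordinary characters.
plain=$(printf '%s\n' "$path" | sed 's@/usr/local@/opt@')

# Both lines print: /opt/bin/tool
printf '%s\n' "$escaped_slashes" "$plain"
```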

Just remember the goal is to make things easier, not more cryptic.