Why is the character "£" in a string interpreted strangely by the cut command?
I'm developing a bash script and came across the following strange behaviour:
$ echo £ |cut -c 1
�
The £ sign is passed to the next command, cut, whose filter picks one character only. When I modify the filter in the cut command to pick 2 characters, the £ is passed through:
$ echo £ |cut -c 1-2
£
Not a severe problem, I have a workaround solution in the script, but why does the filter in the cut command require 2 positions instead of 1 when picking a £ sign?
Solution 1:
The cut command in Ubuntu is not multi-byte character aware. For this version of cut, a character is the same as a byte.
The pound sign (£) is encoded in UTF-8 as two bytes (c2 and a3):
$ echo £ | od -t x1
0000000 c2 a3 0a
0000003
Note: The 0a byte is the newline (the ASCII "Line Feed" character).
When you cut the first character from the line, you select only the c2 byte of £, which on its own is not a valid UTF-8 sequence. As a result, you get the strange question mark � (the replacement character) on screen:
$ echo £ | cut -c 1 | od -t x1
0000000 c2 0a
0000002
Note: The above was tested with the latest version of cut in Ubuntu 20.10 (GNU coreutils version 8.32).
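The byte-level view can also be reproduced independently of the locale with cut -b, which selects bytes by definition. The following is only a minimal sketch, assuming a POSIX shell with printf, cut, od and tr available; the helper name hex_of_bytes is made up for illustration, and the octal escapes \302\243 are the two UTF-8 bytes (c2 a3) of £:

```shell
# hex_of_bytes RANGE: print, as hex, the bytes that `cut -b RANGE`
# keeps from a line containing only the pound sign.
# (hex_of_bytes is a hypothetical helper name, not part of coreutils.)
hex_of_bytes() {
    # \302\243 is £ in UTF-8, written in octal so the script does not
    # depend on the encoding of the file it is saved in.
    printf '\302\243\n' | cut -b "$1" | od -An -t x1 | tr -d ' \n'
    echo
}

hex_of_bytes 1      # prints c20a   (half of the character, plus newline)
hex_of_bytes 1-2    # prints c2a30a (the complete character, plus newline)
```

Selecting byte 1 alone leaves the lone c2 that the terminal cannot display, while bytes 1-2 keep the character intact.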
If you want to select multi-byte characters, you can use the grep command (GNU grep version 3.4) like this:
$ echo x£β | grep -o '^.'
x
$ echo x£β | grep -o '^..'
x£
$ echo x£β | grep -o '^...'
x£β
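If the result should not depend on the current locale, one possible alternative is to convert the text to a fixed-width encoding first, so that counting bytes becomes equivalent to counting characters. This is only a sketch under assumptions: it relies on iconv and on head -c being available, the helper name first_chars is made up for illustration, and it counts Unicode code points (a letter followed by a combining accent counts as two):

```shell
# first_chars N STRING: print the first N code points of a UTF-8 STRING.
# (first_chars is a hypothetical helper name used for illustration.)
#
# Steps in the pipeline:
#   1. convert UTF-8 to UTF-32BE, where every code point is 4 bytes
#   2. keep the first N*4 bytes, i.e. the first N code points
#   3. convert the result back to UTF-8
first_chars() {
    printf '%s' "$2" |
        iconv -f UTF-8 -t UTF-32BE |
        head -c "$(( $1 * 4 ))" |
        iconv -f UTF-32BE -t UTF-8
}

first_chars 1 'x£β'; echo    # prints x
first_chars 2 'x£β'; echo    # prints x£
```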
This answer was improved with the help of the comments.
Solution 2:
In UTF-8 encoding, the hex value of £ is 0xC2 0xA3 (c2a3), which is 11000010 10100011 in binary.
So it's two bytes (like two characters). cut -c considers each byte a character, which produces �.
$ echo -n £ | xxd
00000000: c2a3 ..
$ echo -n £ | wc --bytes
2
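Where the two bytes come from can be checked with shell arithmetic. The sketch below assumes the Unicode code point of £, U+00A3 (not stated above), and applies the UTF-8 two-byte layout 110xxxxx 10xxxxxx; the helper name utf8_two_bytes is made up for illustration and only handles code points from U+0080 to U+07FF:

```shell
# utf8_two_bytes CODEPOINT: print the two UTF-8 bytes, in hex, for a
# code point in the two-byte range U+0080..U+07FF.
# (utf8_two_bytes is a hypothetical helper name used for illustration.)
utf8_two_bytes() {
    cp=$(( $1 ))
    # Lead byte: bit pattern 110xxxxx, carrying the upper 5 payload bits.
    b1=$(( 0xC0 | (cp >> 6) ))
    # Continuation byte: bit pattern 10xxxxxx, carrying the lower 6 bits.
    b2=$(( 0x80 | (cp & 0x3F) ))
    printf '%02x %02x\n' "$b1" "$b2"
}

utf8_two_bytes 0xA3    # £ is U+00A3, prints c2 a3
```

The leading bits 110 and 10 are exactly the prefixes visible in the binary form 11000010 10100011 shown above.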