Why is the character "£" in a string interpreted strangely by the cut command?

I'm developing a bash script and came across the following strange behaviour:

$ echo £ |cut -c 1
�

The £ sign is piped to the cut command, whose filter picks one character only.

When I change the filter in the cut command to pick 2 characters, the £ is passed through:

$ echo £ |cut -c 1-2
£

Not a severe problem, I have a workaround solution in the script, but why does the filter in the cut command require 2 positions instead of 1 when picking a £ sign?


Solution 1:

The cut command in Ubuntu is not multi-byte character aware. For this version of cut, the -c option selects bytes, so a "character" is the same as a byte.

The pound sign (£) is encoded in UTF-8 as two bytes (c2 and a3):

$ echo £ | od -t x1
0000000 c2 a3 0a
0000003

Note: The 0a byte is the newline (the ASCII "Line Feed" character) that echo appends.

When you cut the first character from the line, you are selecting only the c2 byte of £, which on its own is not a valid UTF-8 sequence. As a result the terminal shows the strange question mark (the replacement character, �):

$ echo £ | cut -c 1 | od -t x1
0000000 c2 0a
0000002

Note: The above was tested with the latest version of cut in Ubuntu 20.10 (GNU coreutils version 8.32).
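
To check how a particular cut behaves, its -c output can be compared with -b, which selects bytes by definition. This is only a minimal sketch; the helper name first_byte_hex is made up for illustration, and printf with octal escapes is used so the input bytes do not depend on the terminal's encoding:

```shell
#!/bin/sh
# first_byte_hex is a hypothetical helper: it keeps the first byte of each
# input line with `cut -b 1` and prints the result as compact hex.
first_byte_hex() {
    cut -b 1 | od -An -t x1 | tr -d ' \n'
}

# \302\243 are the two UTF-8 bytes of £ (c2 a3), followed by a newline.
printf '\302\243\n' | first_byte_hex
echo
```

This prints c20a. If replacing -b with -c in the helper gives the same result on your system, that cut treats characters as bytes.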

If you want to select multi-byte characters, you can use the grep command (tested with GNU grep version 3.4) like this:

$ echo x£β | grep -o '^.'
x
$ echo x£β | grep -o '^..'
x£
$ echo x£β | grep -o '^...'
x£β
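
Inside a bash script, substring expansion may be an alternative to piping through grep. A minimal sketch, assuming a UTF-8 locale is active (in the C/POSIX locale bash slices bytes, just like cut); first_chars is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# first_chars is a hypothetical helper: print the first N characters of a string.
# Assumption: a UTF-8 locale is active; otherwise bash counts bytes, not characters.
first_chars() {
    local str=$1 n=$2
    printf '%s\n' "${str:0:n}"
}

first_chars 'x£β' 2
```

With a UTF-8 locale this should print x£, because bash then counts characters rather than bytes in ${str:0:n}.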

This answer was improved with the help of the comments.

Solution 2:

In UTF-8 encoding, £ is stored as the bytes 0xC2 0xA3 (c2a3), which is 11000010 10100011 in binary.

So it's two bytes (like two characters). cut -c treats each byte as a character, so cut -c 1 outputs only the first byte, which the terminal displays as �.
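
The binary form also shows why the first byte alone is incomplete: in UTF-8 the leading bits of the first byte announce how long the sequence is (110xxxxx means two bytes), and 10xxxxxx marks a continuation byte. A rough sketch of that rule in shell arithmetic; utf8_seq_len is a hypothetical helper and it only looks at the bit pattern, it does not fully validate UTF-8:

```shell
#!/bin/sh
# utf8_seq_len is a hypothetical helper: given the numeric value of a byte,
# print the length of the UTF-8 sequence that byte would start,
# or 0 if the byte cannot start a sequence.
utf8_seq_len() {
    b=$(($1))
    if   [ $((b & 0x80)) -eq 0 ];         then echo 1   # 0xxxxxxx: ASCII
    elif [ $((b & 0xE0)) -eq $((0xC0)) ]; then echo 2   # 110xxxxx
    elif [ $((b & 0xF0)) -eq $((0xE0)) ]; then echo 3   # 1110xxxx
    elif [ $((b & 0xF8)) -eq $((0xF0)) ]; then echo 4   # 11110xxx
    else echo 0                                         # 10xxxxxx or invalid
    fi
}

utf8_seq_len 0xC2   # first byte of £
utf8_seq_len 0xA3   # second byte of £
```

This prints 2 for 0xC2 and 0 for 0xA3: the first byte promises a two-byte sequence, and the second byte is a continuation that cannot stand on its own.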


$ echo -n £ | xxd
00000000: c2a3                                     ..

$ echo -n £ | wc --bytes
2