In UTF-8 collation, why is 11-a less than 1-a?
I found that sort gives different results under an ASCII locale and under a UTF-8 locale.
Source file test:
1-
11-
1-a
11-a
Sort using ASCII:
$ LANG=en_US.ascii sort test
1-
1-a
11-
11-a
And using UTF-8:
$ LANG=en_US.utf8 sort test
1-
11-
11-a
1-a
I find this counter-intuitive, and it is not dictionary order.
Isn't the character '-' (U+002D) always less than [0-9] (U+0030 to U+0039)?
What's the general rule in UTF-8 collation?
And how can I bypass it on Linux, so that - sorts before [0-9] while the other characters keep their UTF-8 ordering? (It should affect the results of ls --sort, sort, etc.)
Solution 1:
The minus sign is ignored in the first pass, so the first pass sorts 1, 11, 1a, 11a. Comparing 11a with 1a, the first characters are equal and the second characters are 1 and a. Since 1 < a, you get 11a < 1a and thus 11-a < 1-a.
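The effect of that first pass can be approximated with a short sketch. This is only an illustration, not how glibc actually implements collation: two_pass_sort is a hypothetical helper that builds a key by deleting every - from the line, sorts by that key in plain byte order, and uses the original line only as a tie-breaker.

```shell
# Hypothetical helper (not part of sort or glibc): mimic a first pass
# that ignores '-', then fall back to the full line to break ties.
two_pass_sort() {
    printf '%s\n' "$@" |
        # Emit "key<TAB>original", where key is the line without '-'.
        awk '{ key = $0; gsub(/-/, "", key); print key "\t" $0 }' |
        # Sort by key first, then by the original line, in byte order.
        LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2 |
        # Drop the key and keep only the original line.
        cut -f2
}

two_pass_sort 1- 11- 1-a 11-a
# prints:
# 1-
# 11-
# 11-a
# 1-a
```

The keys are 1, 11, 1a and 11a, and 11a sorts before 1a, which reproduces the order seen with en_US.utf8 in the question.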
- is a variable collation element, meaning that you (or the implementor) can choose to ignore it. The glibc implementation apparently does so. In practice, most punctuation is affected by this behavior.
You can read up on the gory details in the Unicode Collation Algorithm, modulo how glibc implements it.
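As for bypassing it: the answer above does not cover this, so take the following as a sketch under an assumption, namely that plain byte order is acceptable for the whole line. Setting LC_ALL=C for a single command makes sort compare bytes, which puts - (0x2d) before the digits again. It does not leave the ordering of other characters unchanged, so it only partly meets the requirement in the question.

```shell
# Override the locale for this one command only; the C locale
# compares raw bytes, so '-' (0x2d) sorts before '1' (0x31).
printf '%s\n' 1- 11- 1-a 11-a | LC_ALL=C sort
# prints:
# 1-
# 1-a
# 11-
# 11-a
```

The same prefix works for other locale-aware tools, for example LC_ALL=C ls. Setting only LC_COLLATE=C should have the same effect on ordering, provided LC_ALL is not already set in the environment, since LC_ALL takes precedence.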