What are the R sorting rules of character vectors?
R sorts character vectors in an order that I would describe as alphabetic rather than ASCII.
For example:
sort(c("dog", "Cat", "Dog", "cat"))
[1] "cat" "Cat" "dog" "Dog"
Three questions:
- What is the technically correct terminology to describe this sort order?
- I cannot find any reference to this in the manuals on CRAN. Where can I find a description of the sorting rules in R?
- Is this any different from the sorting behaviour in other languages like C, Java, Perl or PHP?
The Details section of help(sort) states:
The sort order for character vectors will depend on the collating sequence of the locale in use: see ‘Comparison’. The sort order for factors is the order of their levels (which is particularly appropriate for ordered factors).
and help(Comparison) then shows:
Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see ‘locales’. The collating sequence of locales such as ‘en_US’ is normally different from ‘C’ (which should use ASCII) and can be surprising. Beware of making _any_ assumptions about the collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’, and collation is not necessarily character-by-character - in Danish ‘aa’ sorts as a single letter, after ‘z’. In Welsh ‘ng’ may or may not be a single sorting unit: if it is it follows ‘g’. Some platforms may not respect the locale and always sort in numerical order of the bytes in an 8-bit locale, or in Unicode point order for a UTF-8 locale (and may not sort in the same order for the same language in different character sets). Collation of non-letters (spaces, punctuation signs, hyphens, fractions and so on) is even more problematic.
So the sort order depends on your locale setting, specifically on its collating sequence.
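If you want to see the difference for yourself, here is a minimal sketch using the vector from the question. It assumes a reasonably recent R in which sort() accepts method = "radix" for character vectors (documented as always collating as in the C locale), and it switches LC_COLLATE to "C" only temporarily. The variable names (x, c_order, old_collate, c_locale_order) are just for illustration.

```r
x <- c("dog", "Cat", "Dog", "cat")

# Show which collation locale the current session uses.
Sys.getlocale("LC_COLLATE")

# Locale-dependent sort: the result varies with the locale reported above.
sort(x)

# method = "radix" ignores the locale and collates as in the C locale,
# so upper-case letters come before lower-case ones.
c_order <- sort(x, method = "radix")
print(c_order)
# [1] "Cat" "Dog" "cat" "dog"

# Alternatively, switch the collation locale to "C" for a moment,
# sort, and then restore the previous setting.
old_collate <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "C")
c_locale_order <- sort(x)
Sys.setlocale("LC_COLLATE", old_collate)
print(c_locale_order)
```

The radix variant is probably the safer choice when you need a reproducible order across machines, since it does not touch the session's locale at all.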