Sorting is not consistent using the Unix command 'sort'
Solution 1:
The reason you're getting these results is that your sort is not numeric; it compares the columns character by character.
sort has a command-line switch, -n, that sorts numerically, which is what you want (see 'man sort').
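A minimal sketch of the difference, using three made-up values rather than the asker's data. LC_ALL=C is set here only so that the default (lexical) ordering is predictable:

```shell
# Default sort compares lines character by character.
lexical=$(printf '%s\n' 10 9 100 | LC_ALL=C sort)
# -n compares the leading numeric value of each line instead.
numeric=$(printf '%s\n' 10 9 100 | LC_ALL=C sort -n)

printf 'lexical:\n%s\n' "$lexical"
printf 'numeric:\n%s\n' "$numeric"
```

The lexical run puts 9 last because the character "9" sorts after "1"; the numeric run puts it first.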
Solution 2:
There's something wrong with your question: you claim to use $'\xE7' as the field separator, but that byte doesn't appear in the file. If this is really the command you ran and these are really your outputs, then file A was sorted based on the whole line and file B was sorted in an arbitrary order (all fields 2 are empty, and sort is not stable by default). However, since your output from file B does look sorted on the second “, ”-separated field, I guess this is a bug in your question: either your code used a space or comma as the separator, or your data contains the byte E7 where the data shown here has a comma and a space.
If you do pass a -t option to set a separator for sort, you must pass the same separator to join. In any case, you need to tell join which columns to join on. For example:
<a.input sort -t $'\xE7' -k1,1 >a.sorted
<b.input sort -t $'\xE7' -k2,2 >b.sorted
join -1 1 -2 2 -t $'\xE7' a.sorted b.sorted >joined
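To try the same pipeline end to end, here is a small self-contained sketch. The sample rows and file names are invented, and '|' stands in for the real separator byte purely so the example is portable to shells without $'...' quoting; it is an assumption, not the asker's actual data:

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '2|b' '10|a' '1|c' > "$tmp/a.input"
printf '%s\n' 'x|10' 'y|1' 'z|3' > "$tmp/b.input"

# Sort each file on its join field only, with the same separator and locale.
LC_ALL=C sort -t '|' -k1,1 "$tmp/a.input" > "$tmp/a.sorted"
LC_ALL=C sort -t '|' -k2,2 "$tmp/b.input" > "$tmp/b.sorted"

# Join field 1 of the first file against field 2 of the second.
joined=$(LC_ALL=C join -t '|' -1 1 -2 2 "$tmp/a.sorted" "$tmp/b.sorted")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

Each output line starts with the join field, followed by the remaining fields of the first file and then those of the second. Keys present in only one file (2 and 3 here) are dropped.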
Furthermore, given that “11622409 , ” appears before “1162240 , ” in your output from file A, it appears that you're running sort in a locale that produces results approaching human sorting rules (only approaching, because sort is not refined enough to match the fairly complicated rules used in serious typography). You will get less surprising results if you change your locale to one that produces results suitable for computer consumption. In practice, that means your LC_COLLATE setting should be C (or its synonym POSIX). (Any other locale tends to break scripts that use sort, though yours should in fact be ok.) Example:
$ cat a
11622409 , abdde, def
1162241 , abe, deed
11622410, def,dede
$ LC_COLLATE=en_US sort <a
11622409 , abdde, def
11622410, def,dede
1162241 , abe, deed
$ LC_COLLATE=C sort <a
11622409 , abdde, def
1162241 , abe, deed
11622410, def,dede
If you're running join in the same locale as sort, you should be ok. Note that sort produces lexically sorted output, not numerically sorted; but that is what you want as the input to join.
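One way to see why numerically sorted input would not suit join is to ask sort itself whether a sequence is already in its default (lexical) order, using the -c check option. This is only a sketch with made-up values, run in the C locale:

```shell
# Numeric order (what sort -n would produce) fails the lexical-order check.
if printf '%s\n' 9 10 100 | LC_ALL=C sort -c 2>/dev/null; then
    numeric_ok=yes
else
    numeric_ok=no
fi

# Lexical order in the C locale passes it.
if printf '%s\n' 10 100 9 | LC_ALL=C sort -c 2>/dev/null; then
    lexical_ok=yes
else
    lexical_ok=no
fi

echo "numeric order accepted: $numeric_ok"
echo "lexical order accepted: $lexical_ok"
```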