NumPy: sort ndarray on multiple columns

Sort a NumPy ndarray by its 1st, 2nd or 3rd column:

>>> a = np.array([[1,30,200], [2,20,300], [3,10,100]])

>>> a
array([[  1,  30, 200],
       [  2,  20, 300],
       [  3,  10, 100]])

>>> a[a[:,2].argsort()]           #sort by the 3rd column ascending
array([[  3,  10, 100],
       [  1,  30, 200],
       [  2,  20, 300]])

>>> a[a[:,2].argsort()][::-1]     #sort by the 3rd column descending
array([[  2,  20, 300],
       [  1,  30, 200],
       [  3,  10, 100]])

>>> a[a[:,1].argsort()]        #sort by the 2nd column ascending
array([[  3,  10, 100],
       [  2,  20, 300],
       [  1,  30, 200]])

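To explain what is going on here: argsort() returns an array of integer indices, namely the positions of the parent array's elements in sorted order. Indexing the parent array with those indices yields the sorted result: https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html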

>>> x = np.array([15, 30, 4, 80, 6])
>>> np.argsort(x)
array([2, 4, 0, 1, 3])
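
As a minimal sketch of why this works for row sorting: the index array returned by argsort() can be fed straight back into the array it came from. The variable name idx is only illustrative.

```python
import numpy as np

x = np.array([15, 30, 4, 80, 6])
idx = np.argsort(x)   # positions of the elements in ascending order
print(idx)            # [2 4 0 1 3]
print(x[idx])         # [ 4  6 15 30 80]
```

The same indexing step applied to a 2-D array, as in a[a[:,2].argsort()], reorders whole rows instead of single values.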

Sort by column 1, then by column 2, then by column 3 (np.lexsort treats the last key in the tuple as the primary key):

>>> a = np.array([[2,30,200], [1,30,200], [1,10,200]])

>>> a
array([[  2,  30, 200],
       [  1,  30, 200],
       [  1,  10, 200]])

>>> a[np.lexsort((a[:,2], a[:,1],a[:,0]))]
array([[  1,  10, 200],
       [  1,  30, 200],
       [  2,  30, 200]])

Same as above but reversed:

>>> a[np.lexsort((a[:,2], a[:,1],a[:,0]))][::-1]
array([[  2,  30, 200],
       [  1,  30, 200],
       [  1,  10, 200]])
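
Reversing with [::-1] flips every key at once. If only one key should be descending, one possible approach (a sketch, assuming signed numeric columns, since negation wraps around for unsigned integers) is to negate that key before passing it to lexsort. The names idx and mixed are illustrative.

```python
import numpy as np

a = np.array([[2, 30, 200], [1, 30, 200], [1, 10, 200]])

# primary key: column 1 ascending; secondary key: column 2 descending
idx = np.lexsort((-a[:, 1], a[:, 0]))
mixed = a[idx]
print(mixed)
```

Here the two rows that tie on the first column come out with 30 before 10, while the first column itself stays ascending.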

Import the data, letting NumPy guess the types, and sort in place:

import numpy as np

# let numpy guess the type with dtype=None
my_data = np.genfromtxt(infile, dtype=None, names=["a", "b", "c", "d"])

# access columns by name
print(my_data["b"]) # column 1

# sort in place by column 1 ("b"), breaking ties with column 0 ("a")
my_data.sort(order=["b", "a"])

# save specifying required format (tab separated values)
np.savetxt("sorted.tsv", my_data, fmt="%d\t%d\t%.6f\t%.6f")
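
The snippet above leaves infile undefined. A self-contained sketch of the same load-and-sort step, assuming whitespace-delimited input and using an in-memory io.StringIO with made-up sample rows in place of a real file, could look like this:

```python
import io
import numpy as np

# hypothetical sample data standing in for the contents of `infile`
text = "3 2 4.0 0.0\n2 1 2.0 0.0\n2 2 100.0 0.0\n3 1 2.0 0.0\n"

# dtype=None lets NumPy infer a type per column; names label the fields
my_data = np.genfromtxt(io.StringIO(text), dtype=None, names=["a", "b", "c", "d"])

# in-place sort: primary key "b", ties broken by "a"
my_data.sort(order=["b", "a"])
print(my_data["a"])   # [2 3 2 3]
print(my_data["b"])   # [1 1 2 2]
```

Note that order= lists the primary key first, which is the opposite of the convention np.lexsort uses.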

Alternatively, specifying the input format and sorting to a new array:

import numpy as np

# tell numpy the first 2 columns are int and the last 2 are floats
my_data = np.genfromtxt(infile, dtype=[('a', '<i8'), ('b', '<i8'), ('x', '<f8'), ('d', '<f8')])

# access columns by name
print(my_data["b"]) # column 1

# get the indices to sort the array using lexsort
# the last element of the tuple (column 1) is used as the primary key
ind = np.lexsort((my_data["a"], my_data["b"]))

# create a new, sorted array
sorted_data = my_data[ind]

# save specifying required format (tab separated values)
np.savetxt("sorted.tsv", sorted_data, fmt="%d\t%d\t%.6f\t%.6f")

Output:

2   1   2.000000    0.000000
3   1   2.000000    0.000000
4   1   2.000000    0.000000
2   2   100.000000  0.000000
3   2   4.000000    0.000000
4   2   4.000000    0.000000
2   3   100.000000  0.000000
3   3   6.000000    0.000000
4   3   6.000000    0.000000

With np.lexsort you can sort on several columns at once. The keys must be passed in reverse order of priority, with the primary key last. That means np.lexsort((col_b,col_a)) sorts by col_a first and uses col_b only to break ties:

my_data = np.array([[   2.,    1.,    2.,    0.],
                    [   2.,    2.,  100.,    0.],
                    [   2.,    3.,  100.,    0.],
                    [   3.,    1.,    2.,    0.],
                    [   3.,    2.,    4.,    0.],
                    [   3.,    3.,    6.,    0.],
                    [   4.,    1.,    2.,    0.],
                    [   4.,    2.,    4.,    0.],
                    [   4.,    3.,    6.,    0.]])

ind = np.lexsort((my_data[:,0],my_data[:,1]))
my_data[ind]

Result:

array([[  2.,   1.,   2.,   0.],
       [  3.,   1.,   2.,   0.],
       [  4.,   1.,   2.,   0.],
       [  2.,   2., 100.,   0.],
       [  3.,   2.,   4.,   0.],
       [  4.,   2.,   4.,   0.],
       [  2.,   3., 100.,   0.],
       [  3.,   3.,   6.,   0.],
       [  4.,   3.,   6.,   0.]])
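
Because the reversed key order is easy to get wrong, it can help to hide it behind a small wrapper. The helper sort_rows below is hypothetical (it is not part of NumPy) and is only a sketch for 2-D arrays; it takes column indices from most to least significant and reverses them for lexsort.

```python
import numpy as np

def sort_rows(arr, columns):
    """Return the rows of 2-D `arr` sorted by `columns`, most significant first."""
    # lexsort expects the primary key last, so reverse the requested order
    keys = tuple(arr[:, c] for c in reversed(columns))
    return arr[np.lexsort(keys)]

my_data = np.array([[2., 1.,   2., 0.],
                    [2., 2., 100., 0.],
                    [3., 1.,   2., 0.],
                    [3., 2.,   4., 0.]])

# sort by column 1 first, then by column 0
by_col1_then_col0 = sort_rows(my_data, [1, 0])
print(by_col1_then_col0)
```

With this wrapper the call reads in priority order, while the lexsort call underneath is equivalent to np.lexsort((my_data[:,0], my_data[:,1])).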

If you know that your first column is already sorted, you can use:

ind = my_data[:,1].argsort(kind='stable')
my_data[ind]

A stable sort preserves the relative order of equal items, so rows that tie on the second column keep their existing first-column order. The default quicksort does not guarantee that, though it is typically faster.
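
A quick way to check that claim is to compare the stable argsort against lexsort on data whose first column is already ascending. The sample rows below are made up for illustration.

```python
import numpy as np

# column 0 is already in ascending order
my_data = np.array([[2., 3., 100., 0.],
                    [2., 1.,   2., 0.],
                    [3., 2.,   4., 0.],
                    [3., 1.,   2., 0.],
                    [4., 1.,   2., 0.]])

stable_ind = my_data[:, 1].argsort(kind="stable")
lex_ind = np.lexsort((my_data[:, 0], my_data[:, 1]))

print(stable_ind)                            # [1 3 4 2 0]
print(np.array_equal(stable_ind, lex_ind))   # True
```

The shortcut only holds while the first column really is pre-sorted; otherwise lexsort is the safer choice.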


This method works for any NumPy array:

import numpy as np

my_data = [[   2.,    1.,    2.,    0.],
           [   2.,    2.,  100.,    0.],
           [   2.,    3.,  100.,    0.],
           [   3.,    1.,    2.,    0.],
           [   3.,    2.,    4.,    0.],
           [   3.,    3.,    6.,    0.],
           [   4.,    1.,    2.,    0.],
           [   4.,    2.,    4.,    0.],
           [   4.,    3.,    6.,    0.]]
my_data = np.array(my_data)
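# build a record array whose fields are the sort keys, primary key first
# (np.rec is the public namespace; np.core is deprecated as of NumPy 2.0)
r = np.rec.fromarrays([my_data[:,1],my_data[:,0]],names='a,b')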
my_data = my_data[r.argsort()]
print(my_data)

Result:

[[  2.   1.   2.   0.]
 [  3.   1.   2.   0.]
 [  4.   1.   2.   0.]
 [  2.   2. 100.   0.]
 [  3.   2.   4.   0.]
 [  4.   2.   4.   0.]
 [  2.   3. 100.   0.]
 [  3.   3.   6.   0.]
 [  4.   3.   6.   0.]]