Is there any numpy group by function?
Is there any function in numpy to group this array down below by the first column?
I couldn't find any good answer on the internet.
>>> a
array([[  1, 275],
       [  1, 441],
       [  1, 494],
       [  1, 593],
       [  2, 679],
       [  2, 533],
       [  2, 686],
       [  3, 559],
       [  3, 219],
       [  3, 455],
       [  4, 605],
       [  4, 468],
       [  4, 692],
       [  4, 613]])
Wanted output:
array([[[275, 441, 494, 593]],
       [[679, 533, 686]],
       [[559, 219, 455]],
       [[605, 468, 692, 613]]], dtype=object)
Solution 1:
Inspired by Eelco Hoogendoorn's library, but without needing his library, and using the fact that the first column of your array is always increasing (if not, sort first with a = a[a[:, 0].argsort()]):
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]
I didn't "timeit" this ([EDIT] see below), but it is probably the fastest way to solve the question:
- No native Python loop
- The result lists are numpy arrays, so if you need to apply other numpy operations to them, no new conversion will be needed
- Complexity looks O(n) (O(n log n) if the sort is needed)
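To see the mechanics, here is what the intermediate np.unique call returns on the example array (a quick sketch of the index arithmetic):
>>> keys, first_idx = np.unique(a[:, 0], return_index=True)
>>> keys
array([1, 2, 3, 4])
>>> first_idx  # index of the first occurrence of each key
array([ 0,  4,  7, 10])
>>> np.split(a[:, 1], first_idx[1:])  # drop the leading 0 to get the split points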
[EDIT sept 2021] I ran timeit on my MacBook M1, for a table of 10k random integers. Each duration is the total for 1000 calls.
>>> a = np.random.randint(5, size=(10000, 2)) # 5 different "groups"
# Only the sort
>>> a = a[a[:, 0].argsort()]
⏱ 116.9 ms
# Group by on the already sorted table
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
⏱ 35.5 ms
# Total sort + groupby
>>> a = a[a[:, 0].argsort()]
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
⏱ 153.0 ms 👑
# With numpy-indexed package (cf Eelco answer)
>>> npi.group_by(a[:, 0]).split(a[:, 1])
⏱ 353.3 ms
# With pandas (cf Piotr answer)
>>> df = pd.DataFrame(a, columns=["key", "val"]) # no timer for this line
>>> df.groupby("key").val.apply(pd.Series.tolist)
⏱ 362.3 ms
# With defaultdict, the python native way (cf Piotr answer)
>>> d = defaultdict(list)
>>> for key, val in a:
...     d[key].append(val)
⏱ 3543.2 ms
# With numpy_groupies (cf Michael answer)
>>> aggregate(a[:,0], a[:,1], "array", fill_value=[])
⏱ 376.4 ms
Second timeit scenario, with 500 different groups instead of 5. I'm surprised about pandas: I ran it several times, but it just behaves badly in this scenario.
>>> a = np.random.randint(500, size=(10000, 2))
just the sort     141.1 ms
already_sorted    392.0 ms
sort+groupby      542.4 ms
pandas           2695.8 ms
numpy-indexed     800.6 ms
defaultdict      3707.3 ms
numpy_groupies    836.7 ms
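For reference, timings like these can be reproduced with a plain timeit harness along the following lines (a sketch; the exact harness used above isn't shown):
>>> import timeit
>>> a = np.random.randint(5, size=(10000, 2))
>>> a = a[a[:, 0].argsort()]  # pre-sort, as in the "already sorted" case
>>> timeit.timeit(
...     lambda: np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:]),
...     number=1000)  # total seconds for 1000 calls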
[EDIT] I improved the answer thanks to ns63sr's answer and Behzad Shayegh (cf. comment). Thanks also to TMBailey for noticing that the complexity of argsort is O(n log n).
Solution 2:
The numpy_indexed package (disclaimer: I am its author) aims to fill this gap in numpy. All operations in numpy-indexed are fully vectorized, and no O(n^2) algorithms were harmed during the making of this library.
import numpy_indexed as npi
npi.group_by(a[:, 0]).split(a[:, 1])
Note that it is usually more efficient to compute relevant properties directly over such groups (i.e., group_by(keys).mean(values)), rather than first splitting into a list / jagged array.
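For example, a quick sketch of both patterns (the reductions return the unique keys alongside one aggregate per key):
>>> import numpy_indexed as npi
>>> npi.group_by(a[:, 0]).split(a[:, 1])  # jagged list of per-key arrays
>>> unique_keys, means = npi.group_by(a[:, 0]).mean(a[:, 1])  # one mean per key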