Is there any numpy group by function?
Is there any function in numpy to group this array down below by the first column?
I couldn't find any good answer on the internet.
>>> a
array([[  1, 275],
       [  1, 441],
       [  1, 494],
       [  1, 593],
       [  2, 679],
       [  2, 533],
       [  2, 686],
       [  3, 559],
       [  3, 219],
       [  3, 455],
       [  4, 605],
       [  4, 468],
       [  4, 692],
       [  4, 613]])
Wanted output:
array([[[275, 441, 494, 593]],
       [[679, 533, 686]],
       [[559, 219, 455]],
       [[605, 468, 692, 613]]], dtype=object)
Solution 1:
Inspired by Eelco Hoogendoorn's library, but without needing his library, and using the fact that the first column of your array is always increasing (if not, sort first with a = a[a[:, 0].argsort()]):
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]
I didn't "timeit" this ([EDIT] see below), but it is probably the fastest way to solve the question:
- No native Python loop
- The result lists are numpy arrays, so if you need to apply other numpy operations to them, no new conversion will be needed
- Complexity looks O(n) (O(n log n) if the sort is needed)
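To see the mechanics, here is what the intermediate np.unique call returns on the example array (a quick sketch of the index arithmetic):
>>> keys, first_idx = np.unique(a[:, 0], return_index=True)
>>> keys
array([1, 2, 3, 4])
>>> first_idx  # index of the first occurrence of each key
array([ 0,  4,  7, 10])
>>> np.split(a[:, 1], first_idx[1:])  # drop the leading 0 to get the split points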
[EDIT sept 2021] I ran timeit on my MacBook M1, for a table of 10k random integers. Each duration is the total for 1000 calls.
>>> a = np.random.randint(5, size=(10000, 2)) # 5 different "groups"
# Only the sort
>>> a = a[a[:, 0].argsort()]
⏱ 116.9 ms
# Group by on the already sorted table
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
⏱ 35.5 ms
# Total sort + groupby
>>> a = a[a[:, 0].argsort()]
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
⏱ 153.0 ms 👑
# With numpy-indexed package (cf Eelco answer)
>>> npi.group_by(a[:, 0]).split(a[:, 1])
⏱ 353.3 ms
# With pandas (cf Piotr answer)
>>> df = pd.DataFrame(a, columns=["key", "val"]) # no timer for this line
>>> df.groupby("key").val.apply(pd.Series.tolist)
⏱ 362.3 ms
# With defaultdict, the python native way (cf Piotr answer)
>>> d = defaultdict(list)
>>> for key, val in a:
...     d[key].append(val)
⏱ 3543.2 ms
# With numpy_groupies (cf Michael answer)
>>> aggregate(a[:,0], a[:,1], "array", fill_value=[])
⏱ 376.4 ms
Second timeit scenario, with 500 different groups instead of 5. I'm surprised about pandas: I ran it several times, but it just behaves badly in this scenario.
>>> a = np.random.randint(500, size=(10000, 2))
just the sort     141.1 ms
already_sorted    392.0 ms
sort+groupby      542.4 ms
pandas           2695.8 ms
numpy-indexed     800.6 ms
defaultdict      3707.3 ms
numpy_groupies    836.7 ms
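For reference, timings like these can be reproduced with a plain timeit harness along the following lines (a sketch; the exact harness used above isn't shown):
>>> import timeit
>>> a = np.random.randint(5, size=(10000, 2))
>>> a = a[a[:, 0].argsort()]  # pre-sort, as in the "already sorted" case
>>> timeit.timeit(
...     lambda: np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:]),
...     number=1000)  # total seconds for 1000 calls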
[EDIT] I improved the answer thanks to ns63sr's answer and Behzad Shayegh (cf. comment). Thanks also to TMBailey for noticing that the complexity of argsort is O(n log n).
Solution 2:
The numpy_indexed package (disclaimer: I am its author) aims to fill this gap in numpy. All operations in numpy-indexed are fully vectorized, and no O(n^2) algorithms were harmed during the making of this library.
import numpy_indexed as npi
npi.group_by(a[:, 0]).split(a[:, 1])
Note that it is usually more efficient to compute relevant properties directly over such groups (i.e., group_by(keys).mean(values)), rather than first splitting into a list / jagged array.
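For example, a quick sketch of both patterns (the reductions return the unique keys alongside one aggregate per key):
>>> import numpy_indexed as npi
>>> npi.group_by(a[:, 0]).split(a[:, 1])  # jagged list of per-key arrays
>>> unique_keys, means = npi.group_by(a[:, 0]).mean(a[:, 1])  # one mean per key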