Numpy shuffle multidimensional array by row only, keep column order unchanged
How can I shuffle a multidimensional array by row only in Python (that is, without shuffling the columns)?
I am looking for the most efficient solution, because my matrix is very large. Is it also possible to do this efficiently in place on the original array (to save memory)?
Example:
import numpy as np
X = np.random.random((6, 2))
print(X)
Y = ...  # ??? shuffle by row only, not by column ???
print(Y)
Suppose this is the original matrix:
[[ 0.48252164 0.12013048]
[ 0.77254355 0.74382174]
[ 0.45174186 0.8782033 ]
[ 0.75623083 0.71763107]
[ 0.26809253 0.75144034]
[ 0.23442518 0.39031414]]
The expected output has the rows shuffled but not the columns, e.g.:
[[ 0.45174186 0.8782033 ]
[ 0.48252164 0.12013048]
[ 0.77254355 0.74382174]
[ 0.75623083 0.71763107]
[ 0.23442518 0.39031414]
[ 0.26809253 0.75144034]]
You can use numpy.random.shuffle().
This function only shuffles the array along the first axis of a multi-dimensional array. The order of sub-arrays is changed but their contents remain the same. The shuffle is done in place, so the original array is modified and nothing is returned.
In [2]: import numpy as np
In [3]: X = np.random.random((6, 2))
In [4]: X
Out[4]:
array([[0.71935047, 0.25796155],
[0.4621708 , 0.55140423],
[0.22605866, 0.61581771],
[0.47264172, 0.79307633],
[0.22701656, 0.11927993],
[0.20117207, 0.2754544 ]])
In [5]: np.random.shuffle(X)
In [6]: X
Out[6]:
array([[0.71935047, 0.25796155],
[0.47264172, 0.79307633],
[0.4621708 , 0.55140423],
[0.22701656, 0.11927993],
[0.20117207, 0.2754544 ],
[0.22605866, 0.61581771]])
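A quick way to confirm that only the row order changes is to compare the set of rows before and after the shuffle. This is a minimal sketch; the variable names (original, rows_before, rows_after) are made up for illustration.

```python
import numpy as np

X = np.random.random((6, 2))
original = X.copy()  # kept only so we can compare afterwards

np.random.shuffle(X)  # in place, along the first axis (rows)

# Each row is turned into a tuple so the rows can be sorted and compared as units.
rows_before = sorted(map(tuple, original))
rows_after = sorted(map(tuple, X))

print(rows_before == rows_after)  # same rows, possibly in a different order
```

The copy is only needed for the check; the shuffle itself does not need one.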
For other functionality, you can also check out the following functions:
- random.Generator.shuffle
- random.Generator.permutation
- random.Generator.permuted
The function random.Generator.permuted was introduced in NumPy's 1.20.0 release.
The new function differs from shuffle and permutation in that the subarrays indexed by an axis are permuted rather than the axis being treated as a separate 1-D array for every combination of the other indexes. For example, it is now possible to permute the rows or columns of a 2-D array.
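To make the difference between the three Generator methods concrete, here is a small sketch. The seed and the array contents are arbitrary choices for illustration; each row of X holds a pair (2k, 2k+1), so it is easy to see whether rows stayed together.

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed, only for repeatability
X = np.arange(12).reshape(6, 2)      # rows are [0, 1], [2, 3], ..., [10, 11]

# shuffle: reorders whole rows in place
A = X.copy()
rng.shuffle(A, axis=0)

# permutation: returns a row-shuffled copy and leaves X untouched
B = rng.permutation(X, axis=0)

# permuted: shuffles along axis 0 independently for each column,
# so the values of a row are generally NOT kept together
C = rng.permuted(X, axis=0)

print(A)
print(B)
print(C)
```

For the question asked here (keep each row intact), shuffle or permutation is the one to use; permuted answers a different need.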
You can also use np.random.permutation to generate a random permutation of row indices and then index into the rows of X using np.take with axis=0. Also, np.take facilitates overwriting the input array X itself with its out= option, which would save us memory. Thus, the implementation would look like this -
np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
Sample run -
In [23]: X
Out[23]:
array([[ 0.60511059, 0.75001599],
[ 0.30968339, 0.09162172],
[ 0.14673218, 0.09089028],
[ 0.31663128, 0.10000309],
[ 0.0957233 , 0.96210485],
[ 0.56843186, 0.36654023]])
In [24]: np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X);
In [25]: X
Out[25]:
array([[ 0.14673218, 0.09089028],
[ 0.31663128, 0.10000309],
[ 0.30968339, 0.09162172],
[ 0.56843186, 0.36654023],
[ 0.0957233 , 0.96210485],
[ 0.60511059, 0.75001599]])
Additional performance boost
Here's a trick to speed up np.random.permutation(X.shape[0]) with np.argsort() -
np.random.rand(X.shape[0]).argsort()
Speedup results -
In [32]: X = np.random.random((6000, 2000))
In [33]: %timeit np.random.permutation(X.shape[0])
1000 loops, best of 3: 510 µs per loop
In [34]: %timeit np.random.rand(X.shape[0]).argsort()
1000 loops, best of 3: 297 µs per loop
Thus, the shuffling solution could be modified to -
np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
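The argsort trick works because sorting n random floats yields the indices 0..n-1 in a random order. A minimal sketch checking that both index generators produce valid row permutations follows; the names idx_perm, idx_sort and shuffled are illustrative, and the result is written to a new array here rather than to X so the original stays available for comparison.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(6, 2)
n = X.shape[0]

idx_perm = np.random.permutation(n)       # direct random permutation
idx_sort = np.random.rand(n).argsort()    # permutation via sorting random keys

# Both should contain every row index exactly once.
print(np.array_equal(np.sort(idx_perm), np.arange(n)))
print(np.array_equal(np.sort(idx_sort), np.arange(n)))

# Row selection with np.take; equivalent to X[idx_sort]
shuffled = np.take(X, idx_sort, axis=0)
print(shuffled)
```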
Runtime tests -
These tests include the two approaches listed in this post and the np.random.shuffle based one in @Kasramvd's solution.
In [40]: X = np.random.random((6000, 2000))
In [41]: %timeit np.random.shuffle(X)
10 loops, best of 3: 25.2 ms per loop
In [42]: %timeit np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
10 loops, best of 3: 53.3 ms per loop
In [43]: %timeit np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
10 loops, best of 3: 53.2 ms per loop
So, it seems these np.take based approaches are worth using only if memory is a concern; otherwise the np.random.shuffle based solution looks like the way to go.
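Timings like these depend on the machine and the NumPy version, so it may be worth re-running the comparison locally. The following is a rough sketch using the standard timeit module instead of %timeit; the array size is reduced from the 6000 x 2000 used above to keep the run short, and the candidates and timings dictionaries are illustrative names.

```python
import timeit
import numpy as np

X = np.random.random((600, 200))  # smaller than in the tests above, to keep this quick
n = X.shape[0]

candidates = {
    "shuffle": lambda: np.random.shuffle(X),
    "take + permutation": lambda: np.take(X, np.random.permutation(n), axis=0, out=X),
    "take + argsort": lambda: np.take(X, np.random.rand(n).argsort(), axis=0, out=X),
}

timings = {}
for name, fn in candidates.items():
    # best of 3 repeats, 5 calls each, reported per call
    timings[name] = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f"{name}: {timings[name] * 1e3:.3f} ms per call")
```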