Shuffle DataFrame rows
I have the following DataFrame:
Col1 Col2 Col3 Type
0 1 2 3 1
1 4 5 6 1
...
20 7 8 9 2
21 10 11 12 2
...
45 13 14 15 3
46 16 17 18 3
...
The DataFrame is read from a CSV file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.
I would like to shuffle the order of the DataFrame's rows so that all Types are mixed. A possible result could be:
Col1 Col2 Col3 Type
0 7 8 9 2
1 13 14 15 3
...
20 1 2 3 1
21 10 11 12 2
...
45 4 5 6 1
46 16 17 18 3
...
How can I achieve this?
The idiomatic way to do this with Pandas is to use the .sample method of your dataframe to sample all rows without replacement:
df.sample(frac=1)
The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).
Note: If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.
df = df.sample(frac=1).reset_index(drop=True)
Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
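As a minimal sketch of the two steps above, the following builds a small hypothetical DataFrame shaped like the one in the question (the values and the names `shuffled` and `shuffled_reset` are made up for illustration) and shuffles it with and without resetting the index:

```python
import pandas as pd

# Hypothetical data mirroring the question's layout: rows grouped by Type.
df = pd.DataFrame({
    "Col1": [1, 4, 7, 10, 13, 16],
    "Col2": [2, 5, 8, 11, 14, 17],
    "Col3": [3, 6, 9, 12, 15, 18],
    "Type": [1, 1, 2, 2, 3, 3],
})

# frac=1 returns every row exactly once, in random order.
# The original index labels travel with their rows.
shuffled = df.sample(frac=1)

# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old labels.
shuffled_reset = shuffled.reset_index(drop=True)

print(shuffled)
print(shuffled_reset)
```

Whether to reset the index depends on whether the old labels still carry meaning for you; keeping them makes it possible to trace each row back to its original position.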
Follow-up note: The above operation is not in-place: df.sample returns a new DataFrame, so id(df_old) is not the same as id(df_new). However, once df is rebound to the shuffled result, the original object is no longer referenced and can be freed, so the memory in use after the statement is essentially unchanged. You can see this with a simple memory profiler, keeping in mind that it reports usage after each line and not any transient peak during it:
$ python3 -m memory_profiler .\test.py
Filename: .\test.py
Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
You can simply use sklearn for this:
from sklearn.utils import shuffle
df = shuffle(df)
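A short sketch of the same call with a fixed seed, assuming scikit-learn is installed; the DataFrame contents and the seed value 0 are placeholders chosen for illustration:

```python
import pandas as pd
from sklearn.utils import shuffle

# Hypothetical data; any DataFrame works the same way.
df = pd.DataFrame({"Col1": [1, 4, 7, 10], "Type": [1, 1, 2, 2]})

# random_state makes the shuffle repeatable; 0 is an arbitrary choice.
shuffled = shuffle(df, random_state=0)

# The original index labels are shuffled along with the rows,
# so renumber them if you want 0..n-1 again.
shuffled = shuffled.reset_index(drop=True)

print(shuffled)
```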
You can shuffle the rows of a dataframe by indexing with a shuffled index. For this, you can e.g. use np.random.permutation (np.random.choice with replace=False is also a possibility):
In [12]: df = pd.read_csv(StringIO(s), sep=r"\s+")
In [13]: df
Out[13]:
Col1 Col2 Col3 Type
0 1 2 3 1
1 4 5 6 1
20 7 8 9 2
21 10 11 12 2
45 13 14 15 3
46 16 17 18 3
In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]:
Col1 Col2 Col3 Type
46 16 17 18 3
45 13 14 15 3
20 7 8 9 2
0 1 2 3 1
1 4 5 6 1
21 10 11 12 2
If you want to keep the index numbered 0, 1, ..., n-1 as in your example, you can simply reset the index of the shuffled result: df_shuffled.reset_index(drop=True)
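The approach can be sketched end to end as follows. This version swaps in NumPy's newer Generator API (np.random.default_rng) in place of np.random.permutation so that a seed can be passed; the seed value, the sample data, and the names `order`, `df_shuffled`, and `df_renumbered` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with the non-contiguous index shown in the session above.
df = pd.DataFrame(
    {
        "Col1": [1, 4, 7, 10, 13, 16],
        "Col2": [2, 5, 8, 11, 14, 17],
        "Col3": [3, 6, 9, 12, 15, 18],
        "Type": [1, 1, 2, 2, 3, 3],
    },
    index=[0, 1, 20, 21, 45, 46],
)

rng = np.random.default_rng(seed=0)  # the seed is arbitrary

# A random ordering of the positions 0..n-1.
order = rng.permutation(len(df))

# iloc selects rows by position, so each row keeps its original label.
df_shuffled = df.iloc[order]

# Optional: renumber the rows 0..n-1.
df_renumbered = df_shuffled.reset_index(drop=True)

print(df_shuffled)
print(df_renumbered)
```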
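TL;DR: np.random.shuffle(ndarray) can do the job.
So, in your case:
np.random.shuffle(df.values)
Under the hood, a DataFrame uses NumPy ndarrays to hold its data (you can check this in the DataFrame source code). np.random.shuffle() shuffles an array in place along its first axis, so the rows are shuffled while the index of the DataFrame remains unshuffled. Note that this only changes the DataFrame when df.values is a writable view of the underlying data: for columns of mixed dtypes it is a copy, and with pandas Copy-on-Write enabled the returned array may be read-only.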
There are some points to consider, though:
- np.random.shuffle() returns None and shuffles the array in place. If you want to keep a copy of the original object, make it before you pass the array to the function.
- sklearn.utils.shuffle(), as user tj89 suggested, accepts random_state along with another option to control the output. You may want that for development purposes.
- sklearn.utils.shuffle() is faster, but it also shuffles the row labels (index) of the DataFrame along with the ndarray it contains.
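The first point, and the fact that the index stays in order, can be illustrated with a variant that does not depend on df.values being a writable view. This is a sketch under that assumption, not the method benchmarked in this answer: it shuffles an explicit copy of the data and rebuilds the DataFrame. The data and the names `arr`, `result`, and `shuffled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical single-dtype data: each row is (k, k + 1, k + 2).
df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["Col1", "Col2", "Col3"])

# Take an explicit, writable copy so the original DataFrame is left untouched.
arr = df.to_numpy(copy=True)

# np.random.shuffle works in place along the first axis and returns None.
result = np.random.shuffle(arr)

# Rebuild a DataFrame: rows are shuffled, the index is a fresh 0..n-1.
shuffled = pd.DataFrame(arr, columns=df.columns)

print(result)
print(shuffled)
```

Copying costs extra memory, which is the trade-off for not relying on how a particular pandas version exposes its internal arrays.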
Benchmark results for sklearn.utils.shuffle() versus np.random.shuffle() (total time for 1000 calls):
ndarray
nd = sklearn.utils.shuffle(nd)
0.10793248389381915 sec (about 8x faster)
np.random.shuffle(nd)
0.8897626010002568 sec
DataFrame
df = sklearn.utils.shuffle(df)
0.3183923360193148 sec (about 3x faster)
np.random.shuffle(df.values)
0.9357550159329548 sec
Conclusion: If it is okay for the row labels (index) to be shuffled along with the ndarray, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle().
Code used:
import timeit
setup = '''
import numpy as np
import pandas as pd
import sklearn.utils  # import the submodule explicitly
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''
timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000)
It is also useful, if you are doing machine learning and always want to separate the data in the same way, to use:
df.sample(n=len(df), random_state=42)
This makes sure that your random shuffle is always reproducible.
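A quick way to convince yourself of this is to shuffle twice with the same seed and compare the results. The data and the names `first` and `second` below are made up for illustration, and 42 is just the seed used above:

```python
import pandas as pd

# Hypothetical data; ten rows in two groups.
df = pd.DataFrame({"Col1": range(10), "Type": [1] * 5 + [2] * 5})

# Two independent shuffles with the same seed.
first = df.sample(frac=1, random_state=42)
second = df.sample(frac=1, random_state=42)

# equals() compares values and index labels, so it is True only if
# both shuffles produced the same row order.
print(first.equals(second))
```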