shuffling/permutating a DataFrame in pandas
What's a simple and efficient way to shuffle a dataframe in pandas, by rows or by columns? I.e. how to write a function shuffle(df, n, axis=0)
that takes a dataframe, a number of shuffles n
, and an axis (axis=0
is rows, axis=1
is columns) and returns a copy of the dataframe that has been shuffled n
times.
Edit: key is to do this without destroying the row/column labels of the dataframe. If you just shuffle df.index
that loses all that information. I want the resulting df
to be the same as the original except with the order of rows or order of columns different.
Edit2: My question was unclear. When I say shuffle the rows, I mean shuffle each row independently. So if you have two columns a
and b
, I want each row shuffled on its own, so that you don't have the same associations between a
and b
as you do if you just re-order each row as a whole. Something like:
for 1...n:
for each col in df: shuffle column
return new_df
But hopefully more efficient than naive looping. This does not work for me:
def shuffle(df, n, axis=0):
shuffled_df = df.copy()
for k in range(n):
shuffled_df.apply(np.random.shuffle(shuffled_df.values),axis=axis)
return shuffled_df
df = pandas.DataFrame({'A':range(10), 'B':range(10)})
shuffle(df, 5)
Use numpy's random.permuation
function:
In [1]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [2]: df
Out[2]:
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
In [3]: df.reindex(np.random.permutation(df.index))
Out[3]:
A B
0 0 0
5 5 5
6 6 6
3 3 3
8 8 8
7 7 7
9 9 9
1 1 1
2 2 2
4 4 4
Sampling randomizes, so just sample the entire data frame.
df.sample(frac=1)
As @Corey Levinson notes, you have to be careful when you reassign:
df['column'] = df['column'].sample(frac=1).reset_index(drop=True)
In [16]: def shuffle(df, n=1, axis=0):
...: df = df.copy()
...: for _ in range(n):
...: df.apply(np.random.shuffle, axis=axis)
...: return df
...:
In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [18]: shuffle(df)
In [19]: df
Out[19]:
A B
0 8 5
1 1 7
2 7 3
3 6 2
4 3 4
5 0 1
6 9 0
7 4 6
8 2 8
9 5 9
You can use sklearn.utils.shuffle()
(requires sklearn 0.16.1 or higher to support Pandas data frames):
# Generate data
import pandas as pd
df = pd.DataFrame({'A':range(5), 'B':range(5)})
print('df: {0}'.format(df))
# Shuffle Pandas data frame
import sklearn.utils
df = sklearn.utils.shuffle(df)
print('\n\ndf: {0}'.format(df))
outputs:
df: A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
df: A B
1 1 1
0 0 0
3 3 3
4 4 4
2 2 2
Then you can use df.reset_index()
to reset the index column, if needs to be:
df = df.reset_index(drop=True)
print('\n\ndf: {0}'.format(df)
outputs:
df: A B
0 1 1
1 0 0
2 4 4
3 2 2
4 3 3
A simple solution in pandas is to use the sample
method independently on each column. Use apply
to iterate over each column:
df = pd.DataFrame({'a':[1,2,3,4,5,6], 'b':[1,2,3,4,5,6]})
df
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
df.apply(lambda x: x.sample(frac=1).values)
a b
0 4 2
1 1 6
2 6 5
3 5 3
4 2 4
5 3 1
You must use .value
so that you return a numpy array and not a Series, or else the returned Series will align to the original DataFrame not changing a thing:
df.apply(lambda x: x.sample(frac=1))
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6