making and updating multiple pandas dataframes using dicts (avoiding repetative code)
I have a dataframe of id numbers (n = 140, but it could be more or less) and I have 5 group leaders. Each group leader needs to be randomly assigned an amount of these ids (for ease lets make it even so n=28, but I need to be able to control the amounts) and those rows need to be split out into a new df
and then droped from the original dataframe so that there is no crossover between leaders.
import pandas as pd
import numpy as np
#making the df
df = pd.DataFrame()
df['ids'] = np.random.randint(1, 140, size=140)
df['group_leader'] = ''
# list of leader names
leaders = ['John', 'Paul', 'George', 'Ringo', 'Apu']
I can do this for each leader with something like
df.loc[df.sample(n=28).index, 'group_leader'] = 'George'
g = df[df['group_leader']=='George'].copy()
df = df[df['group_leader] != 'George']
print(df.shape()[0]) #double checking that df has less ids in it
However, doing this individually for each group leaders seems really un-pythonic (not that I'm an expert on that) and is not easy to refactor into a function.
I thought that I might be able to do it with a dict
and a for loop
frames = dict.fromkeys('group_leaders', pd.DataFrame())
for i in frames.keys(): #allows me to fill the cells with the string key?
df.loc[df.sample(n=28).index, 'group_leader'] = str(i)
frames[i].update(df[df['group_leader']== str(i)].copy())#also tried append()
print(frames[i].head())
df = df[df['group_leader'] != str(i)]
print(f'df now has {df.shape[0]} ids left') #just in case there's a remainder of ids
However, the new dataframes are still empty and I get the error:
Traceback (most recent call last):
File "C:\Users\path\to\the\file\file.py", line 38, in <module>
df.loc[df.sample(n=28).index, 'group_leader'] = str(i)
File "C:\Users\path\to\the\file\pandas\core\generic.py", line 5356, in sample
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
File "mtrand.pyx", line 909, in numpy.random.mtrand.RandomState.choice
ValueError: a must be greater than 0 unless no samples are taken
This leads me to believe that I'm doing two things wrong:
- Either making the dict incorectly or updating it incorrectly.
- Making the for loop run in such a way that it tries to run 1 too many times.
I have tried to be as clear as possible and present a minimally useful version of what I need, any help would be appreciated.
Note - I'm aware that 5 divides well into 140 and there may be cases where this isn't the case but I'm pretty sure I can handle that myself with if-else
if it's needed.
Solution 1:
You can use np.repeat
and np.random.shuffle
:
leaders = ['John', 'Paul', 'George', 'Ringo', 'Apu']
leaders = np.repeat(leaders, 28)
np.random.shuffle(leaders)
df['group_leader'] = leaders
Output:
>>> df
ids group_leader
0 138 John
1 36 Apu
2 99 John
3 91 George
4 58 Ringo
.. ... ...
135 43 Ringo
136 84 Apu
137 94 John
138 56 Ringo
139 58 Paul
[140 rows x 2 columns]
>>> df.value_counts('group_leader')
group_leader
Apu 28
George 28
John 28
Paul 28
Ringo 28
dtype: int64
Update
df = pd.DataFrame({'ids': np.random.randint(1, 113, size=113)})
leaders = ['John', 'Paul', 'George', 'Ringo', 'Apu']
leaders = np.repeat(leaders, np.ceil(len(df) / len(leaders)))
np.random.shuffle(leaders)
df['group_leader'] = leaders[:len(df)]
Output:
>>> df.value_counts('group_leader')
group_leader
Apu 23
John 23
Ringo 23
George 22
Paul 22
dtype: int64