making and updating multiple pandas dataframes using dicts (avoiding repetative code)

I have a dataframe of id numbers (n = 140, but it could be more or less) and I have 5 group leaders. Each group leader needs to be randomly assigned an amount of these ids (for ease lets make it even so n=28, but I need to be able to control the amounts) and those rows need to be split out into a new df and then droped from the original dataframe so that there is no crossover between leaders.

import pandas as pd
import numpy as np

#making the df
df = pd.DataFrame()
df['ids'] = np.random.randint(1, 140, size=140)
df['group_leader'] = ''


# list of leader names
leaders = ['John', 'Paul', 'George', 'Ringo', 'Apu']

I can do this for each leader with something like

df.loc[df.sample(n=28).index, 'group_leader'] = 'George'
g = df[df['group_leader']=='George'].copy()
df = df[df['group_leader] != 'George']
print(df.shape()[0]) #double checking that df has less ids in it

However, doing this individually for each group leaders seems really un-pythonic (not that I'm an expert on that) and is not easy to refactor into a function.

I thought that I might be able to do it with a dict and a for loop

frames = dict.fromkeys('group_leaders', pd.DataFrame())

for i in frames.keys(): #allows me to fill the cells with the string key?
    df.loc[df.sample(n=28).index, 'group_leader'] = str(i)
    frames[i].update(df[df['group_leader']== str(i)].copy())#also tried append()
    print(frames[i].head())
    df = df[df['group_leader'] != str(i)]
    print(f'df now has {df.shape[0]} ids left') #just in case there's a remainder of ids

However, the new dataframes are still empty and I get the error:

    Traceback (most recent call last):
  File "C:\Users\path\to\the\file\file.py", line 38, in <module>
    df.loc[df.sample(n=28).index, 'group_leader'] = str(i)
  File "C:\Users\path\to\the\file\pandas\core\generic.py", line 5356, in sample
    locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
  File "mtrand.pyx", line 909, in numpy.random.mtrand.RandomState.choice
ValueError: a must be greater than 0 unless no samples are taken

This leads me to believe that I'm doing two things wrong:

  1. Either making the dict incorectly or updating it incorrectly.
  2. Making the for loop run in such a way that it tries to run 1 too many times.

I have tried to be as clear as possible and present a minimally useful version of what I need, any help would be appreciated.

Note - I'm aware that 5 divides well into 140 and there may be cases where this isn't the case but I'm pretty sure I can handle that myself with if-else if it's needed.


Solution 1:

You can use np.repeat and np.random.shuffle:

leaders = ['John', 'Paul', 'George', 'Ringo', 'Apu']
leaders = np.repeat(leaders, 28)
np.random.shuffle(leaders)
df['group_leader'] = leaders

Output:

>>> df
     ids group_leader
0    138         John
1     36          Apu
2     99         John
3     91       George
4     58        Ringo
..   ...          ...
135   43        Ringo
136   84          Apu
137   94         John
138   56        Ringo
139   58         Paul

[140 rows x 2 columns]

>>> df.value_counts('group_leader')
group_leader
Apu       28
George    28
John      28
Paul      28
Ringo     28
dtype: int64

Update

df = pd.DataFrame({'ids': np.random.randint(1, 113, size=113)})

leaders = ['John', 'Paul', 'George', 'Ringo', 'Apu']
leaders = np.repeat(leaders, np.ceil(len(df) / len(leaders)))
np.random.shuffle(leaders)
df['group_leader'] = leaders[:len(df)]

Output:

>>> df.value_counts('group_leader')
group_leader
Apu       23
John      23
Ringo     23
George    22
Paul      22
dtype: int64