How to efficiently assign unique ID to individuals with multiple entries based on name in very large df
This approach uses .groupby() and .ngroup() (new in Pandas 0.20.2) to create the id column:
df['id'] = df.groupby(['LastName','FirstName']).ngroup()
>>> df
FirstName LastName id
0 Tom Jones 0
1 Tom Jones 0
2 David Smith 1
3 Alex Thompson 2
4 Alex Thompson 2
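For anyone who wants to reproduce the output above, here is a minimal self-contained sketch. The sample frame is assumed to be the same five rows that appear elsewhere in this thread.

```python
import pandas as pd

# Sample data assumed from the other answers in this thread.
df = pd.DataFrame({
    "FirstName": ["Tom", "Tom", "David", "Alex", "Alex"],
    "LastName": ["Jones", "Jones", "Smith", "Thompson", "Thompson"],
})

# ngroup() labels each (LastName, FirstName) group with an integer.
# With groupby's default sort=True, labels follow the sorted group keys.
df["id"] = df.groupby(["LastName", "FirstName"]).ngroup()
print(df)
```

Rows that share both name columns land in the same group and so receive the same id.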
I checked timings and, for the small dataset in this example, Alexander's answer is faster:
%timeit df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
1000 loops, best of 3: 848 µs per loop
%timeit df.assign(id=df.groupby(['LastName','FirstName']).ngroup())
1000 loops, best of 3: 1.22 ms per loop
However, for larger dataframes, the groupby() approach appears to be faster. To create a large, representative data set, I used faker to create a dataframe of 5000 names and then concatenated the first 2000 names to this dataframe to make a dataframe with 7000 names, 2000 of which were duplicates.
import faker
import pandas as pd
fakenames = faker.Faker()
first = [ fakenames.first_name() for _ in range(5000) ]
last = [ fakenames.last_name() for _ in range(5000) ]
df2 = pd.DataFrame({'FirstName':first, 'LastName':last})
df2 = pd.concat([df2, df2.iloc[:2000]])
Running the timing on this larger data set gives:
%timeit df2.assign(id=(df2['LastName'] + '_' + df2['FirstName']).astype('category').cat.codes)
100 loops, best of 3: 5.22 ms per loop
%timeit df2.assign(id=df2.groupby(['LastName','FirstName']).ngroup())
100 loops, best of 3: 3.1 ms per loop
You may want to test both approaches on your data set to determine which one works best given the size of your data.
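If you are not in IPython, a rough comparison can be sketched with the standard timeit module. Everything about the data here is an assumption for illustration: the name pools, the row count, and the helper names with_category and with_ngroup are hypothetical, and the names are drawn with NumPy rather than faker so the snippet has no extra dependency.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in data: small name pools so that duplicates are common.
first_pool = np.array(["Tom", "David", "Alex", "Maria", "Chen", "Aisha"])
last_pool = np.array(["Jones", "Smith", "Thompson", "Garcia", "Li", "Khan"])
n = 100_000
big = pd.DataFrame({
    "FirstName": rng.choice(first_pool, size=n),
    "LastName": rng.choice(last_pool, size=n),
})

def with_category(frame):
    # Concatenate the name columns and use the category codes as ids.
    key = frame["LastName"] + "_" + frame["FirstName"]
    return key.astype("category").cat.codes

def with_ngroup(frame):
    # Number each (LastName, FirstName) group.
    return frame.groupby(["LastName", "FirstName"]).ngroup()

for fn in (with_category, with_ngroup):
    seconds = timeit.timeit(lambda: fn(big), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per call")
```

Swap big for your own dataframe to see which approach wins at your data size; the result may also depend on your pandas version and on how many distinct names there are.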
You could join the last name and first name, convert it to a category, and then get the codes.
Of course, multiple people with the same name would have the same id.
df = df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
>>> df
FirstName LastName id
0 Tom Jones 0
1 Tom Jones 0
2 David Smith 1
3 Alex Thompson 2
4 Alex Thompson 2
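One thing to watch for with the concatenation approach is missing names. The sketch below uses a made-up frame (people is a hypothetical name) with one missing last name: the concatenated key is missing for that row, and category codes mark missing values with -1.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing last name.
people = pd.DataFrame({
    "FirstName": ["Tom", "Tom", "David", "Alex"],
    "LastName": ["Jones", "Jones", np.nan, "Thompson"],
})

# A missing value in either column makes the concatenated key missing too.
key = people["LastName"] + "_" + people["FirstName"]
people["id"] = key.astype("category").cat.codes

# Category codes use -1 for missing values, so these rows are easy to find.
has_missing_name = people["id"].eq(-1)
print(people[has_missing_name])
```

All rows with a missing name share the code -1, so you may want to handle them separately before treating the column as a unique id.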
This method allows the 'id' column name to be set from a variable. I also find it a little easier to read than the assign or groupby methods.
import pandas as pd
# Create the DataFrame
df = pd.DataFrame(
{'FirstName': ['Tom','Tom','David','Alex','Alex'],
'LastName': ['Jones','Jones','Smith','Thompson','Thompson'],
})
newIdName = 'id' # Set new name here.
df[newIdName] = (df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes
Output:
>>> df
FirstName LastName id
0 Tom Jones 0
1 Tom Jones 0
2 David Smith 1
3 Alex Thompson 2
4 Alex Thompson 2