How can I quickly create a new DataFrame column based on values from another DataFrame?
I want to create a new column, df_cau2['continent']. First, here are my two DataFrames:
country_continent
                Continent
Country
Afghanistan     Asia
Albania         Europe
Algeria         Africa
American Samoa  Oceania
and
df_cau2
   date        home_team  away_team  home_score  away_score  tournament  city     country   neutral
0  1872-11-30  Scotland   England    0           0           Friendly    Glasgow  Scotland  False
1  1873-03-08  England    Scotland   4           2           Friendly    London   England   False
2  1874-03-07  Scotland   England    2           1           Friendly    Glasgow  Scotland  False
To create the new continent column, I use apply on df_cau2 like this:
def same_continent(home, away):
    # Return the shared continent, or the string 'None' if the teams differ
    if country_continent.loc[home].Continent == country_continent.loc[away].Continent:
        return country_continent.loc[home].Continent
    return 'None'

df_cau2['continent'] = df_cau2.apply(lambda x: same_continent(x['home_team'], x['away_team']), axis=1)
df_cau2.head()
With the 39480 rows in df_cau2, this code runs too slowly. How can I change it to run faster? I am thinking about using np.select, but I don't know how to use it in this case.
This is the result that I want:
       date        home_team  away_team  home_score  away_score  tournament  city       country   neutral  continent
7611   1970-09-11  Iran       Turkey     1           1           Friendly    Teheran    Iran      False    None
31221  2009-03-11  Nepal      Pakistan   1           0           Friendly    Kathmandu  Nepal     False    Asia
32716  2010-11-17  Colombia   Peru       1           1           Friendly    Bogotá     Colombia  False    South America
Thanks
Solution 1:
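IIUC, you want to set the continent column only if the home_team and away_team columns are in the same continent:
# df2.squeeze() turns the one-column lookup frame into a Series indexed by Country
home_continent = df1['home_team'].map(df2.squeeze())
away_continent = df1['away_team'].map(df2.squeeze())
# Boolean mask: True where both teams map to the same continent
m = home_continent == away_continent
# Rows outside the mask are left as NaN
df1.loc[m, 'continent'] = home_continent.loc[m]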
print(df1)
# Output
home_team away_team continent
0 Canada England NaN
1 France Spain Europe
2 China Japan Asia
Setup of an MRE:
import pandas as pd

df1 = pd.DataFrame({'home_team': ['Canada', 'France', 'China'],
                    'away_team': ['England', 'Spain', 'Japan']})
print(df1)
df2 = pd.DataFrame({'Country': ['Canada', 'China', 'England',
                                'France', 'Japan', 'Spain'],
                    'Continent': ['North America', 'Asia', 'Europe',
                                  'Europe', 'Asia', 'Europe']}).set_index('Country')
print(df2)
# Output df1
home_team away_team
0 Canada England
1 France Spain
2 China Japan
# Output df2
Continent
Country
Canada North America
China Asia
England Europe
France Europe
Japan Asia
Spain Europe
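The question's expected output uses the string 'None' rather than NaN when the continents differ. A minimal sketch of the same map-based idea with Series.where to fill that default follows; the sample rows are hypothetical stand-ins for the asker's country_continent and df_cau2 frames, not the real data.

```python
import pandas as pd

# Hypothetical sample data mirroring the shape of the question's two frames.
country_continent = pd.DataFrame(
    {"Continent": ["Asia", "Europe", "Asia", "Asia"]},
    index=pd.Index(["Iran", "Turkey", "Nepal", "Pakistan"], name="Country"),
)
df_cau2 = pd.DataFrame(
    {"home_team": ["Iran", "Nepal"], "away_team": ["Turkey", "Pakistan"]}
)

# Vectorised lookups: one continent Series per side, aligned with df_cau2.
lookup = country_continent["Continent"]
home = df_cau2["home_team"].map(lookup)
away = df_cau2["away_team"].map(lookup)

# Keep the home continent where both sides match, otherwise the string 'None'.
# Teams missing from the lookup map to NaN, and NaN == NaN is False,
# so those rows also end up as 'None'.
df_cau2["continent"] = home.where(home == away, "None")
print(df_cau2)
```

Because map and where operate on whole columns, this avoids the row-by-row Python calls that make apply with axis=1 slow.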
Solution 2:
Consider merging the continent lookup data frame twice to create home and away continent columns. Since you will then have both continents, assign the new shared continent column conditionally with numpy.where:
import numpy as np

df_cau2 = (
    df_cau2.merge(
        country_continent.reset_index(),
        left_on = "home_team",
        right_on = "Country",
        how = "left"
    ).merge(
        country_continent.reset_index(),
        left_on = "away_team",
        right_on = "Country",
        how = "left",
        # Country/Continent from the first merge get "_home", the new ones "_away"
        suffixes = ("_home", "_away")
    )
)
df_cau2["shared_continent"] = np.where(
    df_cau2["Continent_home"].eq(df_cau2["Continent_away"]),
    df_cau2["Continent_home"],
    np.nan
)
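Since the question asks about np.select specifically, here is a hedged sketch of how it could be used once both continents are available as columns. The sample data, the made-up team "Atlantis", and the extra "Unknown" label for teams missing from the lookup are assumptions added for illustration; they are not part of the original question or answers.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup and match data; "Atlantis" is deliberately not in the lookup.
lookup = pd.Series(
    {"Iran": "Asia", "Turkey": "Europe", "Nepal": "Asia", "Pakistan": "Asia"}
)
matches = pd.DataFrame(
    {
        "home_team": ["Iran", "Nepal", "Atlantis"],
        "away_team": ["Turkey", "Pakistan", "Nepal"],
    }
)

home = matches["home_team"].map(lookup)
away = matches["away_team"].map(lookup)

# Conditions are checked in order; the first one that is True wins for each row.
conditions = [
    home.isna() | away.isna(),  # at least one team has no known continent
    home == away,               # both teams share a continent
]
choices = ["Unknown", home]

# Rows matching no condition fall back to the string 'None'.
matches["continent"] = np.select(conditions, choices, default="None")
print(matches)
```

With only one condition and one fallback, np.where or Series.where is simpler; np.select becomes useful when there are several mutually exclusive cases like the ones above.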