create new column of dataframe base on value of another dataframe run fast?

i want to create a new columns for my df_cau2['continent']. first there r 2 df of mine:

country_continent
    Continent
Country 
Afghanistan Asia
Albania Europe
Algeria Africa
American Samoa  Oceania

and
df_cau2 
    date    home_team   away_team   home_score  away_score  tournament  city    country neutral
0   1872-11-30  Scotland    England 0   0   Friendly    Glasgow Scotland    False
1   1873-03-08  England Scotland    4   2   Friendly    London  England False
2   1874-03-07  Scotland    England 2   1   Friendly    Glasgow Scotland    False

to create new column continent i use apply for df_cau2 like this:


def same_continent(home,away):
    if country_continent.loc[home].Continent == country_continent.loc[away].Continent:
        return country_continent.loc[home].Continent
    return 'None'

df_cau2['continent']=df_cau2.apply(lambda x: same_continent(x['home_team'],x['away_team']),axis=1)
df_cau2.head()

with 39480 rows of df_cau2, this code run too slow, how can i change my code to run it's faster? i am thinking about using np.select but i don't know how to use it's in this case.

This is result that i want:

date    home_team   away_team   home_score  away_score  tournament  city    country neutral continent
7611    1970-09-11  Iran    Turkey  1   1   Friendly    Teheran Iran    False   None
31221   2009-03-11  Nepal   Pakistan    1   0   Friendly    Kathmandu   Nepal   False   Asia
32716   2010-11-17  Colombia    Peru    1   1   Friendly    Bogotá  Colombia    False   South America

Thanks

Solution 1:

IIUC, you want to set continent column only if home_team and away_team columns are in the same continent:

home_continent = df1['home_team'].map(df2.squeeze())
away_continent = df1['away_team'].map(df2.squeeze())
m = home_continent == away_continent
df1.loc[m, 'continent'] = home_continent.loc[m]
print(df1)

# Output
  home_team away_team continent
0    Canada   England       NaN
1    France     Spain    Europe
2     China     Japan      Asia

Setup a MRE

df1 = pd.DataFrame({'home_team': ['Canada', 'France', 'China'],
                    'away_team': ['England', 'Spain', 'Japan']})
print(df1)

df2 = pd.DataFrame({'Country': ['Canada', 'China', 'England',
                                'France', 'Japan', 'Spain'],
                    'Continent': ['North America', 'Asia', 'Europe',
                                  'Europe', 'Asia', 'Europe']}).set_index('Country')
print(df2)

# Output df1
  home_team away_team
0    Canada   England
1    France     Spain
2     China     Japan

# Output df2
             Continent
Country               
Canada   North America
China             Asia
England         Europe
France          Europe
Japan             Asia
Spain           Europe

Solution 2:

Consider merge of the continent lookup data frame twice to create home and away continent columns. And since you will have both continents, assign new shared continent column conditionally with numpy.where:

df_cau2 = (
    df.cau2.merge(
        country_continent.reset_index(),
        left_on = "home_team",
        right_on = "Country",
        how = "left"
    ).merge(
        country_continent.reset_index(),
        left_on = "away_team",
        right_on = "Country",
        how = "left",
        suffixes = ["_home", "_away"]
    )
)

df_cau2["shared_continent"] = np.where(
    df_cau2["Continent_home"].eq(df_cau2["Continent_away"]),
    df_cau2["Continent_home"],
    np.nan
)

Does anyone know how to create and delete a category with channels in Discord.js?

How to use Gekko to solve for optimal control for a reusable reentry vehicle

Entity Framework Core Database Table Valued Functions Mapping

How do we extract the data of specific row from excel data to pysimplegui input fileds

!Join in Serverless Framework is broken

selenium.UnsupportedCommandException: the requested resource could not be found, or a request

Using python how can I pass data between child and main processes with Queue?

Abort this node-fetch request without aborting all other requests

Given an array of integers, find the pair of adjacent elements that has the largest product and return that product

TypeError: this.player.sprites is not a function

Why someone prefer to define class method as `module SomeModule extend self` but not `class SomeClass ; def self.method`?

how to create conditional list in ansible task