Trying to merge 2 dataframes but get ValueError
These are my two dataframes saved in two variables:
> print(df.head())
>
club_name tr_jan tr_dec year
0 ADO Den Haag 1368 1422 2010
1 ADO Den Haag 1455 1477 2011
2 ADO Den Haag 1461 1443 2012
3 ADO Den Haag 1437 1383 2013
4 ADO Den Haag 1386 1422 2014
> print(rankingdf.head())
>
club_name ranking year
0 ADO Den Haag 12 2010
1 ADO Den Haag 13 2011
2 ADO Den Haag 11 2012
3 ADO Den Haag 14 2013
4 ADO Den Haag 17 2014
I'm trying to merge these two using this code:
new_df = df.merge(ranking_df, on=['club_name', 'year'], how='left')
The how='left' is added because I have less datapoints in my ranking_df than in my standard df.
The expected behaviour is as such:
> print(new_df.head())
>
club_name tr_jan tr_dec year ranking
0 ADO Den Haag 1368 1422 2010 12
1 ADO Den Haag 1455 1477 2011 13
2 ADO Den Haag 1461 1443 2012 11
3 ADO Den Haag 1437 1383 2013 14
4 ADO Den Haag 1386 1422 2014 17
But I get this error:
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
But I do not wish to use concat since I want to merge the trees not just add them on.
Another behaviour that's weird in my mind is that my code works if I save the first df to .csv and then load that .csv into a dataframe.
The code for that:
df = pd.DataFrame(data_points, columns=['club_name', 'tr_jan', 'tr_dec', 'year'])
df.to_csv('preliminary.csv')
df = pd.read_csv('preliminary.csv', index_col=0)
ranking_df = pd.DataFrame(rankings, columns=['club_name', 'ranking', 'year'])
new_df = df.merge(ranking_df, on=['club_name', 'year'], how='left')
I think that it has to do with the index_col=0 parameter. But I have no idea to fix it without having to save it, it doesn't matter much but is kind of an annoyance that I have to do that.
Solution 1:
In one of your dataframes the year is a string and the other it is an int64
you can convert it first and then join (e.g. df['year']=df['year'].astype(int)
or as RafaelC suggested df.year.astype(int)
)
Edit: Also note the comment by Anderson Zhu: Just in case you have None
or missing values in one of your dataframes, you need to use Int64
instead of int
. See the reference here.
Solution 2:
I found that my dfs both had the same type column (str
) but switching from join
to merge
solved the issue.
Solution 3:
@Arnon Rotem-Gal-Oz answer is right for the most part. But I would like to point out the difference between df['year']=df['year'].astype(int)
and df.year.astype(int)
. df.year.astype(int)
returns a view of the dataframe and doesn't not explicitly change the type, atleast in pandas 0.24.2. df['year']=df['year'].astype(int)
explicitly change the type because it's an assignment. I would argue that this is the safest way to permanently change the dtype of a column.
Example:
df = pd.DataFrame({'Weed': ['green crack', 'northern lights', 'girl scout
cookies'], 'Qty':[10,15,3]})
df.dtypes
Weed object, Qty int64
df['Qty'].astype(str)
df.dtypes
Weed object, Qty int64
Even setting the inplace arg to True doesn't help at times. I don't know why this happens though. In most cases inplace=True equals an explicit assignment.
df['Qty'].astype(str, inplace = True)
df.dtypes
Weed object, Qty int64
Now the assignment,
df['Qty'] = df['Qty'].astype(str)
df.dtypes
Weed object, Qty object