set difference for pandas
A simple pandas question:
Is there a drop_duplicates()
functionality to drop every row involved in the duplication?
An equivalent question is the following: Does pandas have a set difference for dataframes?
For example:
In [5]: df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
In [6]: df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
In [7]: df1
Out[7]:
col1 col2
0 1 2
1 2 3
2 3 4
In [8]: df2
Out[8]:
col1 col2
0 4 6
1 2 3
2 5 5
so maybe something like df2.set_diff(df1)
will produce this:
col1 col2
0 4 6
2 5 5
However, I don't want to rely on indexes because in my case, I have to deal with dataframes that have distinct indexes.
By the way, I initially thought about an extension of the current drop_duplicates()
method, but now I realize that the second approach using properties of set theory would be far more useful in general. Both approaches solve my current problem, though.
Thanks!
Solution 1:
Bit convoluted but if you want to totally ignore the index data. Convert the contents of the dataframes to sets of tuples containing the columns:
ds1 = set(map(tuple, df1.values))
ds2 = set(map(tuple, df2.values))
This step will get rid of any duplicates in the dataframes as well (index ignored)
set([(1, 2), (3, 4), (2, 3)]) # ds1
can then use set methods to find anything. Eg to find differences:
ds1.difference(ds2)
gives: set([(1, 2), (3, 4)])
can take that back to dataframe if needed. Note have to transform set to list 1st as set cannot be used to construct dataframe:
pd.DataFrame(list(ds1.difference(ds2)))
Solution 2:
Here's another answer that keeps the index and does not require identical indexes in two data frames. (EDIT: make sure there is no duplicates in df2 beforehand)
pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
It is fast and the result is
col1 col2
0 4 6
2 5 5
Solution 3:
from pandas import DataFrame
df1 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
print(df2[~df2.isin(df1).all(1)])
print(df2[(df2!=df1)].dropna(how='all'))
print(df2[~(df2==df1)].dropna(how='all'))
Solution 4:
Apply by the columns of the object you want to map (df2); find the rows that are not in the set (isin
is like a set operator)
In [32]: df2.apply(lambda x: df2.loc[~x.isin(df1[x.name]),x.name])
Out[32]:
col1 col2
0 4 6
2 5 5
Same thing, but include all values in df1, but still per column in df2
In [33]: df2.apply(lambda x: df2.loc[~x.isin(df1.values.ravel()),x.name])
Out[33]:
col1 col2
0 NaN 6
2 5 5
2nd example
In [34]: g = pd.DataFrame({'x': [1.2,1.5,1.3], 'y': [4,4,4]})
In [35]: g.columns=df1.columns
In [36]: g
Out[36]:
col1 col2
0 1.2 4
1 1.5 4
2 1.3 4
In [32]: g.apply(lambda x: g.loc[~x.isin(df1[x.name]),x.name])
Out[32]:
col1 col2
0 1.2 NaN
1 1.5 NaN
2 1.3 NaN
Note, in 0.13, there will be an isin
operator on the frame level, so something like: df2.isin(df1)
should be possible
Solution 5:
There are 3 methods which work, but two of them have some flaws.
Method 1 (Hash method):
It worked for all cases I tested.
df1.loc[:, "hash"] = df1.apply(lambda x: hash(tuple(x)), axis = 1)
df2.loc[:, "hash"] = df2.apply(lambda x: hash(tuple(x)), axis = 1)
df1 = df1.loc[~df1["hash"].isin(df2["hash"]), :]
Method 2 (Dict method):
It fails if DataFrames contain datetime columns.
df1 = df1.loc[~df1.isin(df2.to_dict(orient="list")).all(axis=1), :]
Method 3 (MultiIndex method):
I encountered cases when it failed on columns with None's or NaN's.
df1 = df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)