How to "select distinct" across multiple data frame columns in pandas?
I'm looking for a way to do the equivalent of the SQL
SELECT DISTINCT col1, col2 FROM dataframe_table
The pandas SQL comparison doesn't have anything about distinct. The .unique() method only works for a single column, so I suppose I could concat the columns, or put them in a list/tuple and compare that way, but this seems like something pandas should do in a more native way.
Am I missing something obvious, or is there no way to do this?
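For illustration, the kind of workaround I have in mind looks something like this (just a rough sketch; col1 and col2 are placeholder column names):
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1], 'col2': ['x', 'y', 'x']})

# Workaround idea: turn each row's pair of values into a tuple, then deduplicate
unique_pairs = set(zip(df['col1'], df['col2']))
print(unique_pairs)  # {(1, 'x'), (2, 'y')} (set order may vary)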
You can use the drop_duplicates method to get the unique rows in a DataFrame:
In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})
In [30]: df
Out[30]:
   a  b
0  1  3
1  2  4
2  1  3
3  2  5
In [32]: df.drop_duplicates()
Out[32]:
   a  b
0  1  3
1  2  4
3  2  5
You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.
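For instance, with the same df as above, restricting the check to column 'a' keeps only the first row seen for each value of 'a' (a small illustrative sketch, not from the original session):
df.drop_duplicates(subset=['a'])
#    a  b
# 0  1  3
# 1  2  4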
I've tried different solutions. The first was:
import numpy as np
a_df = np.unique(df[['col1', 'col2']], axis=0)
and it works well for non-object data. Another way to do this, which also avoids the error for object-dtype columns, is to apply drop_duplicates():
a_df=df.drop_duplicates(['col1','col2'])[['col1','col2']]
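An equivalent, arguably more direct spelling (same placeholder column names, assuming you only need those two columns in the result) is to select the columns first and then drop duplicates:
a_df = df[['col1', 'col2']].drop_duplicates()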
You can also use SQL to do this, but it was very slow in my case:
from pandasql import sqldf

# sqldf runs the query against DataFrames found in the supplied namespace (here, globals())
q = """SELECT DISTINCT col1, col2 FROM df;"""
pysqldf = lambda q: sqldf(q, globals())
a_df = pysqldf(q)