How to implement a Boolean search with multiple columns in pandas
I have a pandas df and would like to accomplish something along these lines (in SQL terms):
SELECT * FROM df WHERE column1 = 'a' OR column2 = 'b' OR column3 = 'c' etc.
Now this works, for one column/value pair:
foo = df.loc[df['column']==value]
However, I'm not sure how to expand that to multiple column/value pairs.
- To be clear, each column matches a different value.
Solution 1:
You need to enclose multiple conditions in braces due to operator precedence and use the bitwise and (&
) and or (|
) operators:
foo = df[(df['column1']==value) | (df['columns2'] == 'b') | (df['column3'] == 'c')]
If you use and
or or
, then pandas is likely to moan that the comparison is ambiguous. In that case, it is unclear whether we are comparing every value in a series in the condition, and what does it mean if only 1 or all but 1 match the condition. That is why you should use the bitwise operators or the numpy np.all
or np.any
to specify the matching criteria.
There is also the query method: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.query.html
but there are some limitations mainly to do with issues where there could be ambiguity between column names and index values.
Solution 2:
A more concise--but not necessarily faster--method is to use DataFrame.isin()
and DataFrame.any()
In [27]: n = 10
In [28]: df = DataFrame(randint(4, size=(n, 2)), columns=list('ab'))
In [29]: df
Out[29]:
a b
0 0 0
1 1 1
2 1 1
3 2 3
4 2 3
5 0 2
6 1 2
7 3 0
8 1 1
9 2 2
[10 rows x 2 columns]
In [30]: df.isin([1, 2])
Out[30]:
a b
0 False False
1 True True
2 True True
3 True False
4 True False
5 False True
6 True True
7 False False
8 True True
9 True True
[10 rows x 2 columns]
In [31]: df.isin([1, 2]).any(1)
Out[31]:
0 False
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 True
9 True
dtype: bool
In [32]: df.loc[df.isin([1, 2]).any(1)]
Out[32]:
a b
1 1 1
2 1 1
3 2 3
4 2 3
5 0 2
6 1 2
8 1 1
9 2 2
[8 rows x 2 columns]