Pandas select rows and columns based on boolean condition
Solution 1:
Let's break down your problem. You want to
- Filter rows based on some boolean condition
- You want to select a subset of columns from the result.
For the first point, the condition you'd need is -
df["col_z"] < m
For the second requirement, you'd want to specify the list of columns that you need -
["col_x", "col_y"]
How would you combine these two to produce an expected output with pandas? The most straightforward way is using loc
-
df.loc[df["col_z"] < m, ["col_x", "col_y"]]
The first argument selects rows, and the second argument selects columns.
More About loc
Think of this in terms of the relational algebra operations - selection and projection. If you're from the SQL world, this would be a relatable equivalent. The above operation, in SQL syntax, would look like this -
SELECT col_x, col_y # projection on columns
FROM df
WHERE col_z < m # selection on rows
pandas
loc allows you to specify index labels for selecting rows. For example, if you have a dataframe -
col_x col_y
a 1 4
b 2 5
c 3 6
To select index a
, and c
, and col_x
you'd use -
df.loc[['a', 'c'], ['col_x']]
col_x
a 1
c 3
Alternatively, for selecting by a boolean condition (using a series/array of bool
values, as your original question asks), where all values in col_x
are odd -
df.loc[(df.col_x % 2).ne(0), ['col_y']]
col_y
a 4
c 6
For details, df.col_x % 2
computes the modulus of each value with respect to 2
. The ne(0)
will then compare the value to 0
, and return True
if it isn't (all odd numbers are selected like this). Here's what that expression results in -
(df.col_x % 2).ne(0)
a True
b False
c True
Name: col_x, dtype: bool
Further Reading
- 10 Minutes to Pandas - Selection by Label
-
Indexing and selecting data
- Boolean indexing
- Selection with .loc in python
- pandas loc vs. iloc vs. ix vs. at vs. iat?