How to delete rows from a pandas DataFrame based on a conditional expression [duplicate]
I have a pandas DataFrame and I want to delete rows from it where the length of the string in a particular column is greater than 2.
I expect to be able to do this (per this answer):
df[(len(df['column name']) < 2)]
but I just get the error:
KeyError: u'no item named False'
What am I doing wrong?
(Note: I know I can use df.dropna() to get rid of rows that contain any NaN, but I didn't see how to remove rows based on a conditional expression.)
Solution 1:
To directly answer this question's original title, "How to delete rows from a pandas DataFrame based on a conditional expression" (which, I understand, is not necessarily the OP's problem, but could help other users coming across this question), one way to do this is to use the drop method:
df = df.drop(some_labels)  # general form: drop rows by their index labels
df = df.drop(df[<some boolean condition>].index)
Example
To remove all rows where column 'score' is < 50:
df = df.drop(df[df.score < 50].index)
In place version (as pointed out in comments)
df.drop(df[df.score < 50].index, inplace=True)
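For a self-contained demonstration, here is a minimal runnable sketch (the toy DataFrame and its values are assumptions for illustration, not from the question):
import pandas as pd
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [95, 30, 72]})  # toy data
df = df.drop(df[df.score < 50].index)  # removes the row with score 30
print(df)  # rows with scores 95 and 72 remain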
Multiple conditions
(see Boolean Indexing)
The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.
To remove all rows where column 'score' is < 50 and > 20 (i.e., strictly between 20 and 50):
df = df.drop(df[(df.score < 50) & (df.score > 20)].index)
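As a hedged, runnable sketch of the combined condition (toy values assumed):
import pandas as pd
df = pd.DataFrame({'score': [10, 30, 45, 60]})  # toy data
# drop rows where 20 < score < 50, i.e. the rows with 30 and 45
df = df.drop(df[(df.score < 50) & (df.score > 20)].index)
print(df.score.tolist())  # [10, 60]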
Solution 2:
When you do len(df['column name']) you are just getting one number, namely the number of rows in the DataFrame (i.e., the length of the column itself). If you want to apply len to each element in the column, use df['column name'].map(len). So try:
df[df['column name'].map(len) < 2]
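A self-contained sketch of this fix (the column contents are assumed for illustration):
import pandas as pd
df = pd.DataFrame({'column name': ['ab', 'abcd', 'x']})  # toy data
print(df[df['column name'].map(len) < 2])  # keeps only the row containing 'x'
For string columns, df['column name'].str.len() computes the same lengths and also tolerates NaN entries (yielding NaN, which fails the comparison) instead of raising a TypeError.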
Solution 3:
You can assign the DataFrame to a filtered version of itself:
df = df[df.score > 50]
This is faster than drop:
%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test[test.x < 0]
# 54.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test.drop(test[test.x > 0].index, inplace=True)
# 201 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test.drop(test[test.x > 0].index)
# 194 ms ± 7.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
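Note the relationship between the two idioms: drop names the rows to discard, while filtering names the rows to keep, and negating the boolean mask with ~ converts one into the other. A sketch with an assumed toy frame:
import pandas as pd
test = pd.DataFrame({'x': [-1.0, 0.5, -0.2, 2.0]})  # toy data
mask = test.x > 0                           # rows to discard
kept_by_filter = test[~mask]                # keep the complement directly
kept_by_drop = test.drop(test[mask].index)  # drop the flagged rows
print(kept_by_filter.equals(kept_by_drop))  # True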
Solution 4:
I will expand on @User's generic solution to provide a drop-free alternative. This is for folks directed here based on the question's title (not the OP's problem).
Say you want to delete all rows with negative values. A one-liner solution is:
df = df[(df > 0).all(axis=1)]
Step-by-step explanation:
Let's generate a 5x5 DataFrame drawn from a random normal distribution:
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,5), columns=list('ABCDE'))
A B C D E
0 1.764052 0.400157 0.978738 2.240893 1.867558
1 -0.977278 0.950088 -0.151357 -0.103219 0.410599
2 0.144044 1.454274 0.761038 0.121675 0.443863
3 0.333674 1.494079 -0.205158 0.313068 -0.854096
4 -2.552990 0.653619 0.864436 -0.742165 2.269755
Let the condition be deleting negatives. Here is a boolean DataFrame satisfying the condition:
df > 0
A B C D E
0 True True True True True
1 False True False False True
2 True True True True True
3 True True False True False
4 False True True False True
A boolean Series for all rows satisfying the condition. Note that if any element in a row fails the condition, the row is marked False:
(df > 0).all(axis=1)
0 True
1 False
2 True
3 False
4 False
dtype: bool
Finally, filter out rows from the DataFrame based on the condition:
df[(df > 0).all(axis=1)]
A B C D E
0 1.764052 0.400157 0.978738 2.240893 1.867558
2 0.144044 1.454274 0.761038 0.121675 0.443863
You can assign it back to df to actually delete the rows (vs. the filtering done above):
df = df[(df > 0).all(axis=1)]
This can easily be extended to filter out rows containing NaNs (non-numeric entries):
df = df[(~df.isnull()).all(axis=1)]
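As the question itself notes, the NaN case is also covered by the built-in df.dropna(); on toy data the two forms agree (values assumed for illustration):
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1.0, np.nan], 'B': [2.0, 3.0]})  # toy data
print(df[(~df.isnull()).all(axis=1)].equals(df.dropna()))  # True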
This can also be simplified for cases like deleting all rows where column E is negative:
df = df[(df.E>0)]
I would like to end with some profiling stats on why the drop solution from @User's answer is slower than raw column-based filtration (in the timings below, dft is assumed to be a copy of df):
%timeit df_new = df[(df.E>0)]
345 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dft.drop(dft[dft.E < 0].index, inplace=True)
890 µs ± 94.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
A column is basically a Series, i.e., a NumPy array, and it can be indexed without any cost. For folks interested in how the underlying memory organization plays into execution speed, here is a great link on speeding up pandas: