How to filter rows containing a string pattern from a Pandas dataframe [duplicate]
Assume we have a data frame in Python Pandas that looks like this:
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': [u'aball', u'bball', u'cnut', u'fball']})
Or, in table form:
ids vals
aball 1
bball 2
cnut 3
fball 4
How do I filter rows which contain the key word "ball?" For example, the output should be:
ids vals
aball 1
bball 2
fball 4
Solution 1:
In [3]: df[df['ids'].str.contains("ball")]
Out[3]:
ids vals
0 aball 1
1 bball 2
3 fball 4
Solution 2:
df[df['ids'].str.contains('ball', na = False)] # valid for (at least) pandas version 0.17.1
Step-by-step explanation (from inner to outer):
-
df['ids']
selects theids
column of the data frame (technically, the objectdf['ids']
is of typepandas.Series
) -
df['ids'].str
allows us to apply vectorized string methods (e.g.,lower
,contains
) to the Series -
df['ids'].str.contains('ball')
checks each element of the Series as to whether the element value has the string 'ball' as a substring. The result is a Series of Booleans indicatingTrue
orFalse
about the existence of a 'ball' substring. -
df[df['ids'].str.contains('ball')]
applies the Boolean 'mask' to the dataframe and returns a view containing appropriate records. -
na = False
removes NA / NaN values from consideration; otherwise a ValueError may be returned.
Solution 3:
>>> mask = df['ids'].str.contains('ball')
>>> mask
0 True
1 True
2 False
3 True
Name: ids, dtype: bool
>>> df[mask]
ids vals
0 aball 1
1 bball 2
3 fball 4
Solution 4:
If you want to set the column you filter on as a new index, you could also consider to use .filter
; if you want to keep it as a separate column then str.contains
is the way to go.
Let's say you have
df = pd.DataFrame({'vals': [1, 2, 3, 4, 5], 'ids': [u'aball', u'bball', u'cnut', u'fball', 'ballxyz']})
ids vals
0 aball 1
1 bball 2
2 cnut 3
3 fball 4
4 ballxyz 5
and your plan is to filter all rows in which ids
contains ball
AND set ids
as new index, you can do
df.set_index('ids').filter(like='ball', axis=0)
which gives
vals
ids
aball 1
bball 2
fball 4
ballxyz 5
But filter
also allows you to pass a regex, so you could also filter only those rows where the column entry ends with ball
. In this case you use
df.set_index('ids').filter(regex='ball$', axis=0)
vals
ids
aball 1
bball 2
fball 4
Note that now the entry with ballxyz
is not included as it starts with ball
and does not end with it.
If you want to get all entries that start with ball
you can simple use
df.set_index('ids').filter(regex='^ball', axis=0)
yielding
vals
ids
ballxyz 5
The same works with columns; all you then need to change is the axis=0
part. If you filter based on columns, it would be axis=1
.