How to filter rows in pandas by regex
I would like to cleanly filter a dataframe using regex on one of the columns.
For a contrived example:
In [210]: foo = pd.DataFrame({'a' : [1,2,3,4], 'b' : ['hi', 'foo', 'fat', 'cat']})
In [211]: foo
Out[211]:
a b
0 1 hi
1 2 foo
2 3 fat
3 4 cat
I want to filter the rows to those that start with f
using a regex. First go:
In [213]: foo.b.str.match('f.*')
Out[213]:
0 []
1 ()
2 ()
3 []
That's not too terribly useful. However this will get me my boolean index:
In [226]: foo.b.str.match('(f.*)').str.len() > 0
Out[226]:
0 False
1 True
2 True
3 False
Name: b
So I could then do my restriction by:
In [229]: foo[foo.b.str.match('(f.*)').str.len() > 0]
Out[229]:
a b
1 2 foo
2 3 fat
That makes me artificially put a group into the regex though, and seems like maybe not the clean way to go. Is there a better way to do this?
Use contains instead:
In [10]: df.b.str.contains('^f')
Out[10]:
0 False
1 True
2 True
3 False
Name: b, dtype: bool
There is already a string handling function Series.str.startswith()
.
You should try foo[foo.b.str.startswith('f')]
.
Result:
a b
1 2 foo
2 3 fat
I think what you expect.
Alternatively you can use contains with regex option. For example:
foo[foo.b.str.contains('oo', regex= True, na=False)]
Result:
a b
1 2 foo
na=False
is to prevent Errors in case there is nan, null etc. values