Iterating over a pandas DataFrame, returning all values not matching a regex
I am trying to iterate over a column to identify non-valid entries. This works:
weirdos = df.loc[df[column] == '7282'][['col1', 'col2']]
but trying the same with a regex, like
regex = "^[a-zA-Z]{2}[*]{1}[a-zA-Z0-9]{3}[*]{1}[a-zA-Z0-9*]{0,30}$"
weirdos = df.loc[re.search(regex, df[column]) is not None][['col1', 'col2']]
keeps raising TypeError: expected string or bytes-like object. Any hints?
re.search() expects a single string, but df[column] is a whole Series, hence the TypeError. Assuming column (which is not enclosed in quotes) is a string variable containing the name of the column to check, use:
weirdos = df.loc[~df[column].str.contains(regex)][['col1', 'col2']]
Note that you have to use str.contains() instead of str.match() to stay consistent with your original code using re.search(). This is because str.contains() uses re.search() under the hood, while str.match() uses re.match(), which looks for a match at the beginning of the text only.
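As a minimal sketch of that difference, the snippet below runs both methods on a small Series. The sample values and the pattern 'aa' are made up for illustration and are not taken from the question.

```python
import pandas as pd

# Hypothetical sample values, chosen so the two methods disagree on the first one.
s = pd.Series(["xaa", "aab", "bbb"])

# str.contains() behaves like re.search(): the pattern may match anywhere.
found_anywhere = s.str.contains(r"aa")

# str.match() behaves like re.match(): the pattern must match at the start.
found_at_start = s.str.match(r"aa")

print(found_anywhere.tolist())  # [True, True, False]
print(found_at_start.tolist())  # [False, True, False]
```

For a pattern that is already anchored with ^ and $, as in the question, both methods should select the same rows, so the choice matters mostly for unanchored patterns.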
The ~ is added to the filtering condition because your question title mentions NOT matching a regex. You can remove it if you intend to select the matching rows instead.
One reminder: define the regex as a raw string, i.e. regex = r'....', so that you don't need to double the backslashes in regex escape sequences.
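To illustrate the raw-string point, here is a small sketch comparing the two spellings of the same pattern. The pattern and the test string are hypothetical examples, not part of the question.

```python
import re

# Hypothetical pattern: digits, a literal dot, then more digits.
plain = "\\d+\\.\\d+"  # normal string: every backslash has to be doubled
raw = r"\d+\.\d+"      # raw string: backslashes are written once

# Both spellings produce the same pattern text.
print(plain == raw)  # True

print(re.search(raw, "version 3.14").group())  # 3.14
```

The regex in the question contains no backslashes, so it happens to work either way, but the raw form is the safer habit.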
Test Run
import pandas as pd
data = {'col_0': ['baa', 'bbc', 'ccd'], 'col1': [10, 20, 30], 'col2': [100, 200, 300]}
df = pd.DataFrame(data)
print(df)
Output:
  col_0  col1  col2
0   baa    10   100
1   bbc    20   200
2   ccd    30   300
regex = r'aa' # containing 'aa' anywhere in string
column = 'col_0'
# filtering those NOT containing 'aa' anywhere in string
weirdos = df.loc[~df[column].str.contains(regex)][['col1', 'col2']]
print(weirdos)
Output:
   col1  col2
1    20   200
2    30   300
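The same approach can be sketched with the regex from the question. The DataFrame below, including the column name 'code' and all of its values, is invented for illustration. The na=False argument is an addition not mentioned above: it treats missing values as non-matching, which keeps the mask boolean so that ~ works even when the column contains None or NaN.

```python
import pandas as pd

# Regex copied from the question, written as a raw string.
regex = r"^[a-zA-Z]{2}[*]{1}[a-zA-Z0-9]{3}[*]{1}[a-zA-Z0-9*]{0,30}$"

# Hypothetical sample data; only the regex comes from the question.
df = pd.DataFrame({
    "code": ["DE*ABC*E123", "7282", None, "FR*X1*"],
    "col1": [1, 2, 3, 4],
    "col2": [10, 20, 30, 40],
})
column = "code"

# na=False counts missing values as "no match", so they end up
# among the invalid rows after negation.
valid = df[column].str.contains(regex, na=False)

# Select the rows that do NOT match, keeping only col1 and col2.
weirdos = df.loc[~valid, ["col1", "col2"]]
print(weirdos)  # rows 1, 2 and 3
```

Passing the mask and the column list to a single df.loc call avoids the chained indexing used above, but the result should be the same.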