Using replace() to remove full sentences from text data in Python [duplicate]
I am trying to remove three sentences from paragraphs of text data. I have a pandas DataFrame whose rows hold paragraphs, and I want to remove the same three sentences from every row. For example,
import pandas as pd
df_1 = pd.DataFrame({"text": ["the dog is red. He goes outside and runs.",
                              "i like dogs because they are fun. i don't like that dogs bark at mailmen",
                              "dogs bark at mailmen and i think its funny."]})
custom_stopwords = ["the dog is red", "i like dogs", "dogs bark at mailmen"]
for i in custom_stopwords:
    df_1['text'] = df_1['text'].str.replace(i, '')
This method works on the example above, but it does not work on my actual data. The data I have is quite large, but I don't see why that would matter in this case. What happens is that some of my sentences are removed and others are not. For example, I am unable to remove the word "installation(s)" without escaping the parentheses with "\".
In pandas versions before 2.0, pandas.Series.str.replace has a default keyword argument of regex=True (from pandas 2.0 onward the default is regex=False), which means it treats the search strings as regular expressions, and your "installation(s)" is a valid one. You're trying to replace string literals, not regular expressions. Adding regex=False explicitly should work fine:
for i in custom_stopwords:
    df_1['text'] = df_1['text'].str.replace(i, '', regex=False)
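If looping over a large DataFrame once per phrase turns out to be slow, one possible alternative is to escape each phrase with re.escape and join them into a single pattern, so the column is scanned only once. This is a minimal sketch rather than a drop-in fix: the sample rows and the names stopwords, literal, and combined are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical sample data; one phrase contains regex metacharacters.
df = pd.DataFrame({"text": ["check the installation(s) today.",
                            "the dog is red. He goes outside and runs."]})
stopwords = ["installation(s)", "the dog is red"]

# Approach 1: one literal replacement per phrase (regex=False).
literal = df["text"].copy()
for phrase in stopwords:
    literal = literal.str.replace(phrase, "", regex=False)

# Approach 2: escape every phrase, join with "|", and replace in one pass.
pattern = "|".join(re.escape(phrase) for phrase in stopwords)
combined = df["text"].str.replace(pattern, "", regex=True)

print(literal.tolist())
print(combined.tolist())
```

Both approaches remove the phrases literally. With the combined pattern, the regex engine tries alternatives left to right, so if one phrase is a prefix of another it may be worth listing the longer phrase first.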
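Use str.replace with the argument regex=False. When the pattern is treated as a regular expression, (s) is interpreted as a capture group, in this specific case equal to the character s. The pattern "installation(s)" therefore matches the text "installations", not the literal text "installation(s)".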
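To see the difference outside pandas, here is a small sketch using only the standard library re module. The sample sentence and the variable names are invented for illustration.

```python
import re

text = "Check the installation(s) before use; two installations failed."

# As a regex, "(s)" is a group matching the letter "s", so the pattern
# matches "installations" and leaves the literal "installation(s)" alone.
as_regex = re.sub("installation(s)", "", text)

# re.escape turns the parentheses into literal characters.
as_literal = re.sub(re.escape("installation(s)"), "", text)

print(as_regex)
print(as_literal)
```

The first call removes the word "installations" and the second removes the literal "installation(s)", which is the behavior regex=False gives you in str.replace.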