How to avoid replacing more than needed with str.replace? [duplicate]
I have a dataframe which shows some text in one column and the language of the text in another column. Instead of having 'en', 'ru' etc. written in the language column, I have tried to turn these abbreviations into full words:
df['text'] = df['text'] \
.str.replace('en', 'English') \
.str.replace('ru', 'Russian') \
.str.replace('fr', 'French') \
.str.replace('tr', 'Turkish') \
.str.replace('es', 'Spanish')
# The number of languages goes on..
The issue, however, is that it also finds 'en', for example, inside other words (such as 'French'), which corrupts the output when I inspect the dataframe:
English 959874
Russian 419963
FrEnglishch 93797
Turkish 87225
Spanish 74120
PortuguSpanisHebrew 31627
# And so on..
How can I make the replacement apply only when 'en', for instance, stands alone in a cell, rather than matching it inside other words?
You might consider using `map` instead of `str.replace`, which should be more efficient in your case, as it only does dictionary lookups. You just define a dictionary that serves as a lookup table and pass it to the `map` function. For your example, that dictionary maps each short form to its long form. In code, that reads like:
my_map = {"en": "English", "ru": "Russian", ...}
df['text'] = df.text.map(my_map)
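One caveat worth knowing: `map` turns any value not found in the dictionary into `NaN`, so you may want to fall back to the original value with `fillna`. A minimal sketch, using a small hypothetical frame (the column name `text` is taken from the question):

```python
import pandas as pd

# Hypothetical sample data standing in for the real dataframe.
df = pd.DataFrame({"text": ["en", "ru", "fr", "en"]})

my_map = {"en": "English", "ru": "Russian", "fr": "French"}

# map replaces each cell with its dictionary value; values missing
# from the dict would become NaN, so keep the original via fillna.
df["text"] = df["text"].map(my_map).fillna(df["text"])
print(df["text"].tolist())  # → ['English', 'Russian', 'French', 'English']
```

Because `map` matches the whole cell value against the dictionary keys, it can never rewrite 'en' inside 'French'.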
You can use a regex replace:
df['text'] = df['text'].str.replace(r'^en$', 'English', regex=True)
In regex, `^` means start of string and `$` means end of string, so the pattern only matches when the cell is exactly 'en'. Note that in pandas 2.0+ you must pass `regex=True`, since `str.replace` defaults to literal matching there.
So you are basically saying: replace with 'English' only where the cell says 'en' from start to end.
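A small runnable sketch of the anchored replacement, again with hypothetical sample data:

```python
import pandas as pd

df = pd.DataFrame({"text": ["en", "French", "en route"]})

# ^en$ anchors the match to the whole cell, so 'en' inside other
# strings is left alone. regex=True is required in pandas 2.0+,
# where str.replace defaults to literal (non-regex) matching.
df["text"] = df["text"].str.replace(r"^en$", "English", regex=True)
print(df["text"].tolist())  # → ['English', 'French', 'en route']
```

You would repeat (or chain) this per language, which is why the dictionary-based `map` approach scales better when the list of languages grows.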