How to avoid replacing more than needed with str.replace? [duplicate]
I have a dataframe which shows some text in one column and the language of the text in another column. Instead of having 'en', 'ru' etc. written in the language column, I have tried to turn these abbreviations into full words:
df['text'] = df['text'] \
.str.replace('en', 'English') \
.str.replace('ru', 'Russian') \
.str.replace('fr', 'French') \
.str.replace('tr', 'Turkish') \
.str.replace('es', 'Spanish')
# The number of languages goes on..
The issue, however, is that it also finds 'en', for example, inside other words (such as 'French'), which corrupts the output when I inspect the dataframe:
English 959874
Russian 419963
FrEnglishch 93797
Turkish 87225
Spanish 74120
PortuguSpanisHebrew 31627
# And so on..
How can I make the replacement apply only when 'en', for instance, stands alone in a cell, rather than matching it inside other words?
You might consider using `map` instead of `str.replace`, which should be more efficient in your case, as it only does dictionary lookups. You just define a dictionary that serves as a lookup table and pass it to the `map` function. For your example, that dictionary maps each short form to its long form. In code, that reads like:
my_map = {"en": "English", "ru": "Russian", ...}
df['text'] = df.text.map(my_map)
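One caveat worth knowing: `map` turns any value not found in the dictionary into `NaN`, so you may want to fall back to the original value with `fillna`. A minimal sketch, using a small hypothetical frame (the column name `text` is taken from the question):

```python
import pandas as pd

# Hypothetical sample data standing in for the real dataframe.
df = pd.DataFrame({"text": ["en", "ru", "fr", "en"]})

my_map = {"en": "English", "ru": "Russian", "fr": "French"}

# map replaces each cell with its dictionary value; values missing
# from the dict would become NaN, so keep the original via fillna.
df["text"] = df["text"].map(my_map).fillna(df["text"])
print(df["text"].tolist())  # → ['English', 'Russian', 'French', 'English']
```

Because `map` matches the whole cell value against the dictionary keys, it can never rewrite 'en' inside 'French'.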
You can use a regex replace:
df['text'] = df['text'].str.replace(r'^en$', 'English', regex=True)
In regex, `^` means start of string and `$` means end of string, so the pattern only matches when the cell is exactly 'en'. Note that in pandas 2.0+ you must pass `regex=True`, since `str.replace` defaults to literal matching there.
So you are basically saying: replace with 'English' only where the cell says 'en' from start to end.
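A small runnable sketch of the anchored replacement, again with hypothetical sample data:

```python
import pandas as pd

df = pd.DataFrame({"text": ["en", "French", "en route"]})

# ^en$ anchors the match to the whole cell, so 'en' inside other
# strings is left alone. regex=True is required in pandas 2.0+,
# where str.replace defaults to literal (non-regex) matching.
df["text"] = df["text"].str.replace(r"^en$", "English", regex=True)
print(df["text"].tolist())  # → ['English', 'French', 'en route']
```

You would repeat (or chain) this per language, which is why the dictionary-based `map` approach scales better when the list of languages grows.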