How to remove numbers from string terms in a pandas dataframe
I have a data frame similar to the one below:
Name Volume Value
May21 23 21321
James 12 12311
Adi22 11 4435
Hello 34 32454
Girl90 56 654654
I want the output to be in the format:
Name Volume Value
May 23 21321
James 12 12311
Adi 11 4435
Hello 34 32454
Girl 56 654654
I want to remove all the numbers from the Name column.
The closest I have come is doing it at the cell level with the following code:
result = ''.join([i for i in df['Name'][1] if not i.isdigit()])
Any idea how to do this in a better way at the Series/DataFrame level?
You can apply str.replace to the Name column in combination with a regular expression:
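import pandas as pd
# Example DataFrame
df = pd.DataFrame.from_dict({'Name' : ['May21', 'James', 'Adi22', 'Hello', 'Girl90'],
'Volume': [23, 12, 11, 34, 56],
'Value' : [21321, 12311, 4435, 32454, 654654]})
# Raw string avoids escape warnings; regex=True is required in pandas >= 2.0,
# where str.replace treats the pattern as a literal string by default.
df['Name'] = df['Name'].str.replace(r'\d+', '', regex=True)
print(df)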
Output:
Name Volume Value
0 May 23 21321
1 James 12 12311
2 Adi 11 4435
3 Hello 34 32454
4 Girl 56 654654
In the regular expression, \d stands for "any digit" and + stands for "one or more".
Thus, str.replace(r'\d+', '', regex=True) means: "Replace every run of digits in each string with nothing".
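To see that the pattern is not limited to trailing digits, here is a minimal sketch using the standard library's re.sub on plain strings. The sample values 'R2D2' and '7up' are made up for illustration and are not part of the question's data.

```python
import re

# Hypothetical sample values; two of them have digits outside the tail.
names = ['May21', 'James', 'Adi22', 'R2D2', '7up']

# \d+ matches each run of one or more digits, wherever it occurs in the string.
cleaned = [re.sub(r'\d+', '', name) for name in names]
print(cleaned)  # ['May', 'James', 'Adi', 'RD', 'up']
```

The same pattern behaves the same way through the .str accessor, so digits at the start or in the middle of a name are removed as well.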
You can do it like so:
df.Name = df.Name.str.replace(r'\d+', '', regex=True)
To experiment with the pattern, see the online regular expression demo: https://regex101.com/r/Y6gJny/2
Whatever is matched by the pattern \d+, i.e. one or more digits, will be replaced by an empty string.
Although the question sounds more general, the example input only contains trailing numbers. In this case you don't have to use regular expressions, since .rstrip (also available via the .str accessor of Series objects) can do exactly this:
import string
df['Name'] = df['Name'].str.rstrip(string.digits)
Similarly, you can use .lstrip to strip any digits from the start, or .strip to remove any digits from the start and the end of each string.
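The difference between the three methods can be sketched as follows. The Series values are hypothetical and chosen so that digits appear at the start, in the middle, and at the end; note that none of the strip variants touches digits in the middle.

```python
import string
import pandas as pd

# Hypothetical values with digits at the start, in the middle, and at the end.
s = pd.Series(['21May', 'Girl90', '7R2D2x9', 'James'])

right = s.str.rstrip(string.digits).tolist()  # trailing digits only
left = s.str.lstrip(string.digits).tolist()   # leading digits only
both = s.str.strip(string.digits).tolist()    # both ends, middle untouched

print(right)  # ['21May', 'Girl', '7R2D2x', 'James']
print(left)   # ['May', 'Girl90', 'R2D2x9', 'James']
print(both)   # ['May', 'Girl', 'R2D2x', 'James']
```

If digits inside the string must go too, the regular expression approach is the one to use.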
.str is not necessary. You can use pandas DataFrame.replace or Series.replace with the regex=True argument. This returns a new object and leaves the original unchanged:
df.replace(r'\d+', '', regex=True)
If you want to change the source DataFrame, use inplace=True:
df.replace(r'\d+', '', regex=True, inplace=True)
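One thing to keep in mind is that DataFrame.replace is applied to every string column, not just Name. The sketch below uses a made-up City column to show the difference between replacing on the whole frame and replacing on a single Series; the column and its values are assumptions for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['May21', 'Adi22'],
                   'City': ['Area51', 'Oslo'],  # hypothetical extra column
                   'Volume': [23, 11]})

# DataFrame.replace touches every string column, so City is changed too.
whole = df.replace(r'\d+', '', regex=True)

# Series.replace on a single column limits the change to Name.
only_name = df.assign(Name=df['Name'].replace(r'\d+', '', regex=True))

print(whole['City'].tolist())      # ['Area', 'Oslo']
print(only_name['City'].tolist())  # ['Area51', 'Oslo']
print(only_name['Name'].tolist())  # ['May', 'Adi']
```

Numeric columns such as Volume are left as they are in both cases, since the regular expression is only matched against string values.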