Removing signs and repeating numbers

Solution 1:

One way is to use pandas.Series.str.partition followed by str.replace:

data["salary"].str.partition("/")[0].str.replace(r"[^\d-]+", "", regex=True)

Output:

0     26768-30136
1     26000-28000
2           21000
3     26768-30136
4              33
5     18500-20500
6     27500-30000
7     35000-40000
8     24000-27000
9     19000-24000
10    30000-35000
11    44000-66000
12          75-90
Name: 0, dtype: object

Explanation:

This assumes you are only interested in the part before the first /: it extracts everything up to that /, then removes anything that is not a digit or a hyphen.
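To make the two steps visible, here is a minimal sketch. The original salary column is not shown in the post, so the sample strings below are hypothetical and only meant to resemble it.

```python
import pandas as pd

# Hypothetical sample data; the real "salary" column is not shown in the post.
data = pd.DataFrame({"salary": [
    "£26,768 - £30,136/annum Attractive benefits package",
    "£21,000/annum",
    "£75 - £90/day",
]})

# str.partition returns a DataFrame with three columns:
# 0 = text before the first "/", 1 = the separator, 2 = everything after it.
parts = data["salary"].str.partition("/")
print(parts[0].tolist())

# Keep only digits and hyphens from the part before the "/".
cleaned = parts[0].str.replace(r"[^\d-]+", "", regex=True)
print(cleaned.tolist())
```

Because the result is column 0 of the partitioned frame, the resulting Series is named 0, which matches the "Name: 0" line in the output above.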

Solution 2:

You can use

data['salary'].str.split('/', n=1).str[0].replace(r'[^\d-]+', '', regex=True)
# 0     26768-30136
# 1     26000-28000
# 2           21000
# 3     26768-30136
# 4              33
# 5     18500-20500
# 6     27500-30000
# 7     35000-40000
# 8     24000-27000
# 9     19000-24000
# 10    30000-35000
# 11    44000-66000
# 12          75-90

Here,

  • .str.split('/', n=1) - splits each string into at most two parts at the first / character
  • .str[0] - gets the first item
  • .replace('[^\d-]+','', regex=True) - removes all chars other than digits and hyphens.
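The chain above can be tried on a few edge cases. This is only a sketch: the rows are hypothetical and chosen to show a string with digits after the first / and a string with no / at all.

```python
import pandas as pd

# Hypothetical rows, not taken from the original data set.
data = pd.DataFrame({"salary": [
    "£18,500 - £20,500/annum",
    "£33/hour, 40 hours/week",  # digits after the first "/" should be ignored
    "£21,000",                  # no "/" at all
]})

# Split once at the first "/" and keep the left-hand piece.
first_part = data["salary"].str.split("/", n=1).str[0]

# Series.replace with regex=True substitutes every match inside each string.
cleaned = first_part.replace(r"[^\d-]+", "", regex=True)
print(cleaned.tolist())
```

A row without a / is returned whole by the split, so it is cleaned just like the others.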

A more precise solution is to extract the £num(-£num)? pattern and remove all non-digits/hyphens:

data['salary'].str.extract(r'£(\d+(?:,\d+)*(?:\.\d+)?(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)?)')[0].str.replace(r'[^\d-]+', '', regex=True)

Details:

  • £ - a literal char
  • \d+(?:,\d+)*(?:\.\d+)? - one or more digits, followed by zero or more occurrences of a comma and one or more digits, and then an optional dot followed by one or more digits
  • (?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)? - an optional group: a hyphen surrounded by optional whitespace (\s*-\s*), then a £ char, and the number pattern described above.
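As a rough illustration of why the extraction is more precise, the sketch below runs the pattern on hypothetical rows that contain unrelated digits or no amount at all. The sample strings are assumptions, not data from the question.

```python
import pandas as pd

# Hypothetical rows, not taken from the original data set.
data = pd.DataFrame({"salary": [
    "£44,000 - £66,000 per annum (ref 1234)",  # stray digits after the range
    "Up to £21,000/annum",
    "Competitive",                              # no £ amount present
]})

pattern = r"£(\d+(?:,\d+)*(?:\.\d+)?(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)?)"

# str.extract returns a DataFrame with one column per capture group.
extracted = data["salary"].str.extract(pattern)[0]

# Strip commas, spaces and the second "£" from the captured range.
cleaned = extracted.str.replace(r"[^\d-]+", "", regex=True)
print(cleaned.tolist())
```

Rows with no £ amount come back as missing values rather than as an empty or misleading string. Note that the final replace also removes a decimal point, so an amount such as £10.50 would become 1050; if decimals matter, the character class would need to allow the dot.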