Removing signs and repeating numbers
Solution 1:
Another way, using pandas.Series.str.partition with replace:
data["salary"].str.partition("/")[0].str.replace(r"[^\d-]+", "", regex=True)
Output:
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
Name: 0, dtype: object
Explanation:
This assumes that you are only interested in the part up to the first /; it extracts everything before that /, then removes anything but digits and hyphens.
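The two steps described above can be tried on a small sample. This is only a sketch: the original DataFrame is not shown in the thread, so the salary strings below are hypothetical inputs chosen to resemble the output listed earlier.

```python
import pandas as pd

# Hypothetical sample rows (assumption: the real data looks roughly like this).
data = pd.DataFrame({
    "salary": [
        "£26,768 - £30,136/annum Attractive benefits package",
        "£21,000/annum",
        "£33/hour",
    ]
})

# Step 1: partition on the first "/" and keep column 0 (the text before it).
before_slash = data["salary"].str.partition("/")[0]

# Step 2: drop every character that is not a digit or a hyphen.
cleaned = before_slash.str.replace(r"[^\d-]+", "", regex=True)

print(cleaned.tolist())  # ['26768-30136', '21000', '33']
```

Because str.partition returns a three-column DataFrame (before, separator, after), selecting column 0 is what gives the resulting Series the name 0 seen in the output.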
Solution 2:
You can use
data['salary'].str.split('/', n=1).str[0].replace(r'[^\d-]+', '', regex=True)
# 0 26768-30136
# 1 26000-28000
# 2 21000
# 3 26768-30136
# 4 33
# 5 18500-20500
# 6 27500-30000
# 7 35000-40000
# 8 24000-27000
# 9 19000-24000
# 10 30000-35000
# 11 44000-66000
# 12 75-90
Here,
- .str.split('/', n=1) - splits into two parts at the first / char
- .str[0] - gets the first item
- .replace(r'[^\d-]+', '', regex=True) - removes all chars other than digits and hyphens.
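To see each of these steps in isolation, here is a minimal sketch. The two input strings are made up for illustration (the first contains a second slash to show the effect of n=1); they are not taken from the original data.

```python
import pandas as pd

# Hypothetical inputs; the first one contains two "/" characters.
s = pd.Series([
    "£75 - £90/day, 1/2 day Fridays",
    "£18,500 - £20,500/annum",
])

# n=1 limits the split to the first "/", so each row yields at most two parts.
parts = s.str.split("/", n=1)
print(parts.iloc[0])  # ['£75 - £90', 'day, 1/2 day Fridays']

# .str[0] picks the first element of each list.
first = parts.str[0]

# Series.replace with regex=True substitutes inside each string.
cleaned = first.replace(r"[^\d-]+", "", regex=True)
print(cleaned.tolist())  # ['75-90', '18500-20500']
```

Without n=1 the first row would be split into three parts; .str[0] would still return the same first item, so the limit mainly avoids unnecessary splitting.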
A more precise solution is to extract the £num(-£num)? pattern and then remove all chars other than digits and hyphens:
data['salary'].str.extract(r'£(\d+(?:,\d+)*(?:\.\d+)?(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)?)')[0].str.replace(r'[^\d-]+', '', regex=True)
Details:
- £ - a literal char
- \d+(?:,\d+)*(?:\.\d+)? - one or more digits, followed by zero or more occurrences of a comma and one or more digits, and then an optional sequence of a dot and one or more digits
- (?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)? - an optional occurrence of a hyphen enclosed in zero or more whitespace chars (\s*-\s*), then a £ char, and the number pattern described above.
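The sketch below illustrates why the extract-based variant can be called more precise. The inputs are hypothetical: one row has no slash but carries a stray digit after the salary, and one row has no £ amount at all.

```python
import pandas as pd

# Hypothetical rows, not from the original data.
s = pd.Series([
    "£26,768 - £30,136/annum",
    "Up to £35,000 - £40,000 per annum (2 posts)",
    "Competitive",
])

pattern = r"£(\d+(?:,\d+)*(?:\.\d+)?(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)?)"

# Extract only the "£num" or "£num - £num" part, then strip commas, spaces and £.
extracted = s.str.extract(pattern)[0].str.replace(r"[^\d-]+", "", regex=True)
print(extracted.tolist())  # ['26768-30136', '35000-40000', nan]

# For comparison: the split-based approach keeps the stray "2" from "(2 posts)".
split_based = s.str.split("/", n=1).str[0].replace(r"[^\d-]+", "", regex=True)
print(split_based.iloc[1])  # 35000-400002
```

Rows without a £ amount come back as NaN from str.extract, which makes them easy to spot, whereas the split-based approach reduces such rows to an empty string.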