Python, remove all non-alphabet chars from string
I am writing a python MapReduce word count program. Problem is that there are many non-alphabet chars strewn about in the data, I have found this post Stripping everything but alphanumeric chars from a string in Python which shows a nice solution using regex, but I am not sure how to implement it
def mapfn(k, v):
print v
import re, string
pattern = re.compile('[\W_]+')
v = pattern.match(v)
print v
for w in v.split():
yield w, 1
I'm afraid I am not sure how to use the library re
or even regex for that matter. I am not sure how to apply the regex pattern to the incoming string (line of a book) v
properly to retrieve the new line without any non-alphanumeric chars.
Suggestions?
Solution 1:
Use re.sub
import re
regex = re.compile('[^a-zA-Z]')
#First parameter is the replacement, second parameter is your input string
regex.sub('', 'ab3d*E')
#Out: 'abdE'
Alternatively, if you only want to remove a certain set of characters (as an apostrophe might be okay in your input...)
regex = re.compile('[,\.!?]') #etc.
Solution 2:
If you prefer not to use regex, you might try
''.join([i for i in s if i.isalpha()])
Solution 3:
You can use the re.sub() function to remove these characters:
>>> import re
>>> re.sub("[^a-zA-Z]+", "", "ABC12abc345def")
'ABCabcdef'
re.sub(MATCH PATTERN, REPLACE STRING, STRING TO SEARCH)
-
"[^a-zA-Z]+"
- look for any group of characters that are NOT a-zA-z. -
""
- Replace the matched characters with ""
Solution 4:
Try:
s = ''.join(filter(str.isalnum, s))
This will take every char from the string, keep only alphanumeric ones and build a string back from them.
Solution 5:
The fastest method is regex
#Try with regex first
t0 = timeit.timeit("""
s = r2.sub('', st)
""", setup = """
import re
r2 = re.compile(r'[^a-zA-Z0-9]', re.MULTILINE)
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)
#Try with join method on filter
t0 = timeit.timeit("""
s = ''.join(filter(str.isalnum, st))
""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""",
number = 1000000)
print(t0)
#Try with only join
t0 = timeit.timeit("""
s = ''.join(c for c in st if c.isalnum())
""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)
2.6002226710006653 Method 1 Regex
5.739747313000407 Method 2 Filter + Join
6.540099570000166 Method 3 Join