Remove non-ASCII characters from pandas column
I have been trying to work on this issue for a while.I am trying to remove non ASCII characters form DB_user column and trying to replace them with spaces. But I keep getting some errors. This is how my data frame looks:
+----------------------------------------------------------- | DB_user source count | +----------------------------------------------------------- | ???/"Ò|Z?)?]??C %??J A 10 | | ?D$ZGU ;@D??_???T(?) B 3 | | ?Q`H??M'?Y??KTK$?Ù‹???ЩJL4??*?_?? C 2 | +-----------------------------------------------------------
I was using this function, which I had come across while researching the problem on SO.
def filter_func(string):
for i in range(0,len(string)):
if (ord(string[i])< 32 or ord(string[i])>126
break
return ''
And then using the apply function:
df['DB_user'] = df.apply(filter_func,axis=1)
I keep getting the error:
'ord() expected a character, but string of length 66 found', u'occurred at index 2'
However, I thought by using the loop in the filter_func function, I was dealing with this by inputing a char into 'ord'. Therefore the moment it hits a non-ASCII character, it should be replaced by a space.
Could somebody help me out?
Thanks!
Solution 1:
you may try this:
df.DB_user.replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)
Solution 2:
A common trick is to perform ASCII encoding with the errors="ignore"
flag, then subsequently decoding it into ASCII:
df['DB_user'].str.encode('ascii', 'ignore').str.decode('ascii')
From python3.x and above, this is my recommended solution.
Minimal Code Sample
s = pd.Series(['Déjà vu', 'Ò|zz', ';test 123'])
s
0 Déjà vu
1 Ò|zz
2 ;test 123
dtype: object
s.str.encode('ascii', 'ignore').str.decode('ascii')
0 Dj vu
1 |zz
2 ;test 123
dtype: object
P.S.: This can also be extended to cases where you need to filter out characters that do not belong to any character encoding scheme (not just ASCII).
Solution 3:
You code fails as you are not applying it on each character, you are applying it per word and ord errors as it takes a single character, you would need:
df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
You can also simplify the join using a chained comparison:
''.join([i if 32 < ord(i) < 126 else " " for i in x])
You could also use string.printable
to filter the chars:
from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if i not in st else i for i in x]))
The fastest is to use translate:
from string import maketrans
del_chars = " ".join(chr(i) for i in range(32) + range(127, 256))
trans = maketrans(t, " "*len(del_chars))
df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))
Interestingly that is faster than:
df['DB_user'] = df["DB_user"].str.translate(trans)
Solution 4:
A couple of the answers given here aren't correct. Simple validation:
s = pd.Series([chr(x) for x in range(256)])
s.loc[0]
>> '\x00'
s.replace({r'[^\x00-\x7F]+':''}, regex=True).loc[0]
>> '\x00' # FAIL
s.str.encode('ascii', 'ignore').str.decode('ascii').loc[0]
>> '\x00' # FAIL
s.apply(lambda x: ''.join([i if 32 < ord(i) < 126 else " " for i in x])).loc[0]
>> ' ' # Success!
import string
s.apply(lambda x: ''.join([" " if i not in string.printable else i for i in x])).loc[0]
>> ' ' # Looks good, but...
s.apply(lambda x: ''.join([" " if i not in string.printable else i for i in x])).loc[11]
>> '\x0b' # FAIL
del_chars = " ".join([chr(i) for i in list(range(32)) + list(range(127, 256))])
trans = str.maketrans(del_chars, " " * len(del_chars))
s.apply(lambda x: x.translate(trans)).loc[11]
>> ' ' # Success!
Conclusion: only the options in the accepted answer (from Padraic Cunningham) work reliably. There are some bizarre Python errors/typos in his second answer, amended here, but otherwise it should be the fastest.