How to remove \xa0 from string in Python?
I am currently using Beautiful Soup to parse an HTML file and calling get_text()
, but it seems like I'm being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into spaces? I guess the more generalized question would be, is there a way to remove Unicode formatting?
I tried using: line = line.replace(u'\xa0',' ')
, as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):
EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8')
, but just doing .encode('utf-8')
without replace()
seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?
Solution 1:
\xa0 is actually non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space.
string = string.replace(u'\xa0', u' ')
When .encode('utf-8'), it will encode the unicode to utf-8, that means every unicode could be represented by 1 to 4 bytes. For this case, \xa0 is represented by 2 bytes \xc2\xa0.
Read up on http://docs.python.org/howto/unicode.html.
Please note: this answer in from 2012, Python has moved on, you should be able to use unicodedata.normalize
now
Solution 2:
There's many useful things in Python's unicodedata
library. One of them is the .normalize()
function.
Try:
new_str = unicodedata.normalize("NFKD", unicode_str)
Replacing NFKD with any of the other methods listed in the link above if you don't get the results you're after.
Solution 3:
Try using .strip() at the end of your line
line.strip()
worked well for me
Solution 4:
After trying several methods, to summarize it, this is how I did it. Following are two ways of avoiding/removing \xa0 characters from parsed HTML string.
Assume we have our raw html as following:
raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
So lets try to clean this HTML string:
from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
print text_string
#u'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'
The above code produces these characters \xa0 in the string. To remove them properly, we can use two ways.
Method # 1 (Recommended): The first one is BeautifulSoup's get_text method with strip argument as True So our code becomes:
clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print clean_text
# Dear Parent,This is a test message,kindly ignore it.Thanks
Method # 2: The other option is to use python's library unicodedata
import unicodedata
text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD",text_string)
print clean_text
# u'Dear Parent,This is a test message,kindly ignore it.Thanks'
I have also detailed these methods on this blog which you may want to refer.
Solution 5:
try this:
string.replace('\\xa0', ' ')