How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup? [duplicate]
As justhalf points out above, my question here is essentially a duplicate of this question.
The HTML content reported itself as UTF-8 encoded and, for the most part, it was, except for one or two rogue invalid UTF-8 byte sequences.
This apparently confuses BeautifulSoup about which encoding is in use. When I first decoded the content as UTF-8 before passing it to BeautifulSoup, like this:
soup = BeautifulSoup(response.read().decode('utf-8'))
I would get the error:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813: invalid continuation byte
Looking more closely at the output, there was an instance of the character Ü which was wrongly encoded as the invalid byte sequence 0xe3 0x9c, rather than the correct 0xc3 0x9c.
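To see why that byte pair trips the decoder, here is a minimal sketch in Python 3 syntax (the code in this post is Python 2). The two byte sequences are the ones quoted above; the trailing `ber` and the variable names are made up for illustration. The lead byte 0xe3 announces a three-byte sequence, so the decoder expects two continuation bytes and fails when it meets the plain ASCII `b`.

```python
# Python 3 sketch. Byte values are from the post; the "ber" suffix
# and all variable names are illustrative assumptions.
valid = b'\xc3\x9cber'     # 0xc3 0x9c is the correct UTF-8 encoding of U+00DC
invalid = b'\xe3\x9cber'   # 0xe3 starts a 3-byte sequence, but only one
                           # continuation byte (0x9c) follows before ASCII 'b'

decoded_valid = valid.decode('utf-8')

try:
    invalid.decode('utf-8')
    decode_failed = False
    reason = None
    bad_span = None
except UnicodeDecodeError as exc:
    decode_failed = True
    reason = exc.reason               # human-readable cause of the failure
    bad_span = (exc.start, exc.end)   # half-open range of offending bytes

print(decoded_valid)
print(decode_failed, reason, bad_span)
```

The two-byte span reported here corresponds to the two-position range (186812-186813) in the error message above.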
As the currently highest-rated answer on that question suggests, the invalid UTF-8 bytes can be dropped while decoding, so that only valid data is passed to BeautifulSoup:
soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
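One thing worth knowing about 'ignore' is that the bad bytes are dropped silently, so the mis-encoded Ü simply vanishes from the text. A rough Python 3 sketch comparing it with the standard 'replace' handler, which leaves a visible U+FFFD marker instead; the sample markup and variable names are invented for illustration and are not the page from the question:

```python
# Python 3 sketch. The markup below is a made-up sample containing the
# invalid pair 0xe3 0x9c from the post plus a correctly encoded "ö".
raw = b'<div title="\xe3\x9cber uns">Hier k\xc3\xb6nnen Sie</div>'

# 'ignore' silently discards the undecodable bytes.
ignored = raw.decode('utf-8', 'ignore')

# 'replace' substitutes U+FFFD (the replacement character) so the
# damage stays visible in the decoded text.
replaced = raw.decode('utf-8', 'replace')

print(ignored)
print(replaced)
```

Either string can then be handed to BeautifulSoup; 'replace' may be the safer choice if you want to notice where the source data was damaged.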
Encoding the result to UTF-8 seems to work for me:
print(soup.find('div', id='navbutton_account')['title'].encode('utf-8'))
It yields:
Hier können Sie sich kostenlos registrieren und / oder einloggen!