Convert HTML entities to Unicode and vice versa

Solution 1:

As for the "vice versa" direction (Unicode to HTML entities), which is what I needed myself and which I eventually found answered on another site:

u'some string'.encode('ascii', 'xmlcharrefreplace')

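will return a plain (byte) string in which every non-ASCII character is replaced by a numeric XML/HTML character reference such as &#233;. ASCII characters, including &, < and >, are left unchanged.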
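A minimal sketch of the same call under Python 3, where the u'' prefix is no longer needed and encode() returns bytes rather than a str. The sample text and variable names are made up for illustration:

```python
# Assumption: Python 3. The sample string is only an illustration.
text = "café €5"

# Non-ASCII characters become numeric character references.
# In Python 3 the result of encode() is a bytes object.
encoded = text.encode("ascii", "xmlcharrefreplace")

# Decode back to str if a text string is needed.
as_text = encoded.decode("ascii")

print(encoded)   # b'caf&#233; &#8364;5'
print(as_text)   # caf&#233; &#8364;5
```

Since the markup characters &, < and > are plain ASCII, they pass through this call untouched and need separate escaping.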

Solution 2:

You need to have BeautifulSoup 3 installed (the BeautifulSoup package, Python 2 only).

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode.  For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities.  For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;

Solution 3:

Update for Python 2.7 and BeautifulSoup4

Unescape -- Unicode HTML to unicode with HTMLParser (Python 2.7 standard lib):

>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Unescape -- Unicode HTML to unicode with bs4 (BeautifulSoup4):

>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Escape -- Unicode to unicode HTML with bs4 (BeautifulSoup4):

>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
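If bs4 is not available, a comparable named-entity escape can be sketched with the standard library alone. This assumes Python 3, where the html.entities module provides the codepoint2name table; the helper name escape_named is hypothetical, and its output is not guaranteed to match EntitySubstitution in every case:

```python
# Assumption: Python 3. escape_named is a hypothetical helper, not a bs4 API.
from html.entities import codepoint2name


def escape_named(text):
    """Replace characters with named HTML entities where a name exists."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            # Known entity, e.g. 'é' -> '&eacute;' or '&' -> '&amp;'
            out.append("&" + name + ";")
        elif ord(ch) > 127:
            # No named entity: fall back to a numeric character reference
            out.append("&#%d;" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


print(escape_named("Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood"))
```

Note that the table also contains the markup characters, so this sketch escapes &, <, > and double quotes as well.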

Solution 4:

As hekevintran's answer suggests, you may use cgi.escape(s) to encode strings, but note that quote escaping is off by default in that function (quote=False), so it may be a good idea to pass the quote=True keyword argument along with your string. Even with quote=True, the function won't escape single quotes ("'"). Because of these issues, the function has been deprecated since Python 3.2, and it was removed in Python 3.8.

The suggested replacement is html.escape(s) (new in Python 3.2) instead of cgi.escape(s). It escapes both double and single quotes by default.

Also, html.unescape(s) was introduced in Python 3.4.

So in Python 3.4 and later you can:

  • Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
  • Use html.unescape(text) to convert HTML entities back to their plain-text representations.
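The two calls above can be sketched as a round trip. This assumes Python 3.4 or later; the function names to_entities and from_entities and the sample string are made up for illustration:

```python
# Assumption: Python 3.4+. Function names are hypothetical.
import html


def to_entities(text):
    """Escape markup characters, then turn non-ASCII into numeric references."""
    escaped = html.escape(text)  # handles &, <, >, " and ' (quote=True by default)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_entities(text):
    """Convert named and numeric character references back to characters."""
    return html.unescape(text)


original = "Fish & chips <£5>"
encoded = to_entities(original)
print(encoded)                 # Fish &amp; chips &lt;&#163;5&gt;
print(from_entities(encoded))  # Fish & chips <£5>
```

The order matters in to_entities: html.escape runs first so that the ampersands introduced by xmlcharrefreplace are not escaped a second time.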