What's the easiest way to escape HTML in Python?
Solution 1:
cgi.escape is fine. It escapes:
- < to &lt;
- > to &gt;
- & to &amp;
That is enough for text placed between HTML tags; attribute values also need their quote characters escaped.
EDIT: If you also have non-ASCII characters that you want to escape, for inclusion in a document that uses a different encoding, as Craig says, just use:
data.encode('ascii', 'xmlcharrefreplace')
Don't forget to decode data to unicode first, using whatever encoding it was encoded in.
However, in my experience that kind of encoding is unnecessary if you work with unicode from the start. Just encode at the end to the encoding specified in the document header (utf-8 for maximum compatibility).
Example:
>>> cgi.escape(u'<a>bá</a>').encode('ascii', 'xmlcharrefreplace')
'&lt;a&gt;b&#225;&lt;/a&gt;'
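On Python 3, where cgi.escape is no longer available, the same two-step idea can be sketched with html.escape. This is an illustrative equivalent rather than the answer's original code; the variable names are made up for the example, and quote=False is passed only to mimic cgi.escape's default behaviour.

```python
import html

text = '<a>b\u00e1</a>'  # the same string as above, '<a>bá</a>'

# Escape the markup characters first; quote=False mirrors cgi.escape's default.
escaped = html.escape(text, quote=False)

# Then replace every non-ASCII character with a numeric character reference.
ascii_bytes = escaped.encode('ascii', 'xmlcharrefreplace')

print(ascii_bytes)  # b'&lt;a&gt;b&#225;&lt;/a&gt;'
```

Note that the result is a bytes object on Python 3, so it would need to be decoded again before being joined with other strings.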
Also worth noting (thanks Greg) is the extra quote parameter that cgi.escape takes. With it set to True, cgi.escape also escapes double quote characters (") so you can use the resulting value in a double-quoted XML/HTML attribute.
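EDIT: Note that cgi.escape was deprecated in Python 3.2 in favor of html.escape (and removed in Python 3.8). html.escape does the same, except that quote defaults to True and, when quote is true, single quotes (') are escaped as well as double quotes.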
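To make the effect of the quote parameter concrete, here is a small sketch using html.escape on Python 3. The sample string and the input tag are invented for illustration only.

```python
import html

value = 'say "hi" & <wave>'

# quote=False leaves quote characters alone (the old cgi.escape default).
text_safe = html.escape(value, quote=False)

# quote=True (the html.escape default) also escapes " and '.
attr_safe = html.escape(value)

# The attribute-safe form can be placed inside a double-quoted attribute.
tag = '<input value="{}">'.format(attr_safe)

print(text_safe)  # say "hi" &amp; &lt;wave&gt;
print(tag)        # <input value="say &quot;hi&quot; &amp; &lt;wave&gt;">
```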
Solution 2:
In Python 3.2 a new html module was introduced, which is used for escaping reserved characters in HTML markup.
At the time it had one function, escape() (html.unescape() was added in Python 3.4):
>>> import html
>>> html.escape('x > 2 && x < 7 single quote: \' double quote: "')
'x &gt; 2 &amp;&amp; x &lt; 7 single quote: &#x27; double quote: &quot;'
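In practice the function is usually applied to every untrusted string just before it is placed into markup. A minimal sketch, assuming a hypothetical helper named render_comment (not part of the html module) that treats both of its arguments as untrusted:

```python
import html


def render_comment(author, text):
    # Hypothetical helper: escape both values before interpolating them.
    # The default quote=True makes the author safe inside the title attribute.
    return '<p title="{}">{}</p>'.format(html.escape(author), html.escape(text))


fragment = render_comment('Tom "T" O\'Neil', '1 < 2 && 3 > 2')
print(fragment)
# <p title="Tom &quot;T&quot; O&#x27;Neil">1 &lt; 2 &amp;&amp; 3 &gt; 2</p>
```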
Solution 3:
If you wish to escape a string for use in a URL:
This is probably NOT what the OP wanted (the question doesn't clearly indicate the context in which the escaping is meant to be used), but urllib in Python's standard library has a function, quote, that percent-encodes characters so that they can be included safely in a URL.
The following is an example:
#!/usr/bin/python
# Python 2 code; in Python 3 the function is urllib.parse.quote and print is a function.
from urllib import quote
x = '+<>^&'
print quote(x) # prints %2B%3C%3E%5E%26
See the urllib documentation for details.
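For Python 3, the same percent-encoding can be sketched with urllib.parse.quote. The example.com URL and the variable names below are assumptions made for illustration; the last step shows that a URL placed in an HTML attribute still needs HTML escaping on top of the URL encoding.

```python
import html
from urllib.parse import quote

raw = '+<>^&'

# safe='' percent-encodes every reserved character, including '/'.
encoded = quote(raw, safe='')
print(encoded)  # %2B%3C%3E%5E%26

# Hypothetical URL: the '&' separating parameters is still special in HTML,
# so the finished URL is HTML-escaped before going into an href attribute.
url = 'https://example.com/search?q=' + encoded + '&lang=en'
link = '<a href="{}">search</a>'.format(html.escape(url))
print(link)
```

The two kinds of escaping solve different problems, so neither one replaces the other.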