Python code to remove HTML tags from a string [duplicate]
I have a text like this:
text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""
Using pure Python, with no external modules, I want to get this:
>>> print remove_tags(text)
Title A long text..... a link
I know I can do it using lxml.html.fromstring(text).text_content(), but I need to achieve the same in pure Python, using only builtins or the standard library, for 2.6+.
How can I do that?
Using a regex
Using a regex, you can clean everything inside <>:
import re
# as per recommendation from @freylis, compile once only
CLEANR = re.compile('<.*?>')
def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext
Some HTML texts can also contain entities that are not enclosed in brackets, such as '&nbsp;'. If that is the case, then you might want to write the regex as
CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
This link contains more details on this.
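As a rough illustration, here is the entity-aware pattern above applied to the sample text from the question. The name remove_tags is borrowed from the question, and the whitespace-collapsing step is an assumption added here to get the single-line output the question asks for; it is not part of the regex itself.

```python
import re

# Same pattern as above: tags, plus named, decimal and hex character entities.
CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

def remove_tags(raw_html):
    # Delete everything the pattern matches (tags and entities).
    without_markup = CLEANR.sub('', raw_html)
    # Assumed extra step: collapse newlines and repeated spaces into one space.
    return ' '.join(without_markup.split())

text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

result = remove_tags(text)
print(result)  # Title A long text........ a link
```

Note that entities are deleted rather than decoded, so '5&nbsp;kg' becomes '5kg'.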
Using BeautifulSoup
You could also use the additional BeautifulSoup package to extract all the raw text.
You will need to explicitly set a parser when calling BeautifulSoup. I recommend "lxml", as mentioned in alternative answers: it is much more robust than the default one (html.parser, i.e. the one available without an additional install).
from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html, "lxml").text
But this does not spare you from using external libraries, so I recommend the first solution.
EDIT: To use lxml you need to pip install lxml.
Python has several XML modules built in. The simplest one for the case that you already have a string with the full HTML is xml.etree, which works (somewhat) similarly to the lxml example you mention:
import xml.etree.ElementTree

def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
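A small usage sketch of this approach on the sample text from the question follows. One caveat worth showing: xml.etree is an XML parser, so it only accepts well-formed markup, and something like an unclosed <br> raises an error. The whitespace collapsing and the second input string are assumptions added for illustration, and ParseError assumes Python 2.7+ or 3.2+ (on 2.6 the failure surfaces as an expat error instead).

```python
import xml.etree.ElementTree as ET

def remove_tags(text):
    # itertext() yields the text of every element in document order.
    return ''.join(ET.fromstring(text).itertext())

text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

# Assumed extra step: collapse newlines and repeated spaces into one space.
flattened = ' '.join(remove_tags(text).split())
print(flattened)  # Title A long text........ a link

# Typical HTML that is not well-formed XML: <br> is never closed.
try:
    remove_tags('<p>first line<br>second line</p>')
    parse_failed = False
except ET.ParseError as err:
    parse_failed = True
    print('not well-formed XML: %s' % err)
```

So this fits input you know to be well-formed (XHTML-like) rather than arbitrary HTML from the web.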