Rendered HTML to plain text using Python

BeautifulSoup is a scraping library, so it's probably not the best choice for doing HTML rendering. If it's not essential to use BeautifulSoup, you should take a look at html2text. For example:

import html2text
html = open("foobar.html").read()
print html2text.html2text(html)

This outputs:

Some text more text even more text

  * list item
  * yet another list item

Some other text

  * list item
  * yet another list item

I was encountering the same problem trying to parse the rendered HTML. Basically it seems that BS is not the ideal package for this. @Del gives the great html2text solution.

On a differet SO question: BeautifulSoup get_text does not strip all tags and JavaScript @Helge mentioned using nltk. Unfortunately nltk appears to be discontinuing this method.

I tried both html2text and nltk.clean_html and was surprised by the timing results so thought they warranted an answer for posterity. Of course, the speeds highly depend on the contents of the data...

Answer from @Helge (nltk).

import nltk

%timeit nltk.clean_html(html)
was returning 153 us per loop

It worked really well to return a string with rendered html. This nltk module was faster than even html2text, though perhaps html2text is more robust.

Answer above from @del

betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
%3.09 ms per loop

Rendered HTML to plain text using Python

Related

Recent Posts