BeautifulSoup returns unexpected extra spaces
I believe this is a bug in lxml's HTML parser. Try:
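# Python 2 code (urllib2 and the print statement)
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen("http://www.beppegrillo.it")
prova = html.read()
# Rewrite the encoding named in the page source before parsing
soup = BeautifulSoup(prova.replace('ISO-8859-1', 'utf-8'))
print soup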
Replacing the encoding named in the page source works around the problem. I believe the issue was fixed in lxml 3.0 alpha 2 and lxml 2.3.6, so it is worth checking whether you need to upgrade to a newer version.
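To check whether your installed lxml is new enough, something like the following rough sketch may help. The helper name lxml_has_fix is made up for this example, and the comparison is a simplification: it treats every 3.x release as fixed, so 3.0 alpha 1 would be reported as fixed as well.

```python
def lxml_has_fix(version):
    """Return True if a (major, minor, patch, ...) tuple is at least 2.3.6.

    Hypothetical helper for illustration; any 3.x version counts as fixed.
    """
    return tuple(version[:3]) >= (2, 3, 6)


try:
    from lxml import etree
    # LXML_VERSION is a tuple of integers describing the installed lxml
    installed = etree.LXML_VERSION
    print("lxml %s, has fix: %s" % (
        ".".join(str(part) for part in installed),
        lxml_has_fix(installed),
    ))
except ImportError:
    print("lxml is not installed")
```

If this reports that the fix is missing, upgrading lxml is probably simpler than keeping the replace() workaround.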
If you want more information on the bug, it was initially filed here:
https://bugs.launchpad.net/beautifulsoup/+bug/972466
Hope this helps,
Hayden
You can specify the parser as html.parser:
soup = BeautifulSoup(prova, 'html.parser')
You can also specify the html5 parser:
soup = BeautifulSoup(prova, 'html5')
Haven't installed the html5 parser yet? Install it from the terminal:
sudo apt-get install python-html5lib
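If you are not sure which parsers are installed on a given machine, one option is to try them in order and fall back to the built-in one. This is only a sketch: make_soup and its default parser order are made up for this example, and the sample markup is a placeholder rather than the page from the question.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=('html5lib', 'html.parser')):
    """Parse markup with the first parser in `parsers` that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise FeatureNotFound(
        "none of the requested parsers are installed: %s" % ", ".join(parsers)
    )


soup = make_soup('<p>Ciao   mondo</p>')
print(soup.p.get_text())
```

html.parser ships with Python, so keeping it last in the list means the call should always find a parser.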
The xml parser may also be used (soup = BeautifulSoup(prova, 'xml')), but you may see some differences in multi-valued attributes like class="foo bar".
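To see the multi-valued attribute difference for yourself, a small comparison along these lines should do. The markup is a made-up sample, and the xml branch only runs if lxml is installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = '<p class="foo bar">text</p>'

# HTML parsers split class into a list of separate values
html_soup = BeautifulSoup(markup, 'html.parser')
html_classes = html_soup.p['class']
print(html_classes)

try:
    # The xml parser keeps the attribute as one string
    xml_soup = BeautifulSoup(markup, 'xml')
    print(xml_soup.p['class'])
except FeatureNotFound:
    print("lxml is not installed, so the 'xml' parser is unavailable")
```

With html.parser the value is a list of the two class names, while the xml parser gives back the single string "foo bar", so code that iterates over p['class'] behaves differently depending on the parser.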