Beautiful Soup cannot find a CSS class if the object has other classes, too
Unfortunately, BeautifulSoup treats this as a single class with a space in it, 'class1 class2', rather than two classes, ['class1', 'class2']. A workaround is to use a regular expression to search for the class instead of a string.
This works:
import re
soup.findAll(True, {'class': re.compile(r'\bclass1\b')})
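To see why the regular expression helps, here is a minimal sketch that uses only the standard re module on the raw attribute string, with no BeautifulSoup involved. The names class_attr, word, and token are made up for illustration, and the stricter token pattern is an assumption of mine, not part of the answer above.

```python
import re

class_attr = "class1 class2"  # raw value of class="class1 class2"

# Exact string comparison fails because the attribute holds two names.
print(class_attr == "class1")  # False

# A word-boundary pattern finds class1 anywhere in the attribute value.
word = re.compile(r"\bclass1\b")
print(bool(word.search(class_attr)))  # True

# Caveat: "-" is a non-word character, so \b also matches inside
# a different class name such as "class1-wide".
print(bool(word.search("class1-wide")))  # True

# Stricter, hypothetical alternative: require whitespace or string edges.
token = re.compile(r"(?:^|\s)class1(?:\s|$)")
print(bool(token.search("class1-wide")))    # False
print(bool(token.search("class2 class1")))  # True
```

If your class names contain hyphens, the stricter pattern is probably the safer one to pass to the search.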
In case anybody comes across this question, BeautifulSoup (bs4) now supports this:
Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.
In [1]: import bs4
In [2]: soup = bs4.BeautifulSoup('<div class="foo bar"></div>')
In [3]: soup(attrs={'class': 'bar'})
Out[3]: [<div class="foo bar"></div>]
Also, you don't have to type findAll anymore: calling the soup object directly, as in soup(...) above, is equivalent to calling find_all().
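The match in the session above amounts to treating the class attribute as a whitespace-separated list of names and testing membership. The following is only a rough sketch of that idea using the standard library's html.parser; it is not how Beautiful Soup is implemented, and ClassFinder and find_by_class are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collect start tags whose class attribute contains a given class name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        class_value = dict(attrs).get("class") or ""
        # Split on whitespace so "foo bar" becomes ["foo", "bar"].
        if self.wanted in class_value.split():
            self.matches.append((tag, class_value))


def find_by_class(html, wanted):
    finder = ClassFinder(wanted)
    finder.feed(html)
    finder.close()
    return finder.matches


html = '<div class="foo bar"></div><p class="foobar"></p><span class="bar"></span>'
print(find_by_class(html, "bar"))
# [('div', 'foo bar'), ('span', 'bar')]
```

Note that "foobar" is not matched: membership is tested per class name, not by substring.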
You should use lxml. It works with multiple class values separated by spaces ('class1 class2').
Despite its name, lxml is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup does (which is BeautifulSoup's claim to fame). It has a compatibility API for BeautifulSoup too, if you don't want to learn the lxml API.
Ian Bicking agrees and prefers lxml over BeautifulSoup.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or a similar platform where only pure-Python packages are allowed.
You can even use CSS selectors with lxml, so it's far easier to use than BeautifulSoup. Try playing around with it in an interactive Python console.
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_.
For example:
soup.find_all("a", class_="class1")