Is there a way in Beautiful Soup to count the number of tags in an HTML page?
I'm looking at creating a dictionary in Python where the key is the HTML tag name and the value is the number of times the tag appeared. Is there a way to do this with Beautiful Soup or something else?
Solution 1:
BeautifulSoup is really good for HTML parsing, and you could certainly use it for this purpose. It would be extremely simple:
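from bs4 import BeautifulSoup as BS

def num_appearances_of_tag(tag_name, html):
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BS(html, "html.parser")
    return len(soup.find_all(tag_name))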
Solution 2:
With BeautifulSoup you can search for all tags by omitting the search criteria:
# print all tags
for tag in soup.find_all():
    print(tag.name)  # TODO: add/update dict
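One way to fill in the TODO in the loop above, as a minimal sketch: it assumes bs4 is installed and uses the built-in "html.parser" backend. The helper name count_tags and the sample HTML are made up for illustration, and collections.Counter is swapped in here as a shortcut for updating a dict by hand.

```python
from collections import Counter

from bs4 import BeautifulSoup


def count_tags(html):
    """Return a dict mapping each tag name to its number of occurrences."""
    # count_tags is a hypothetical helper, not part of Beautiful Soup.
    soup = BeautifulSoup(html, "html.parser")
    # find_all() with no arguments yields every Tag in the document.
    return dict(Counter(tag.name for tag in soup.find_all()))


# Illustrative input only.
sample_html = "<html><body><p>one</p><p>two</p><a href='#'>link</a></body></html>"
tag_counts = count_tags(sample_html)
print(tag_counts)
```

Note that other parser backends (lxml, html5lib) may insert wrapper tags such as html or body into fragments, so the counts can differ slightly from what is literally in the source.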
If you're only interested in the number of occurrences, BeautifulSoup may be overkill, in which case you could use the standard library's HTMLParser instead:
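from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser

class PrintTags(HTMLParser):
    # Called once for every opening tag the parser encounters.
    def handle_starttag(self, tag, attrs):
        print(tag)  # TODO: add/update dict

parser = PrintTags()
parser.feed(html)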
For typical well-formed HTML, this prints the same tag names as the BeautifulSoup loop.
To create the dictionary of { 'tag' : count } you could use collections.defaultdict:
from collections import defaultdict
occurrences = defaultdict(int)
# ...
occurrences[tag_name] += 1
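Putting the pieces together, here is a rough standard-library-only sketch that counts tags with HTMLParser and defaultdict. The class name TagCounter and the sample HTML are invented for illustration and are not from any library.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical helper that counts every start tag it is fed."""

    def __init__(self):
        super().__init__()
        # Missing keys start at 0, so no existence check is needed.
        self.occurrences = defaultdict(int)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.occurrences[tag] += 1


# Illustrative input only.
sample_html = "<html><body><p>one</p><br/><p>two</p></body></html>"
parser = TagCounter()
parser.feed(sample_html)
parser.close()

tag_counts = dict(parser.occurrences)
print(tag_counts)
```

Self-closing tags such as the br above are counted too, because the default handle_startendtag calls handle_starttag.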