Is there a way in Beautiful Soup to count the number of tags in an HTML page?

I'd like to create a dictionary in Python where the key is the HTML tag name and the value is the number of times that tag appears. Is there a way to do this with Beautiful Soup or something else?


Solution 1:

BeautifulSoup is really good for HTML parsing, and you could certainly use it for this purpose. Counting the occurrences of a single tag is extremely simple:

from bs4 import BeautifulSoup as BS

def num_appearances_of_tag(tag_name, html):
    # Name the parser explicitly; bs4 warns if it has to guess one.
    soup = BS(html, "html.parser")
    return len(soup.find_all(tag_name))
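
The question asks for a dictionary covering every tag rather than a single count. A minimal sketch of that, assuming bs4 is installed and using the built-in "html.parser" backend; the function name count_tags and the sample markup are hypothetical, chosen only for illustration:

```python
from collections import Counter

from bs4 import BeautifulSoup


def count_tags(html):
    """Return a dict mapping each tag name to its number of occurrences."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() with no arguments returns every Tag in the document.
    return dict(Counter(tag.name for tag in soup.find_all()))


# Hypothetical sample input.
sample = "<html><body><p>one</p><p>two</p><a href='#'>x</a></body></html>"
print(count_tags(sample))
```

Counter does the tallying in one pass, so there is no need to call find_all() once per tag name.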

Solution 2:

With BeautifulSoup you can search for all tags by omitting the search criteria:

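# print the name of every tag
# (soup is a BeautifulSoup object built from the page, e.g. BeautifulSoup(html, "html.parser"))
for tag in soup.find_all():
    print(tag.name)  # TODO: add/update dict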

If you're only interested in the number of occurrences, BeautifulSoup may be overkill, in which case you could use the standard library's HTMLParser instead:

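from html.parser import HTMLParser  # Python 3; the module was named HTMLParser in Python 2

class PrintTags(HTMLParser):
    # Called once for every opening tag the parser encounters.
    def handle_starttag(self, tag, attrs):
        print(tag)  # TODO: add/update dict

parser = PrintTags()
parser.feed(html)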

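This prints the same tag names, provided BeautifulSoup's parser has not added or repaired any tags (lxml and html5lib, for example, insert missing <html> and <body> elements).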

To create the dictionary of { 'tag' : count } you could use collections.defaultdict:

from collections import defaultdict

occurrences = defaultdict(int)
# ...
occurrences[tag_name] += 1
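
Putting the pieces together, here is one way the HTMLParser approach and defaultdict might be combined, using only the standard library. This is a sketch rather than a definitive implementation: the names TagCounter and count_tags_stdlib and the sample markup are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags as the parser encounters them."""

    def __init__(self):
        super().__init__()
        self.occurrences = defaultdict(int)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.occurrences[tag] += 1


def count_tags_stdlib(html):
    """Return a plain dict of {tag name: count} for the given HTML string."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.occurrences)


# Hypothetical sample input.
sample_html = "<div><p>one</p><P>two</P><br></div>"
print(count_tags_stdlib(sample_html))
```

Only start tags are counted, so each element contributes once regardless of whether it has a closing tag.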