Regex to capture HTML elements with their class names

Solution 1:

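Regex is a poor choice for parsing HTML: it can't reliably handle nesting or attributes split across lines, both of which appear in the markup below. Luckily, this is trivial with BeautifulSoup: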

from bs4 import BeautifulSoup

html = """<div class="header_container container_12">
        <div class="grid_5">
              <h1><a href="#">Logo Text Here</a></h1>
        </div>
        <div class="grid_7">
            <div class="menu_items"> 
                <a href="#" class="home active">Home</a><a href="#" class="portfolio">Portfolio</a> 
               <a href="#" 
                class="about">About Me
                </a><a href="#" class="contact">Contact Me</a> 
            </div>
        </div>
</div>"""
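
# attrs={"class": True} matches every element that has a class attribute.
# BeautifulSoup returns the class value as a list of names.
for elem in BeautifulSoup(html, "lxml").find_all(attrs={"class": True}):
    print(elem.attrs["class"], elem.name)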

Output:

['header_container', 'container_12'] div
['grid_5'] div
['grid_7'] div
['menu_items'] div
['home', 'active'] a
['portfolio'] a
['about'] a
['contact'] a

You can collect these into a dict if you like, but be careful: more than one element will likely share the same class list, and each later element overwrites the earlier entry. All the dict tells you is that at least one element with a given tag name exists for a given tuple of class names, in the order they appear in the attribute.

elems = {}

for elem in BeautifulSoup(html, "lxml").find_all(attrs={"class": True}):
    # Lists aren't hashable, so convert the class list to a tuple to use it as a key.
    elems[tuple(elem.attrs["class"])] = elem.name

for k, v in elems.items():
    print(k, v)
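
If the overwriting is a problem, one option is to keep a list of tag names per class tuple instead of a single value. The following is only a sketch: the `sample` markup and the `group_tags_by_class` helper are made up for illustration, and it uses the built-in "html.parser" so that it runs without lxml installed.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Hypothetical markup in which two different tags share the class "portfolio".
sample = """<div class="menu_items">
    <a href="#" class="home active">Home</a>
    <a href="#" class="portfolio">Portfolio</a>
    <span class="portfolio">Same class, different tag</span>
</div>"""


def group_tags_by_class(markup):
    """Map each tuple of class names to the tag names of all elements carrying it."""
    groups = defaultdict(list)
    # "html.parser" ships with Python; swap in "lxml" if you have it installed.
    soup = BeautifulSoup(markup, "html.parser")
    for elem in soup.find_all(attrs={"class": True}):
        # Append rather than assign, so earlier elements are not overwritten.
        groups[tuple(elem.attrs["class"])].append(elem.name)
    return dict(groups)


groups = group_tags_by_class(sample)
for classes, tags in groups.items():
    print(classes, tags)
```

Here the `("portfolio",)` key ends up holding both `a` and `span`, whereas the plain dict would have kept only the last one. The keys are still order-sensitive, so `("home", "active")` and `("active", "home")` count as different buckets.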