Regex to capture html elements with their class name
Solution 1:
Regex is a poor choice for HTML parsing, but luckily this is trivial with BeautifulSoup:
from bs4 import BeautifulSoup
html = """<div class="header_container container_12">
<div class="grid_5">
<h1><a href="#">Logo Text Here</a></h1>
</div>
<div class="grid_7">
<div class="menu_items">
<a href="#" class="home active">Home</a><a href="#" class="portfolio">Portfolio</a>
<a href="#"
class="about">About Me
</a><a href="#" class="contact">Contact Me</a>
</div>
</div>
</div>"""
for elem in BeautifulSoup(html, "lxml").find_all(attrs={"class": True}):
    print(elem.attrs["class"], elem.name)
Output:
['header_container', 'container_12'] div
['grid_5'] div
['grid_7'] div
['menu_items'] div
['home', 'active'] a
['portfolio'] a
['about'] a
['contact'] a
You can put this into a dict if you like, but be careful: several elements can share the same class list, and with a plain dict each later element overwrites the earlier one stored under that key. All such a dict tells you is that at least one element with a given tuple of class names (in the order they appear in the attribute) exists, and the tag name of the last such element.
elems = {}
for elem in BeautifulSoup(html, "lxml").find_all(attrs={"class": True}):
    elems[tuple(elem.attrs["class"])] = elem.name
for k, v in elems.items():
    print(k, v)
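If you need every element rather than just the last one per key, one option is to store a list of tag names under each class tuple. The following is only a minimal sketch of that grouping idea, not part of the approach above: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without third-party packages, and the names ClassCollector, sample and groups, as well as the sample markup, are made up for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Group tag names by the tuple of class names on each element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        # tuple of class names (in attribute order) -> list of tag names
        self.groups = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        class_value = dict(attrs).get("class")
        if class_value:
            self.groups[tuple(class_value.split())].append(tag)


# Hypothetical sample in which several elements share the class "item"
sample = """<ul class="menu">
<li class="item"><a href="#" class="home active">Home</a></li>
<li class="item"><span class="item">About</span></li>
</ul>"""

collector = ClassCollector()
collector.feed(sample)
collector.close()
groups = dict(collector.groups)

for classes, tags in groups.items():
    print(classes, tags)
```

With BeautifulSoup the same idea would amount to appending elem.name to a defaultdict(list) entry inside the find_all loop instead of assigning to it.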