Beautiful soup: Extract everything between two tags
I am using BeautifulSoup to extract data from HTML files. I want to get all of the information between two tags. This means that if I have an HTML section like this:
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
Then if I wanted all of the information between the first h1 and the second h1, the output would look like this:
Text <i>here</i> has no tag
<div>This is in a div</div>
I've tried next_sibling loops, but there always seems to be a catch. Is there a command in BeautifulSoup that simply pulls everything (text, newlines, divs, special characters) between element "A" and element "B"?
Solution 1:
One solution is to .extract() all content before the first <h1> tag and after the second <h1> tag:
from bs4 import BeautifulSoup
html_doc = '''
This I <b>don't</b> want
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
This I <b>don't</b> want too
'''
soup = BeautifulSoup(html_doc, 'html.parser')
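# Iterate over a copy, because extract() mutates soup.contents.
for c in list(soup.contents):
    # Keep the first <h1> and every node whose nearest preceding <h1> is the first one.
    if c is soup.h1 or c.find_previous('h1') is soup.h1:
        continue
    c.extract()

# Finally, remove the two <h1> boundary tags themselves.
for h1 in soup.select('h1'):
    h1.extract()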
print(soup)
Prints:
Text <i>here</i> has no tag
<div>This is in a div</div>
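If you would rather not modify the soup, the same result can probably be reached by walking .next_siblings from the first <h1> until the second one is hit. This is only a minimal sketch: between_siblings is a hypothetical helper name (not part of bs4), and it assumes both boundary tags share the same parent.

```python
from bs4 import BeautifulSoup

html_doc = '''
This I <b>don't</b> want
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
This I <b>don't</b> want too
'''


def between_siblings(start, end):
    """Hypothetical helper: collect every sibling node (tags and text)
    that follows `start`, stopping before `end`.
    Assumes `start` and `end` have the same parent."""
    nodes = []
    for node in start.next_siblings:
        if node is end:
            break
        nodes.append(node)
    return nodes


soup = BeautifulSoup(html_doc, 'html.parser')
first_h1, second_h1 = soup.find_all('h1', limit=2)

# Join the string form of each node; text nodes keep their newlines.
fragment = ''.join(str(node) for node in between_siblings(first_h1, second_h1))
print(fragment.strip())
```

Because nothing is extracted, soup is left untouched and the helper can be called again with a different pair of tags.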
Solution 2:
You can target the parent of the two tags, or wrap them in a container, and then extract all children of that parent, like this:
from bs4 import BeautifulSoup
content = """
<div class="container">
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
</div>
"""
soup = BeautifulSoup(content, 'html.parser')
results = soup.find('div').findChildren()
print(results)
Or, to collect every tag that follows the first <h1>:
print(soup.find('h1').findAllNext())
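One caveat: findChildren() and findAllNext() return Tag objects only, so the untagged text ("Text ... has no tag") is not in their results. If the text matters, iterating over .children of the container should keep it. The following is a minimal sketch under the same assumption as above, namely that the content sits in a div with class "container"; the variable names are illustrative.

```python
from bs4 import BeautifulSoup, Tag

content = """
<div class="container">
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
</div>
"""

soup = BeautifulSoup(content, 'html.parser')
container = soup.find('div', class_='container')

# find_all() yields Tag objects only, so the bare text is missing here.
tags_only = container.find_all(recursive=False)

# .children yields text nodes as well; skip the two <h1> boundary tags.
inner = [
    node for node in container.children
    if not (isinstance(node, Tag) and node.name == 'h1')
]
fragment = ''.join(str(node) for node in inner).strip()
print(fragment)
```

This variant drops every direct <h1> child of the container, so it only fits when the two boundary tags are the only <h1> elements at that level.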