Using BeautifulSoup to search HTML for string
Solution 1:
The following line is looking for the exact NavigableString 'Python':
>>> soup.body.findAll(text='Python')
[]
Note that the following NavigableString is found:
>>> soup.body.findAll(text='Python Jobs')
[u'Python Jobs']
Note this behaviour:
>>> import re
>>> soup.body.findAll(text=re.compile('^Python$'))
[]
So your regexp is looking for an occurrence of 'Python' not the exact match to the NavigableString 'Python'.
Solution 2:
text='Python'
searches for elements that have the exact text you provided:
import re
from BeautifulSoup import BeautifulSoup
html = """<p>exact text</p>
<p>almost exact text</p>"""
soup = BeautifulSoup(html)
print soup(text='exact text')
print soup(text=re.compile('exact text'))
Output
[u'exact text']
[u'exact text', u'almost exact text']
"To see if the string 'Python' is located on the page http://python.org":
import urllib2
html = urllib2.urlopen('http://python.org').read()
print 'Python' in html # -> True
If you need to find a position of substring within a string you could do html.find('Python')
.
Solution 3:
In addition to the accepted answer. You can use a lambda
instead of regex
:
from bs4 import BeautifulSoup
html = """<p>test python</p>"""
soup = BeautifulSoup(html, "html.parser")
print(soup(text="python"))
print(soup(text=lambda t: "python" in t))
Output:
[]
['test python']
Solution 4:
I have not used BeuatifulSoup but maybe the following can help in some tiny way.
import re
import urllib2
stuff = urllib2.urlopen(your_url_goes_here).read() # stuff will contain the *entire* page
# Replace the string Python with your desired regex
results = re.findall('(Python)',stuff)
for i in results:
print i
I'm not suggesting this is a replacement but maybe you can glean some value in the concept until a direct answer comes along.