How to use Beautiful Soup to extract string in <script> tag?
In a given .html page, I have a script tag like so:
<script>jQuery(window).load(function () {
setTimeout(function(){
jQuery("input[name=Email]").val("[email protected]");
}, 1000);
});</script>
How can I use Beautiful Soup to extract the email address?
To add a bit more to @Bob's answer, assuming you also need to locate the script tag in HTML that may contain other script tags: the idea is to define one regular expression that is used both for locating the element with BeautifulSoup and for extracting the email value:
import re
from bs4 import BeautifulSoup
data = """
<body>
<script>jQuery(window).load(function () {
setTimeout(function(){
jQuery("input[name=Email]").val("[email protected]");
}, 1000);
});</script>
</body>
"""
pattern = re.compile(r'\.val\("([^@]+@[^@]+\.[^@]+)"\);', re.MULTILINE | re.DOTALL)
soup = BeautifulSoup(data, "html.parser")
script = soup.find("script", text=pattern)
if script:
    match = pattern.search(script.text)
    if match:
        email = match.group(1)
        print(email)
Prints [email protected].
Here we are using a simple regular expression for the email address. We could go further and make it stricter, but I doubt that would be practically necessary for this problem.
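If a stricter match were ever needed, something along these lines might do. This is only a sketch: the pattern is an illustrative approximation rather than a full email validator, and the name `extract_email` and the address `user@example.com` are placeholders of mine, not part of the original page.

```python
import re

# Hypothetical stricter pattern: a local part, "@", then at least two
# dot-separated domain labels. An approximation, not an RFC-complete check.
STRICT_VAL_EMAIL = re.compile(
    r'\.val\("([A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"\)'
)


def extract_email(script_source):
    """Return the first email passed to .val("..."), or None if there is none."""
    match = STRICT_VAL_EMAIL.search(script_source)
    return match.group(1) if match else None


# Placeholder input, standing in for the text of the script tag.
sample = 'jQuery("input[name=Email]").val("user@example.com");'
print(extract_email(sample))
```

The helper returns None instead of raising when the argument of `.val(...)` does not look like an email address.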
I ran into a similar problem, and the issue seems to be that calling script_tag.text returns an empty string. Instead, you have to call script_tag.string. Maybe this changed in some version of BeautifulSoup?
Anyway, @alecxe's answer didn't work for me, so I modified their solution:
import re
from bs4 import BeautifulSoup
data = """
<body>
<script>jQuery(window).load(function () {
setTimeout(function(){
jQuery("input[name=Email]").val("[email protected]");
}, 1000);
});</script>
</body>
"""
soup = BeautifulSoup(data, "html.parser")
script_tag = soup.find("script")
if script_tag and script_tag.string:
    # .string holds the whole script source, e.g. "jQuery(window)..."
    script_tag_contents = script_tag.string
    # from there you can search the string using a regex, etc.
    match = re.search(r'\.val\("(.+?)"\);', script_tag_contents)
    if match:
        email = match.group(1)
        print(email)
This prints [email protected].
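When the page has several script tags, the same idea can be wrapped in a small helper that skips tags without inline source. This is a rough sketch under my own assumptions: the function name `first_val_argument`, the `app.js` script, and the address `user@example.com` are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Captures whatever is passed to .val("...") in the script source.
VAL_PATTERN = re.compile(r'\.val\("([^"]+)"\)')


def first_val_argument(html):
    """Return the first .val("...") argument found in any inline script, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for script_tag in soup.find_all("script"):
        source = script_tag.string  # None for a script tag with no inline content
        if not source:
            continue
        match = VAL_PATTERN.search(source)
        if match:
            return match.group(1)
    return None


# Placeholder markup: one external script, one inline script.
html = """
<body>
<script src="app.js"></script>
<script>jQuery("input[name=Email]").val("user@example.com");</script>
</body>
"""
print(first_val_argument(html))
```

Checking `source` before searching avoids passing None to the regex when a script tag only has a src attribute.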
This is not possible using only BeautifulSoup, but you can do it with, for example, BS plus regular expressions:
import re
from bs4 import BeautifulSoup as BS
html = """<script> ... </script>"""
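bs = BS(html, "html.parser")
# .string is used here because another answer reports that .text can come back empty for script tags
txt = bs.script.string
# re.search rather than re.match: the .val(...) call is not at the start of the script
email = re.search(r'\.val\("(.+?)"\);', txt).group(1)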
Or like this:
...
email = txt.split('.val("')[1].split('");')[0]
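The split version raises an IndexError when the marker is missing. A slightly more defensive variant could use str.partition instead. This is only a sketch: the helper name `between` and the sample string are my own placeholders.

```python
def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None  # start marker not present
    value, found_end, _ = rest.partition(end)
    return value if found_end else None  # None if the end marker is missing


# Placeholder input, standing in for the text of the script tag.
txt = 'jQuery("input[name=Email]").val("user@example.com");'
print(between(txt, '.val("', '");'))
```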
In order to get the string inside the <script> tag, you can use .contents or .string.
from bs4 import BeautifulSoup
data = """
<body>
<script>jQuery(window).load(function () {
setTimeout(function(){
jQuery("input[name=Email]").val("[email protected]");
}, 1000);
});</script>
</body>
"""
soup = BeautifulSoup(data, "html.parser")
script = soup.find("script")
inner_text_with_string = script.string
inner_text_with_content = script.contents[0]
print('inner_text_with_string', inner_text_with_string)
print('inner_text_with_content', inner_text_with_content)
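One difference worth keeping in mind is how the two behave on a script tag with no inline content. The sketch below assumes a page with one external and one inline script. The markup and the helper name `script_source` are hypothetical.

```python
from bs4 import BeautifulSoup

# Placeholder markup: an external script followed by an inline one.
html = '<script src="app.js"></script><script>var x = 1;</script>'
soup = BeautifulSoup(html, "html.parser")
external, inline = soup.find_all("script")


def script_source(tag):
    """Return the tag's first child if it has one, else None."""
    return tag.contents[0] if tag.contents else None


print(external.string)        # .string is None when the tag has no children
print(external.contents)      # .contents is an empty list in that case
print(script_source(inline))  # the inline script source
```

For the empty tag, .string gives None while .contents[0] would raise an IndexError, so a guard like the one in `script_source` is needed when indexing .contents directly.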