How to use Beautiful Soup to extract string in <script> tag?

In a given .html page, I have a script tag like so:

     <script>jQuery(window).load(function () {
  setTimeout(function(){
    jQuery("input[name=Email]").val("[email protected]");
  }, 1000);
});</script>

How can I use Beautiful Soup to extract the email address?

To add a bit more to @Bob's answer: this assumes you also need to locate the right script tag in HTML that may contain other script tags.

The idea is to define a regular expression that would be used for both locating the element with BeautifulSoup and extracting the email value:

import re

from bs4 import BeautifulSoup


data = """
<body>
    <script>jQuery(window).load(function () {
      setTimeout(function(){
        jQuery("input[name=Email]").val("[email protected]");
      }, 1000);
    });</script>
</body>
"""
pattern = re.compile(r'\.val\("([^@]+@[^@]+\.[^@]+)"\);', re.MULTILINE | re.DOTALL)
soup = BeautifulSoup(data, "html.parser")

# "string=" is the current name of the older "text=" argument
script = soup.find("script", string=pattern)
if script:
    match = pattern.search(script.text)
    if match:
        email = match.group(1)
        print(email)

Prints: [email protected].

Here we use a simple regular expression for the email address. It could be made stricter, but I doubt that is necessary in practice for this problem.
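If a stricter match is wanted, one option is to tighten the character classes so that quotes and whitespace cannot appear inside the address. The following is only a sketch using the standard `re` module. The names `STRICT_EMAIL_VAL` and `extract_email` and the address `user@example.com` are placeholders introduced here, not part of the original page, and the pattern is still far from a complete email validator.

```python
import re

# Hypothetical stricter pattern: the local part and domain may not contain
# quotes, whitespace, or "@"; the domain must end in a dot plus 2+ letters.
STRICT_EMAIL_VAL = re.compile(
    r'\.val\(\s*"([^"@\s]+@[^"@\s]+\.[A-Za-z]{2,})"\s*\)'
)


def extract_email(script_source):
    """Return the first address passed to .val("..."), or None."""
    match = STRICT_EMAIL_VAL.search(script_source)
    return match.group(1) if match else None


# Placeholder script text; user@example.com is an assumed sample value.
sample = 'jQuery("input[name=Email]").val("user@example.com");'
print(extract_email(sample))
```

Returning `None` instead of raising keeps the caller's `if` check simple when a page has no matching `.val(...)` call.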

I ran into a similar problem and the issue seems to be that calling script_tag.text returns an empty string. Instead, you have to call script_tag.string. Maybe this changed in some version of BeautifulSoup?

Anyway, @alecxe's answer didn't work for me, so I modified their solution:

import re

from bs4 import BeautifulSoup

data = """
<body>
    <script>jQuery(window).load(function () {
      setTimeout(function(){
        jQuery("input[name=Email]").val("[email protected]");
      }, 1000);
    });</script>
</body>
"""
soup = BeautifulSoup(data, "html.parser")

script_tag = soup.find("script")
if script_tag and script_tag.string:
    # .string holds the raw script source, e.g. "jQuery(window)..."
    script_tag_contents = script_tag.string

    # search that source with a regex; guard against there being no match
    match = re.search(r'\.val\("(.+?)"\);', script_tag_contents)
    if match:
        print(match.group(1))

This prints [email protected].
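Since `.text` on a script tag seems to behave differently depending on the installed Beautiful Soup release, one defensive option is to prefer `.string` and fall back to `get_text()` only when `.string` is `None`. This is a sketch under that assumption; the helper name `script_source` and the tiny sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def script_source(tag):
    """Return the raw source of a <script> tag as a str ("" if empty)."""
    # .string is None when the tag is empty or has more than one child,
    # so fall back to get_text() in that case.
    if tag.string is not None:
        return str(tag.string)
    return tag.get_text()


# Sample markup (assumed): one script with content, one empty script.
soup = BeautifulSoup(
    "<script>var a = 1;</script><script></script>", "html.parser"
)
first, second = soup.find_all("script")
print(repr(script_source(first)))
print(repr(script_source(second)))
```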

This is not possible using only BeautifulSoup, but you can do it with, for example, BS plus regular expressions:

import re
from bs4 import BeautifulSoup as BS

html = """<script> ... </script>"""

# name the parser explicitly to avoid the "no parser specified" warning
bs = BS(html, "html.parser")

txt = bs.script.get_text()

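# re.search scans the whole text; re.match would only try the very start,
# and "." does not cross the newlines in a multi-line script
email = re.search(r'\.val\("(.+?)"\);', txt).group(1)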

or like this:

...

email = txt.split('.val("')[1].split('");')[0]

To get the string inside the <script> tag, you can use .string or .contents[0] (.contents is a list of the tag's children).

from bs4 import BeautifulSoup

data = """
   <body>
<script>jQuery(window).load(function () {
  setTimeout(function(){
    jQuery("input[name=Email]").val("[email protected]");
  }, 1000);
});</script>
 </body>
    """
soup = BeautifulSoup(data, "html.parser")

script = soup.find("script")
inner_text_with_string = script.string
inner_text_with_content = script.contents[0]

print('inner_text_with_string', inner_text_with_string)
print('inner_text_with_content', inner_text_with_content)
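When the page contains several script tags, the same `.string` access can be applied to each one in turn. The sketch below collects the argument of every `.val("...")` call it finds. The helper `find_val_arguments`, the pattern name `VAL_CALL`, the sample markup, and the address `user@example.com` are all illustrative assumptions, not taken from the original page.

```python
import re

from bs4 import BeautifulSoup

# Placeholder page: two script tags, only the second one sets the field.
# user@example.com is an assumed sample value.
html = """
<body>
  <script>console.log("unrelated");</script>
  <script>jQuery("input[name=Email]").val("user@example.com");</script>
</body>
"""

VAL_CALL = re.compile(r'\.val\("([^"]+)"\)')


def find_val_arguments(markup):
    """Collect the argument of every .val("...") call in any <script> tag."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    for script in soup.find_all("script"):
        source = script.string or ""  # .string is None for empty tags
        found.extend(VAL_CALL.findall(source))
    return found


print(find_val_arguments(html))
```

Using `findall` rather than `search` means a script that sets several fields yields every value, at the cost of having to pick the right one afterwards.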
