Extracting text from a script tag using BeautifulSoup in Python
Solution 1:
As an alternative to the regex-based approach, you can parse the JavaScript code with the slimit module, which builds an Abstract Syntax Tree and lets you collect every assignment into a dictionary:
from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor
data = """
<html>
<head>
<title>My Sample Page</title>
<script>
$.ajax({
type: "POST",
url: 'http://www.example.com',
data: {
email: '[email protected]',
phone: '9999999999',
name: 'XYZ'
}
});
</script>
</head>
<body>
<h1>What a wonderful world</h1>
</body>
</html>
"""
# get the script tag contents from the html
soup = BeautifulSoup(data, 'html.parser')
script = soup.find('script')
# parse js
parser = Parser()
tree = parser.parse(script.text)
fields = {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
          for node in nodevisitor.visit(tree)
          if isinstance(node, ast.Assign)}
print fields
Prints:
{u'name': u"'XYZ'", u'url': u"'http://www.example.com'", u'type': u'"POST"', u'phone': u"'9999999999'", u'data': '', u'email': u"'[email protected]'"}
Among other fields, there are email, name and phone that you are interested in.
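Note that the values keep their original JavaScript quoting, and data maps to an empty string because its right-hand side is an object rather than a literal. If you want plain strings, a small post-processing step (a sketch that assumes the fields dict built above) strips the surrounding quotes:
# strip the JavaScript quote characters around each value
cleaned = {key: value.strip('\'"') for key, value in fields.items()}
print cleaned['email'], cleaned['phone'], cleaned['name']
# -> [email protected] 9999999999 XYZ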
Hope that helps.
Solution 2:
You can get the script tag contents via BeautifulSoup and then apply a regex to extract the desired data.
Working example (based on what you've described in the question):
import re
from bs4 import BeautifulSoup
data = """
<html>
<head>
<title>My Sample Page</title>
<script>
$.ajax({
type: "POST",
url: 'http://www.example.com',
data: {
email: '[email protected]',
phone: '9999999999',
name: 'XYZ'
}
});
</script>
</head>
<body>
<h1>What a wonderful world</h1>
</body>
</html>
"""
soup = BeautifulSoup(data, 'html.parser')
script = soup.find('script')
pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(re.findall(pattern, script.text))
print fields['email'], fields['phone'], fields['name']
Prints:
[email protected] 9999999999 XYZ
I don't really like this solution, since the regex approach is fragile: all sorts of changes to the page could break it. I still think there is a better solution and that we are missing the bigger picture here. A link to the specific site would help a lot, but it is what it is.
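If you have to stay with a regex, one way to make it slightly less fragile (still an assumption that the page keeps this general "data: {...}" shape, not a general fix) is to narrow the search to the object literal that follows data: before pulling out the key/value pairs:
# restrict the regex to the data object first (illustrative sketch;
# "data_block" is just a local name used here)
data_block = re.search(r"data:\s*\{(.*?)\}", script.text, re.DOTALL)
if data_block:
    fields = dict(re.findall(r"(\w+): '(.*?)'", data_block.group(1)))
    print fields['email'], fields['phone'], fields['name']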
UPD (fixing the code OP provided):
soup = BeautifulSoup(data, 'html.parser')
script = soup.html.find_next_sibling('script', text=re.compile(r"\$\(document\)\.ready"))
pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(re.findall(pattern, script.text))
print fields['email'], fields['phone'], fields['name']
Prints:
[email protected] 9999999999 Shamita Shetty