Web-scraping JavaScript page with Python

I'm trying to develop a simple web scraper. I want to extract text without the HTML code. In fact, I achieve this goal, but I have seen that in some pages where JavaScript is loaded I didn't obtain good results.

For example, if some JavaScript code adds some text, I can't see it, because when I call

response = urllib2.urlopen(request)

I get the original text without the added one (because JavaScript is executed in the client).

So, I'm looking for some ideas to solve this problem.

EDIT Sept 2021: phantomjs isn't maintained any more, either

EDIT 30/Dec/2017: This answer appears in top results of Google searches, so I decided to update it. The old answer is still at the end.

dryscape isn't maintained anymore and the library dryscape developers recommend is Python 2 only. I have found using Selenium's python library with Phantom JS as a web driver fast enough and easy to get the work done.

Once you have installed Phantom JS, make sure the phantomjs binary is available in the current path:

phantomjs --version
# result:
2.1.1

#Example To give an example, I created a sample page with following HTML code. (link):

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script> 
</body>
</html>

without javascript it says: No javascript support and with javascript: Yay! Supports javascript

#Scraping without JS support:

import requests
from bs4 import BeautifulSoup
response = requests.get(my_url)
soup = BeautifulSoup(response.text)
soup.find(id="intro-text")
# Result:
<p id="intro-text">No javascript support</p>

#Scraping with JS support:

from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(my_url)
p_element = driver.find_element_by_id(id_='intro-text')
print(p_element.text)
# result:
'Yay! Supports javascript'

You can also use Python library dryscrape to scrape javascript driven websites.

#Scraping with JS support:

import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response)
soup.find(id="intro-text")
# Result:
<p id="intro-text">Yay! Supports javascript</p>

We are not getting the correct results because any javascript generated content needs to be rendered on the DOM. When we fetch an HTML page, we fetch the initial, unmodified by javascript, DOM.

Therefore we need to render the javascript content before we crawl the page.

As selenium is already mentioned many times in this thread (and how slow it gets sometimes was mentioned also), I will list two other possible solutions.

Solution 1: This is a very nice tutorial on how to use Scrapy to crawl javascript generated content and we are going to follow just that.

What we will need:

Docker installed in our machine. This is a plus over other solutions until this point, as it utilizes an OS-independent platform.
Install Splash following the instruction listed for our corresponding OS.
Quoting from splash documentation:

Splash is a javascript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5.

Essentially we are going to use Splash to render Javascript generated content.
Run the splash server: sudo docker run -p 8050:8050 scrapinghub/splash.
Install the scrapy-splash plugin: pip install scrapy-splash
Assuming that we already have a Scrapy project created (if not, let's make one), we will follow the guide and update the settings.py:
Then go to your scrapy project’s settings.py and set these middlewares:
```
DOWNLOADER_MIDDLEWARES = {
      'scrapy_splash.SplashCookiesMiddleware': 723,
      'scrapy_splash.SplashMiddleware': 725,
      'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
```
The URL of the Splash server(if you’re using Win or OSX this should be the URL of the docker machine: How to get a Docker container's IP address from the host?):
```
SPLASH_URL = 'http://localhost:8050'
```
And finally you need to set these values too:
```
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

Finally, we can use a SplashRequest:

In a normal spider you have Request objects which you can use to open URLs. If the page you want to open contains JS generated data you have to use SplashRequest(or SplashFormRequest) to render the page. Here’s a simple example:
class MySpider(scrapy.Spider):
    name = "jsscraper"
    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
        yield SplashRequest(
            url=url, callback=self.parse, endpoint='render.html'
        )

    def parse(self, response):
        for q in response.css("div.quote"):
        quote = QuoteItem()
        quote["author"] = q.css(".author::text").extract_first()
        quote["quote"] = q.css(".text::text").extract_first()
        yield quote
SplashRequest renders the URL as html and returns the response which you can use in the callback(parse) method.

Solution 2: Let's call this experimental at the moment (May 2018)...
This solution is for Python's version 3.6 only (at the moment).

Do you know the requests module (well who doesn't)?
Now it has a web crawling little sibling: requests-HTML:

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

Install requests-html: pipenv install requests-html

Make a request to the page's url:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get(a_page_url)

Render the response to get the Javascript generated bits:
```
r.html.render()
```

Finally, the module seems to offer scraping capabilities.
Alternatively, we can try the well-documented way of using BeautifulSoup with the r.html object we just rendered.

Maybe selenium can do it.

from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get(url)
time.sleep(5)
htmlSource = driver.page_source

If you have ever used the Requests module for python before, I recently found out that the developer created a new module called Requests-HTML which now also has the ability to render JavaScript.

You can also visit https://html.python-requests.org/ to learn more about this module, or if your only interested about rendering JavaScript then you can visit https://html.python-requests.org/?#javascript-support to directly learn how to use the module to render JavaScript using Python.

Essentially, Once you correctly install the Requests-HTML module, the following example, which is shown on the above link, shows how you can use this module to scrape a website and render JavaScript contained within the website:

from requests_html import HTMLSession
session = HTMLSession()

r = session.get('http://python-requests.org/')

r.html.render()

r.html.search('Python 2 will retire in only {months} months!')['months']

'<time>25</time>' #This is the result.

I recently learnt about this from a YouTube video. Click Here! to watch the YouTube video, which demonstrates how the module works.

It sounds like the data you're really looking for can be accessed via secondary URL called by some javascript on the primary page.

While you could try running javascript on the server to handle this, a simpler approach to might be to load up the page using Firefox and use a tool like Charles or Firebug to identify exactly what that secondary URL is. Then you can just query that URL directly for the data you are interested in.

Web-scraping JavaScript page with Python

Related

Recent Posts