Scrape web page data generated by javascript

Solution 1:

You need to look at PhantomJS.

From their site:

PhantomJS is a headless WebKit with JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.

Using the API you can script the "browser" to interact with that page and scrape the data you need. You can then do whatever you need with it; including passing it to a PHP script if necessary.


That being said, if at all possible try not to "scrape" the data. If there is an ajax call the page is making, maybe there is an API you can use instead? If not, maybe you can convince them to make one. That would of course be much easier and more maintainable than screen scraping.

Solution 2:

First, you need PhantomJS. Suggested install method on Linux:

wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
tar xvf phantomjs-2.1.1-linux-x86_64.tar.bz2
cp phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin

Second, you need the php-phantomjs package. Assuming you have installed Composer:

composer require jonnyw/php-phantomjs

Or follow installation documentation here.

Third, Load the package to your script, and instead of file_get_contents, you will load the page via PhantomJS

<?php
require ('vendor/autoload.php');

$client = Client::getInstance();
$client->getEngine()->setPath('/usr/local/bin/phantomjs');
$client = Client::getInstance();
$request  = $client->getMessageFactory()->createRequest();
$response = $client->getMessageFactory()->createResponse();

$request->setMethod('GET');
$request->setUrl('https://www.your_page_embeded_ajax_request');

$client->send($request, $response);

if($response->getStatus() === 200) {
    echo "Do something here";
}