How can I scrape website content in PHP from a website that requires a cookie login?

My problem is that it doesn't just require a basic cookie, but rather asks for a session cookie, and for randomly generated IDs. I think this means I need to use a web browser emulator with a cookie jar?

I have tried Snoopy, Goutte and a couple of other web browser emulators, but so far I have not been able to find any tutorial on how to handle cookies. I am getting a little desperate!

Can anyone give me an example of how to accept cookies in Snoopy or Goutte?

Thanks in advance!


Solution 1:

You can do that in cURL without needing external 'emulators'.

The code below retrieves a page into a PHP variable to be parsed.

Scenario

There is a page (let's call it HOME) that opens the session. On the server side, if the site is in PHP, that is whichever page (any one, actually) calls session_start() for the first time; in other languages you need a specific page that does all the session setup. From the client side, it's the page supplying the session ID cookie. In PHP, all sessioned pages do this; in other languages the landing page does it, and every other page checks whether the cookie is there and, if it isn't, drops you back to HOME instead of creating the session.
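As a minimal sketch of that server side, assuming PHP (the '/home' URL and the 'initialized' flag are illustrative, not taken from any real target site):

    <?php
    // HOME (or any sessioned PHP page): opens the session.
    // The response carries the session ID cookie.
    session_start();
    $_SESSION['initialized'] = true;

    // On any other page, the check would look like this:
    // if (empty($_SESSION['initialized'])) {
    //     header('Location: /home');   // hypothetical HOME URL
    //     exit;
    // }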

There is a page (LOGIN) that generates the login form and, on success, adds a critical piece of information to the session: "this user is logged in". In the code below, this is the page asking for the session ID.

And finally there are N pages where the goodies to be scraped reside.

So we want to hit HOME, then LOGIN, then GOODIES one after another. In PHP (and other languages actually), again, HOME and LOGIN might well be the same page. Or all pages might share the same address, for example in Single Page Applications.

The Code

    $url            = "the url generating the session ID";
    $next_url       = "the url asking for session";

    $ch             = curl_init();
    curl_setopt($ch, CURLOPT_URL,    $url);
    // We do not authenticate here, we only access the page to get a session going.
    // If a HEAD request turns out not to be enough (you'll see that cookiefile
    // remains empty), change CURLOPT_NOBODY to False below.
    curl_setopt($ch, CURLOPT_NOBODY, True);

    // You may want to change User-Agent here, too
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookiefile");
    curl_setopt($ch, CURLOPT_COOKIEJAR,  "cookiefile");

    // Just in case
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    $ret    = curl_exec($ch);

    // This page we retrieve, and scrape, with GET method
    foreach(array(
            CURLOPT_POST            => False,       // We GET...
            CURLOPT_NOBODY          => False,       // ...the body...
            CURLOPT_URL             => $next_url,   // ...of $next_url...
            CURLOPT_BINARYTRANSFER  => True,        // ...as binary...
            CURLOPT_RETURNTRANSFER  => True,        // ...into $ret...
            CURLOPT_FOLLOWLOCATION  => True,        // ...following redirections...
            CURLOPT_MAXREDIRS       => 5,           // ...reasonably...
            CURLOPT_REFERER         => $url,        // ...as if we came from $url...
            //CURLOPT_COOKIEFILE      => 'cookiefile', // Save these cookies
            //CURLOPT_COOKIEJAR       => 'cookiefile', // (already set above)
            CURLOPT_CONNECTTIMEOUT  => 30,          // Seconds
            CURLOPT_TIMEOUT         => 300,         // Seconds
            CURLOPT_LOW_SPEED_LIMIT => 16384,       // Abort below 16 KB/s...
            CURLOPT_LOW_SPEED_TIME  => 15,          // ...for 15 seconds
            ) as $option => $value)
            if (!curl_setopt($ch, $option, $value))
                    die("could not set $option to " . serialize($value));

    $ret = curl_exec($ch);
    // Done; cleanup.
    curl_close($ch);

Implementation

First of all we have to get the login page.

We use a special User-Agent to introduce ourselves, both to be recognizable (we don't want to antagonize the webmaster) and to coax the server into sending us the same browser-tailored version of the site a real browser would get. Ideally, we use the same User-Agent as whatever browser we're going to use to debug the page, plus a suffix making it clear to whoever checks that it is an automated tool they're looking at (see comment by Halfer).

    $ua = 'Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0 (ROBOT)';
    $cookiefile = "cookiefile";
    $url1 = "the login url generating the session ID";

    $ch             = curl_init();

    curl_setopt($ch, CURLOPT_URL,            $url1);
    curl_setopt($ch, CURLOPT_USERAGENT,      $ua);
    curl_setopt($ch, CURLOPT_COOKIEFILE,     $cookiefile);
    curl_setopt($ch, CURLOPT_COOKIEJAR,      $cookiefile);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, True);
    curl_setopt($ch, CURLOPT_NOBODY,         False);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, True);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, True);
    $ret    = curl_exec($ch);

This will retrieve the page asking for user/password. By inspecting the page, we find the needed fields (including hidden ones) and can populate them. The FORM tag tells us whether we need to go on with POST or GET.

We might want to inspect the form code to adjust the following operations, so we ask cURL to return the page content as-is into $ret, body included. Sometimes, CURLOPT_NOBODY set to True is already enough to trigger session creation and cookie submission, and if so, it's faster. But CURLOPT_NOBODY ("no body") works by issuing a HEAD request instead of a GET, and sometimes the HEAD request doesn't work because the server will only react to a full GET.
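For instance, here is a minimal sketch of such an inspection using PHP's built-in DOM, assuming the login page is already in $ret and contains a single FORM (field names will vary per site):

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate real-world, imperfect HTML
    $dom->loadHTML($ret);
    libxml_clear_errors();

    $xpath  = new DOMXPath($dom);
    $form   = $xpath->query('//form')->item(0);
    $action = $form->getAttribute('action');                      // where to submit
    $method = strtoupper($form->getAttribute('method') ?: 'GET'); // POST or GET

    $fields = [];
    foreach ($xpath->query('.//input[@name]', $form) as $input) {
        // Hidden fields (e.g. anti-CSRF tokens) must be sent back verbatim.
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }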

Instead of retrieving the body this way, it is also possible to log in using a real Firefox and sniff the form content being posted with Firebug (or Chrome's DevTools); some sites will populate or modify hidden fields with JavaScript, so that the form actually submitted is not the one you see in the HTML code.

A webmaster who wanted his site not to be scraped might send a hidden field with a timestamp. A human being, unaided by a too-clever browser (there are ways to tell browsers not to autofill; at worst, the webmaster changes the names of the user and password fields on every request), takes at least three seconds to fill a form. A cURL script takes zero. Of course, a delay can be simulated. It's all shadowboxing...
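Simulating that delay is a one-liner (the three-to-eight-second range is an illustration; random_int() assumes PHP 7+):

    // Pause three to eight seconds, like a human filling the form would.
    usleep(random_int(3000000, 8000000));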

We may also want to be on the lookout for form appearance. A webmaster could, for example, build a form asking for name, email, and password, and then use CSS to move the "email" field where you would expect to find the name, and vice versa. So the real form being submitted will have a "@" in a field called username, and none in the field called email. The server, which expects this, merely swaps the two fields back. A "scraper" built by hand (or a spambot) would do what seems natural and send an email address in the email field; and by doing so, it betrays itself. By working through the form once with a real CSS- and JS-aware browser, sending meaningful data, and sniffing what actually gets sent, we might be able to overcome this particular obstacle. Might, because there are ways of making life difficult. As I said: shadowboxing.

Back to the case at hand: the form contains three fields and has no JavaScript overlay. We have cPASS, cUSR, and checkLOGIN with a value of 'Check login'.

So we prepare the form with the proper fields. Note that the form is to be sent as application/x-www-form-urlencoded, which in PHP cURL means two things:

  • we are to use CURLOPT_POST
  • the option CURLOPT_POSTFIELDS must be a string (an array would signal cURL to submit as multipart/form-data, which might work... or might not).

The form fields are, as the MIME type says, urlencoded; PHP's urlencode() function does exactly that.
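Note that http_build_query() builds the whole urlencoded string in one call (the Browser class in Solution 2 below uses exactly that); the loop in the snippet further down is its explicit, spelled-out equivalent:

    $string = http_build_query([
        'checkLOGIN' => 'Check Login',
        'cUSR'       => 'jb007',
        'cPASS'      => 'astonmartin',
    ]);
    // checkLOGIN=Check+Login&cUSR=jb007&cPASS=astonmartin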

We read the action attribute of the form; that's the URL to which we must submit our authentication.

So everything being ready...

    $fields = array(
        'checkLOGIN' => 'Check Login',
        'cUSR'       => 'jb007',
        'cPASS'      => 'astonmartin',
    );
    $coded = array();
    foreach($fields as $field => $value)
        $coded[] = $field . '=' . urlencode($value);
    $string = implode('&', $coded);

    curl_setopt($ch, CURLOPT_URL,         $url1); //same URL as before, the login url generating the session ID
    curl_setopt($ch, CURLOPT_POST,        True);
    curl_setopt($ch, CURLOPT_POSTFIELDS,  $string);
    $ret    = curl_exec($ch);

We expect now a "Hello, James - how about a nice game of chess?" page. But more than that, we expect that the session associated with the cookie saved in the $cookiefile has been supplied with the critical information -- "user is authenticated".
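It is prudent to verify that before spidering; a hedged sketch (the greeting text is the hypothetical one above, substitute whatever marker the real post-login page shows):

    if (strpos($ret, 'Hello, James') === false) {
        die('Login seems to have failed: marker not found in response.');
    }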

So all following page requests made using $ch and the same cookie jar will be granted access, allowing us to 'scrape' pages quite easily - just remember to set request mode back to GET:

    curl_setopt($ch, CURLOPT_POST,        False);

    // Start spidering
    foreach($urls as $url)
    {
        curl_setopt($ch, CURLOPT_URL, $url);
        $HTML = curl_exec($ch);
        if (False === $HTML)
        {
            // Something went wrong, check curl_error() and curl_errno().
        }
    }
    curl_close($ch);

In the loop, you have access to $HTML -- the HTML code of every single page.

Great the temptation of using regexps is. Resist it you must. To better cope with ever-changing HTML, as well as being sure not to turn up false positives or false negatives when the layout stays the same but the content changes (e.g. you discover that you have the weather forecasts of Nice, Tourrette-Levens, Castagniers, but never Asprémont or Gattières, and isn't that cürious?), the best option is to use DOM:

Grabbing the href attribute of an A element
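As an illustration of that approach, a minimal sketch using PHP's built-in DOM ($HTML being a page retrieved in the loop above):

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // don't choke on sloppy markup
    $dom->loadHTML($HTML);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Grab the href attribute of every A element.
    foreach ($xpath->query('//a[@href]') as $a) {
        echo $a->getAttribute('href'), "\n";
    }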

Solution 2:

Object-Oriented answer

We implement as much as possible of the previous answer in one class called Browser that should supply the normal navigation features.

Then we should be able to put the site-specific code, in very simple form, in a new derived class that we call, say, FooBrowser, that performs scraping of the site Foo.

The class deriving from Browser must supply some site-specific functions, such as a path() function telling it where to store site-specific information, for example:

function path($basename) {
    return '/var/tmp/www.foo.bar/' . $basename;
}

abstract class Browser
{
    private $options = [];
    private $state   = [];
    protected $cookies;

    abstract protected function path($basename);

    public function __construct($site, $options = []) {
        $this->cookies   = $this->path('cookies');
        $this->options  = array_merge(
            [
                'site'      => $site,
                'userAgent' => 'Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0 - LeoScraper',
                'waitTime'  => 250000,
            ],
            $options
        );
        $this->state = [
            'referer' => '/',
            'url'     => '',
            'curl'    => '',
        ];
        $this->__wakeup();
    }

    /**
     * Reactivates after sleep (e.g. in session) or creation
     */
    public function __wakeup() {
        $this->state['curl'] = curl_init();
        $this->config([
            CURLOPT_USERAGENT       => $this->options['userAgent'],
            CURLOPT_ENCODING        => '',              // accept any encoding cURL supports
            CURLOPT_NOBODY          => false,           // retrieving the body...
            CURLOPT_BINARYTRANSFER  => true,            // ...as binary...
            CURLOPT_RETURNTRANSFER  => true,            // ...into the return value...
            CURLOPT_FOLLOWLOCATION  => true,            // ...following redirections...
            CURLOPT_MAXREDIRS       => 5,               // ...reasonably.
            CURLOPT_COOKIEFILE      => $this->cookies,  // read cookies from here...
            CURLOPT_COOKIEJAR       => $this->cookies,  // ...and save them back here
            CURLOPT_CONNECTTIMEOUT  => 30,              // seconds
            CURLOPT_TIMEOUT         => 300,             // seconds
            CURLOPT_LOW_SPEED_LIMIT => 16384,           // abort below 16 KB/s...
            CURLOPT_LOW_SPEED_TIME  => 15,              // ...for 15 seconds
        ]);
    }

    /**
     * Imports an options array.
     *
     * @param array $opts
     * @throws \Exception if an option cannot be set
     */
    private function config(array $opts = []) {
        foreach ($opts as $key => $value) {
            if (true !== curl_setopt($this->state['curl'], $key, $value)) {
                throw new \Exception("Could not set cURL option {$key}");
            }
        }
    }

    private function perform($url) {
        $this->state['referer'] = $this->state['url'];
        $this->state['url'] = $url;
        $this->config([
            CURLOPT_URL     => $this->options['site'] . $this->state['url'],
            CURLOPT_REFERER => $this->options['site'] . $this->state['referer'],
        ]);
        $response = curl_exec($this->state['curl']);
        // Should we ever want to randomize waitTime, do so here.
        usleep($this->options['waitTime']);

        return $response;
    }

    /**
     * Gets, and optionally sets, a configuration option.
     *
     * @param string $key   configuration key name
     * @param mixed  $value new value, if supplied
     * @return mixed        the previous value
     */
    protected function option($key, $value = '__DEFAULT__') {
        $curr   = $this->options[$key];
        if ('__DEFAULT__' !== $value) {
            $this->options[$key]    = $value;
        }
        return $curr;
    }

    /**
     * Performs a POST.
     *
     * @param $url
     * @param $fields
     * @return mixed
     */
    public function post($url, array $fields) {
        $this->config([
            CURLOPT_POST       => true,
            CURLOPT_POSTFIELDS => http_build_query($fields),
        ]);
        return $this->perform($url);
    }

    /**
     * Performs a GET.
     *
     * @param       $url
     * @param array $fields
     * @return mixed
     */
    public function get($url, array $fields = []) {
        $this->config([ CURLOPT_POST => false ]);
        if (empty($fields)) {
            $query = '';
        } else {
            $query = '?' . http_build_query($fields);
        }
        return $this->perform($url . $query);
    }
}

Now to scrape FooSite:

/* WWW_FOO_COM requires username and password to construct */

class WWW_FOO_COM_Browser extends Browser
{
    private $loggedIn   = false;

    // Required by the abstract Browser class: where this site's files live.
    protected function path($basename) {
        return '/var/tmp/www.foo.bar.baz/' . $basename;
    }

    public function __construct($username, $password) {
        parent::__construct('http://www.foo.bar.baz', [
            'username'  => $username,
            'password'  => $password,
            'waitTime'  => 250000,
            'userAgent' => 'FooScraper',
            'cache'     => true
        ]);
        // Open the session
        $this->get('/');
        // Navigate to the login page
        $this->get('/login.do');
    }

    /**
     * Perform login.
     */
    public function login() {
        $response = $this->post(
            '/ajax/loginPerform',
            [
                'j_un'    => $this->option('username'),
                'j_pw'    => $this->option('password'),
            ]
        );
        // TODO: verify that the response is OK, e.g.:
        // if (!strstr($response, "Welcome " . $this->option('username'))) {
        //     throw new \Exception("Bad username or password");
        // }
        $this->loggedIn = true;
        return true;
    }

    public function scrape($entry) {
        // We could implement caching to avoid scraping the same entry
        // too often. Save $data into path("entry-" . md5($entry))
        // and verify the filemtime of said file, is it newer than time()
        // minus, say, 86400 seconds? If yes, return file_get_content and
        // leave remote site alone.
        $data = $this->get(
            '/foobars/baz.do',
            [
                'ticker' => $entry
            ]
        );
        return $data;
    }
}
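The caching sketched in the comments of scrape() might look like this hypothetical extra method on the same class (the one-day TTL and the "entry-" prefix come straight from those comments):

public function scrapeCached($entry) {
    $cacheFile = $this->path('entry-' . md5($entry));
    // Serve the cached copy if it is younger than one day (86400 s).
    if (is_file($cacheFile) && filemtime($cacheFile) > time() - 86400) {
        return file_get_contents($cacheFile);
    }
    $data = $this->scrape($entry);
    file_put_contents($cacheFile, $data);
    return $data;
}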

Now the actual scraping code would be:

    $scraper = new WWW_FOO_COM_Browser('lserni', 'mypassword');
    if (!$scraper->login()) {
        throw new \Exception("bad user or pass");
    }
    // www.foo.com is a ticker site, we need little info for each
    // Other examples might be much more complex.
    $entries = [
        'APPL', 'MSFT', 'XKCD'
    ];
    foreach ($entries as $entry) {
        $html = $scraper->scrape($entry);
        // Parse HTML
    }

Mandatory notice: use a suitable parser to get data from raw HTML.