Scrape web pages in real time with Node.js
Node.io seems to take the cake :-)
All of the aforementioned solutions presume running the scraper locally. That severely limits performance, because pages are fetched in sequence or on a limited set of threads. A better approach, imho, is to rely on an existing, albeit commercial, scraping grid.
Here is an example:
// Assumes the Bobik class from the SDK linked below is already loaded
var bobik = new Bobik("YOUR_AUTH_TOKEN");
bobik.scrape({
  urls: ['amazon.com', 'zynga.com', 'http://finance.google.com/', 'http://shopping.yahoo.com'],
  queries: ["//th", "//img/@src", "return document.title", "return $('script').length", "#logo", ".logo"]
}, function (scraped_data) {
  if (!scraped_data) {
    console.log("Data is unavailable");
    return;
  }
  // Iterate over the URLs themselves (not array indices) so the log shows each URL
  Object.keys(scraped_data).forEach(function (url) {
    console.log("Results from " + url + ": " + scraped_data[url]);
  });
});
Here, scraping is performed remotely and a callback is issued to your code only when results are ready (there is also an option to collect results as they become available).
You can download the Bobik client proxy SDK at https://github.com/emirkin/bobik_javascript_sdk
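If you prefer promises to callbacks, a callback-style call like the one above can be wrapped by hand. This is only a minimal sketch: `scrapeAsync` and `fakeClient` are names I made up for illustration, and the stub stands in for the remote service so the snippet runs without an auth token. The only things taken from the example above are the `scrape(request, callback)` shape and the "falsy result means data is unavailable" check.

```javascript
// Wraps a callback-style scrape(request, callback) client in a Promise.
// `client` is any object with that method; a falsy result is treated as
// "data unavailable", mirroring the check in the example above.
function scrapeAsync(client, request) {
  return new Promise(function (resolve, reject) {
    client.scrape(request, function (scrapedData) {
      if (!scrapedData) {
        reject(new Error('Data is unavailable'));
        return;
      }
      resolve(scrapedData);
    });
  });
}

// Hypothetical stand-in for the remote service (NOT the real SDK).
// It answers asynchronously with canned data keyed by URL.
const fakeClient = {
  scrape: function (request, callback) {
    const results = {};
    request.urls.forEach(function (url) {
      results[url] = ['stub result for ' + url];
    });
    setImmediate(function () { callback(results); });
  }
};

scrapeAsync(fakeClient, { urls: ['amazon.com'], queries: ['//th'] })
  .then(function (data) { console.log(Object.keys(data)); })
  .catch(function (err) { console.error(err.message); });
```

With the real client you would pass the `bobik` object instead of `fakeClient`, assuming its `scrape` method behaves as in the example above.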
I've been doing research myself, and https://npmjs.org/package/wscraper describes itself as
"a web scraper agent based on cheerio.js, a fast, flexible, and lean implementation of core jQuery; built on top of request.js; inspired by http-agent.js"
Very low usage (according to npmjs.org) but worth a look for any interested parties.
You don't always need jQuery. If you play with the DOM returned from jsdom, for example, you can easily take what you need yourself (especially since you don't have to worry about cross-browser issues). See: https://gist.github.com/1335009. That's not taking anything away from node.io, just saying you might be able to do it yourself, depending on what you need.
The new way using promises and async/await (ES2017)
Usually when you're scraping you want some method to
- Get the resource from the web server (usually an HTML document)
- Read that resource and work with it either as
  - a DOM/tree structure that you can navigate, or
  - a stream of tokens, using something like SAX.
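To make the token option concrete, here is a toy sketch of what token-style reading looks like: the markup is walked once and each tag or text run is handed to a callback, with no tree ever built. `tokenize` and `extractTitle` are made-up names, and the regex is deliberately naive (it ignores comments, CDATA, script bodies, and attribute values containing `>`), so treat it as an illustration of the idea rather than a usable parser.

```javascript
// Toy tokenizer: reports tokens in document order without building a tree.
// NOT a real HTML parser; it only handles simple, well-formed markup.
function tokenize(html, onToken) {
  // Alternatives: declarations like <!DOCTYPE ...> (skipped), closing tags,
  // opening tags, and runs of text between tags.
  const tokenPattern = /<![^>]*>|<\/([a-zA-Z][^\s>]*)\s*>|<([a-zA-Z][^\s\/>]*)[^>]*>|([^<]+)/g;
  let match;
  while ((match = tokenPattern.exec(html)) !== null) {
    if (match[1] !== undefined) {
      onToken({ type: 'close', name: match[1].toLowerCase() });
    } else if (match[2] !== undefined) {
      onToken({ type: 'open', name: match[2].toLowerCase() });
    } else if (match[3] !== undefined) {
      onToken({ type: 'text', value: match[3] });
    }
    // A declaration matched none of the capture groups and is ignored.
  }
}

// Collects the text that appears between <title> and </title>.
// The handler has to track "where am I" itself, which is the main
// cost of token parsing compared with navigating a tree.
function extractTitle(html) {
  let insideTitle = false;
  let title = '';
  tokenize(html, function (token) {
    if (token.type === 'open' && token.name === 'title') {
      insideTitle = true;
    } else if (token.type === 'close' && token.name === 'title') {
      insideTitle = false;
    } else if (token.type === 'text' && insideTitle) {
      title += token.value;
    }
  });
  return title.trim();
}

console.log(extractTitle('<html><head><title>Example Domain</title></head></html>'));
// prints "Example Domain"
```

Notice how much bookkeeping even "get the title" needs here; with a tree you would just select the element.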
Both tree and token parsing have advantages, but tree is usually substantially simpler. We'll do that. Check out request-promise; here is how it works:
const rp = require('request-promise');
const cheerio = require('cheerio'); // Basically jQuery for node.js

const options = {
  uri: 'http://www.google.com',
  // Hand the response body to cheerio so the promise resolves with a $ function
  transform: function (body) {
    return cheerio.load(body);
  }
};

rp(options)
  .then(function ($) {
    // Process html like you would with jQuery...
  })
  .catch(function (err) {
    // Crawling failed or Cheerio choked...
  });
This uses cheerio, which is essentially a lightweight, server-side, jQuery-esque library (it doesn't need a window object or jsdom).
Because you're using promises, you can also write this in an asynchronous function. It looks synchronous, but it is still asynchronous thanks to async/await (ES2017):
async function parseDocument() {
  let $;
  try {
    $ = await rp(options);
  } catch (err) {
    console.error(err);
    return; // without this, $ is undefined below and the next line throws
  }
  console.log( $('title').text() ); // prints just the text in the <title>
}
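The same promise style extends to several pages at once. The following is only a sketch under a few assumptions: `scrapeAll` is a made-up helper, `fetchPage` is a placeholder for any promise-returning fetcher (for example a function that calls `rp` with the right options), and `fakeFetch` is a stub so the snippet runs without network access. It relies on `Promise.allSettled`, which needs Node 12.9 or later.

```javascript
// Starts fetchPage(url) for every URL at once and waits for all of them.
// One failing page does not discard the results of the others.
async function scrapeAll(urls, fetchPage) {
  const settled = await Promise.allSettled(
    // The async wrapper turns a synchronous throw into a rejection too
    urls.map(async (url) => fetchPage(url))
  );
  return settled.map((outcome, i) =>
    outcome.status === 'fulfilled'
      ? { url: urls[i], ok: true, value: outcome.value }
      : { url: urls[i], ok: false, error: outcome.reason }
  );
}

// Hypothetical stub fetcher used only for this demo (no real HTTP request)
const fakeFetch = async (url) => {
  if (url.includes('bad')) {
    throw new Error('request failed');
  }
  return 'title of ' + url;
};

scrapeAll(['http://a.example', 'http://bad.example'], fakeFetch)
  .then((results) => {
    results.forEach((r) => console.log(r.url, r.ok ? r.value : r.error.message));
  });
```

This fires every request at the same moment, which is fine for a handful of URLs; for a long list you would want to cap how many run concurrently.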