Simple Screen Scraping using jQuery

Solution 1:

Use $.ajax to load the other page into a variable, then create a temporary element and use .html() to set its contents to the returned value. Loop through the element's children with nodeType 1 (element nodes) and keep the nodeValue of each one's first child. If the external page is not on your own web server, you will need to proxy it through your server.

Something like this:

$.ajax({
     url: "/thePageToScrape.html",
     dataType: 'text',
     success: function(data) {
          var elements = $("<div>").html(data)[0].getElementsByTagName("ul")[0].getElementsByTagName("li");
          for(var i = 0; i < elements.length; i++) {
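               // firstChild is null for an empty <li>, so guard before reading nodeValue
               var first = elements[i].firstChild;
               var theText = first ? first.nodeValue : null;
               // Do something here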
          }
     }
});
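
The description above talks about looping over children of nodeType 1, while the snippet uses getElementsByTagName instead. A minimal sketch of the literal nodeType loop might look like the following. The helper name collectFirstTextValues is hypothetical, and it assumes you pass in the <ul> element taken from the temporary <div>.

```javascript
// Hypothetical helper: collects the nodeValue of each element child's first child.
// `parent` is assumed to be a DOM node, such as the <ul> from the scraped page.
function collectFirstTextValues(parent) {
    var ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE in browsers
    var values = [];
    var children = parent.childNodes;
    for (var i = 0; i < children.length; i++) {
        var child = children[i];
        // Skip whitespace text nodes, comments, and other non-element nodes.
        if (child.nodeType !== ELEMENT_NODE) {
            continue;
        }
        // An empty element has no firstChild, and an element's nodeValue is null,
        // so only keep values that are actually present.
        var first = child.firstChild;
        if (first && first.nodeValue !== null) {
            values.push(first.nodeValue);
        }
    }
    return values;
}
```

Inside the success callback this could be called as collectFirstTextValues($("<div>").html(data).find("ul")[0]), assuming the page contains at least one <ul>.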

Solution 2:

Simple scraping with jQuery...

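// Get HTML from page (the URL must be same-origin or send CORS headers)
$.get( 'http://example.com/', function( html ) {

    // Loop through elements you want to scrape content from.
    // Wrapping the markup in a <div> ensures a top-level <ul> is matched too,
    // because .find() only searches descendants of the parsed nodes.
    $("<div>").html(html).find("ul").find("li").each( function(){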

        var text = $(this).text();
        // Do something with content

    } )

} );

Solution 3:

$.get("/path/to/other/page",function(data){
  $('#data').append($('li',data));
});

Solution 4:

If this is for the same domain then no problem - the jQuery solution is good.

Otherwise, you can't access content from an arbitrary website, because browsers treat this as a security risk. See the same-origin policy.
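
As a rough illustration of what "same domain" means here, two URLs share an origin only when scheme, host, and port all match. The sketch below uses the standard URL class; the function name isSameOrigin is hypothetical, and it assumes both arguments are absolute http or https URLs.

```javascript
// Hypothetical helper: true when both URLs share scheme, host, and port.
function isSameOrigin(urlA, urlB) {
    var originA = new URL(urlA).origin;
    var originB = new URL(urlB).origin;
    // Schemes such as file: report an opaque origin serialized as "null";
    // treat those as never matching.
    if (originA === "null" || originB === "null") {
        return false;
    }
    // URL#origin drops default ports, so http://host:80 equals http://host.
    return originA === originB;
}
```

In a browser you would typically compare the target URL against window.location.href before deciding whether a plain $.get can work.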

There are, of course, server-side workarounds such as a web proxy or CORS headers. Or, if you're lucky, the site will support JSONP.
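
To make the web proxy idea concrete, here is a small sketch of the client-side half: rewriting a target URL so the request goes through your own server. The "/proxy" endpoint and its "url" query parameter are assumptions made for illustration; you would have to implement that endpoint yourself so that it fetches the page server-side and returns the HTML.

```javascript
// Hypothetical helper: "/proxy" is an endpoint on your own server, not a
// jQuery or browser feature. `pageOrigin` is the origin of the current page,
// for example "http://example.com".
function toProxiedUrl(targetUrl, pageOrigin) {
    var target = new URL(targetUrl);
    if (target.origin === pageOrigin) {
        // Same origin: the browser allows the request, so use the path directly.
        return target.pathname + target.search;
    }
    // Different origin: hand the full URL to the server-side proxy.
    return "/proxy?url=" + encodeURIComponent(target.href);
}
```

The result can then be passed to $.get in place of the original URL, for example $.get(toProxiedUrl(url, window.location.origin), callback).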

But if you want a client-side solution that works with an arbitrary website and web browser, then you are out of luck. There is a proposal to relax this policy, but it won't affect current web browsers.

Solution 5:

You may want to consider pjscrape:

http://nrabinowitz.github.io/pjscrape/

It allows you to do this from the command line, using JavaScript and jQuery. It does this by using PhantomJS, which is a headless WebKit browser (it has no window and exists only for your script's use, so you can load complex websites that use AJAX and they will work just as they would in a real browser).

The examples are self-explanatory and I believe this works on all platforms (including Windows).