Screen Scraping from a web page with a lot of Javascript [closed]

Solution 1:

You may consider using HTMLunit It's a java class library made to automate browsing without having to control a browser, and it integrates the Mozilla Rhino Javascript engine to process javascript on the pages it loads. There's also a JRuby wrapper for that, named Celerity. Its javascript support is not really perfect right now, but if your pages don't use many hacks things should work fine the performance should be way better than controlling a browser. Furthermore, you don't have to worry about cookies being persisted after your scraping is over and all the other nasty things connected to controlling a browser (history, autocomplete, temp files etc).

Solution 2:

Since you say that no AJAX is used, then all the info is present at the HTML source. The javascript just renders it based on user clicks. So you need to reverse engineer the way the application works, parse the html and the javascript code and extract the useful information. It is strictly business of text parsing - you shouldn't deal with running javascript and producing a new DOM. This would be much more difficult to do.

If AJAX was used, your job would be easier. You could easily find out how the AJAX services work (probably by receiving JSON and XML) and extract the information.

Solution 3:

You could consider using a greasemonkey JS. greasemonkey is a very powerful Firefox add on that allows you to run your own script alongside that of specific web sites. This allows you to modify how the web site is displayed, add or remove content. You can even use it to do AJAX style lookups and add dynamic content.

If your tool is for in house use, and users are all happy to use Firefox then this could be a winner.

Regards

Solution 4:

I suggest IRobotSoft web scraper. It is a dedicated free software for screen scraping with the best javascript support. You can create and test a robot with its visual interface. You can also embed it into your own application using its ActiveX control and hide the browser window.