How to get WGET to download exact same web page html as browser

Using a web browser (IE or Chrome) I can save a web page (.html) with Ctrl-S, inspect it with any text editor, and see data in a table format. One of those numbers is what I want to extract, but for many, many web pages, too many to do manually. So I'd like to use WGET to get those web pages one after another, and write another program to parse the .html and retrieve the number I want. But the .html file saved by WGET, using the same URL as the browser, does not contain the data table. Why not? It is as if the server detects that the request is coming from WGET rather than from a web browser, and supplies a skeleton web page lacking the data table. How can I get the exact same web page with WGET? - Thx!

MORE INFO:

An example of the URL I'm trying to fetch is: http://performance.morningstar.com/fund/performance-return.action?t=ICENX&region=usa&culture=en-US where the string ICENX is a mutual fund ticker symbol, which I will be changing to any of a number of different ticker symbols. This downloads a table of data when viewed in a browser, but the data table is missing if fetched with WGET.
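Since the ticker symbol is the only part of the URL that changes, the batch-fetch part of the question can be sketched as a small shell loop. The ticker list below is a placeholder, and as the answers that follow explain, a plain wget fetch of these URLs will still miss the JavaScript-generated table:

```shell
# Placeholder ticker symbols; substitute your real list.
for t in ICENX VFINX FMAGX; do
  printf 'http://performance.morningstar.com/fund/performance-return.action?t=%s&region=usa&culture=en-US\n' "$t"
done > urls.txt
cat urls.txt
# Then fetch every page in one run (commented out here):
# wget -i urls.txt
```

wget's -i option reads a list of URLs from a file, so each page is fetched in turn without running wget once per ticker by hand.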


As roadmr noted, the table on this page is generated by JavaScript. wget doesn't support JavaScript; it just dumps the page as received from the server (i.e. before any JavaScript code runs), so the table is missing.

You need a headless browser that supports JavaScript, like PhantomJS:

$ phantomjs save_page.js http://example.com > page.html

with save_page.js:

var system = require('system');
var page = require('webpage').create();

// Open the URL given on the command line, then print the fully
// rendered HTML (after JavaScript has run) to stdout.
page.open(system.args[1], function()
{
    console.log(page.content);
    phantom.exit();
});

Then, if you just want to extract some text, the easiest approach might be to render the page with w3m:

$ w3m -dump page.html

and/or modify the phantomjs script to just dump what you're interested in.
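To sketch that last extraction step, the snippet below simulates the rendered text with a small file; in practice you would pipe the real `w3m -dump page.html` output into the same grep:

```shell
# Simulated w3m output; the real pipeline would be:
#   w3m -dump page.html | grep -oE '[0-9]+\.[0-9]+'
printf 'YTD Return  12.34\n1-Year Return  5.67\n' > dump.txt
grep -oE '[0-9]+\.[0-9]+' dump.txt   # prints 12.34 and 5.67
```

Adjust the regular expression to match whatever number format the table actually uses (percent signs, negative returns, and so on).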


You can download a full website using wget --mirror

Example:

wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL

The above command line is what you want to execute to download a full website and make it available for local viewing.

Options:

  • --mirror turns on options suitable for mirroring.

  • -p downloads all files that are necessary to properly display a given HTML page.

  • --convert-links after the download, convert the links in document for local viewing.

  • -P ./LOCAL-DIR saves all the files and directories to the specified directory.

For more info about wget options, read this article: Overview About all wget Commands with Examples, or check wget's man page.


Instead of --recursive, which will just go ahead and "spider" every single link in your URL, use --page-requisites. It should behave exactly like the options you describe in graphical browsers.

       This option causes Wget to download all the files that are
       necessary to properly display a given HTML page.  This includes
       such things as inlined images, sounds, and referenced stylesheets.

       Ordinarily, when downloading a single HTML page, any requisite
       documents that may be needed to display it properly are not
       downloaded.  Using -r together with -l can help, but since Wget
       does not ordinarily distinguish between external and inlined
       documents, one is generally left with "leaf documents" that are
       missing their requisites.

For more information, run man wget and look for the --page-requisites option (use "/" to search while reading a man page).
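As a concrete sketch of the option described above (the URL is a placeholder, and -k and -E are common companions to -p rather than part of the man-page excerpt), the command is built as a string first so it can be inspected before being run:

```shell
# -p : download everything needed to display the page (requisites)
# -k : convert links in the saved page for local viewing
# -E : save files with an .html extension where appropriate
cmd="wget -p -k -E 'http://example.com/page'"
echo "$cmd"        # inspect the command first
# eval "$cmd"      # uncomment to actually run the download
```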


If the server's answer differs depending on the asking source, it is mostly because of the User-Agent HTTP header (the HTTP_USER_AGENT variable, just a text string) that is sent with the request, informing the server about the requesting client's technology.


  1. You can check your browser's user agent here -> http://whatsmyuseragent.com

  2. According to the wget manual, the --user-agent=AGENT parameter should do the job.
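A sketch of that flag in use; the User-Agent string below is a made-up desktop-Chrome example, so copy the exact string whatsmyuseragent.com reports for your own browser:

```shell
# Hypothetical Chrome-like User-Agent string (replace with your own).
UA='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
# Build the command as positional parameters so it can be inspected,
# then executed with "$@" when ready.
set -- wget --user-agent="$UA" 'http://example.com/'
echo "$@"     # inspect; run "$@" to fetch with the spoofed agent
```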


If this does not help, JavaScript processing may be needed to get the same page as a browser, or an appropriate request with GET parameters may be needed so the server will prepare an answer that doesn't require JavaScript to fill in the page.