How to get WGET to download exact same web page html as browser
Using a web browser (IE or Chrome) I can save a web page (.html) with Ctrl-S, inspect it with any text editor, and see data in a table format. I want to extract one of those numbers, but from many, many web pages, too many to do manually. So I'd like to use WGET to get those web pages one after another, and write another program to parse the .html and retrieve the number I want. But the .html file saved by WGET when using the same URL as the browser does not contain the data table. Why not? It is as if the server detects that the request is coming from WGET and not from a web browser, and supplies a skeleton web page lacking the data table. How can I get the exact same web page with WGET? Thanks!
MORE INFO:
An example of the URL I'm trying to fetch is: http://performance.morningstar.com/fund/performance-return.action?t=ICENX&region=usa&culture=en-US where the string ICENX is a mutual fund ticker symbol, which I will be changing to any of a number of different ticker symbols. This downloads a table of data when viewed in a browser, but the data table is missing if fetched with WGET.
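For reference, the kind of fetch I'm doing looks roughly like this (the output filename is just illustrative, and the URL has to be quoted because of the & characters):
$ wget 'http://performance.morningstar.com/fund/performance-return.action?t=ICENX&region=usa&culture=en-US' -O ICENX.html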
As roadmr noted, the table on this page is generated by JavaScript. wget doesn't support JavaScript; it just dumps the page as received from the server (i.e. before any JavaScript code runs), so the table is missing.
You need a headless browser that supports JavaScript, like phantomjs:
$ phantomjs save_page.js http://example.com > page.html
with save_page.js:
// save_page.js: print the fully rendered HTML of the URL given on the command line
var system = require('system');
var page = require('webpage').create();
page.open(system.args[1], function() {
    // page.content holds the DOM serialized after the page's JavaScript has run
    console.log(page.content);
    phantom.exit();
});
Then, if you just want to extract some text, the easiest approach might be to render the page with w3m:
$ w3m -dump page.html
and/or modify the phantomjs script to just dump what you're interested in.
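For example, if you need this for many pages (one per ticker symbol, as in the question), a small shell loop can render each page with phantomjs and dump it to plain text. This is just a sketch; the ticker list and output filenames are placeholders:
for t in ICENX VFINX; do
    # render the page with save_page.js (above), then flatten it to text with w3m
    phantomjs save_page.js "http://performance.morningstar.com/fund/performance-return.action?t=$t&region=usa&culture=en-US" > "$t.html"
    w3m -dump "$t.html" > "$t.txt"
done
You can then grep the resulting .txt files for the number you're after.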
You can download a full website using wget --mirror.
Example:
wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL
The above command downloads a full website and makes it available for local viewing.
Options:
--mirror : turns on options suitable for mirroring.
-p : downloads all files that are necessary to properly display a given HTML page.
--convert-links : after the download, converts the links in the document for local viewing.
-P ./LOCAL-DIR : saves all the files and directories to the specified directory.
For more information about wget options, read the article "Overview About all wget Commands with Examples", or check wget's man page.
Instead of --recursive, which will just go ahead and "spider" every single link in your URL, use --page-requisites. It should behave exactly like the options you describe in graphical browsers.
This option causes Wget to download all the files that are
necessary to properly display a given HTML page. This includes
such things as inlined images, sounds, and referenced stylesheets.
Ordinarily, when downloading a single HTML page, any requisite
documents that may be needed to display it properly are not
downloaded. Using -r together with -l can help, but since Wget
does not ordinarily distinguish between external and inlined
documents, one is generally left with "leaf documents" that are
missing their requisites.
For more information, do man wget and look for the --page-requisites option (use "/" to search while reading a man page).
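A typical invocation might look something like this (the URL is just a placeholder); note that this fetches static requisites such as images and stylesheets, it does not run any JavaScript:
$ wget --page-requisites --convert-links 'http://example.com/page.html'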
If the server's answer differs depending on the asking source, it is mostly because of the HTTP_USER_AGENT variable (just a text string) that is sent along with the request, informing the server about the requesting client's technology.
You can check your browser's user agent here -> http://whatsmyuseragent.com
According to the wget manual, the --user-agent=AGENT parameter should do the job.
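For example, to make the request look like it comes from a regular desktop browser (the user-agent string below is just an illustrative Chrome-style string; substitute whatever http://whatsmyuseragent.com reports for your own browser):
$ wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" 'http://example.com/page.html'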
If this does not help, JavaScript processing may be needed to get the same page as a browser, or perhaps an appropriate request with GET parameters is needed so that the server prepares an answer that doesn't require JavaScript to fill in the page.