How to retrieve dynamically generated HTML using .NET's WebBrowser or mshtml.HTMLDocument?
Most of the answers I have read concerning this subject point to either the System.Windows.Forms.WebBrowser class or the COM interface mshtml.HTMLDocument from the Microsoft HTML Object Library assembly.
The WebBrowser class did not lead me anywhere. The following code fails to retrieve the HTML code as rendered by my web browser:
[STAThread]
public static void Main()
{
    WebBrowser wb = new WebBrowser();
    wb.Navigate("https://www.google.com/#q=where+am+i");
    wb.DocumentCompleted += delegate(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)wb.Document.DomDocument;
        foreach (IHTMLElement element in doc.all)
        {
            System.Diagnostics.Debug.WriteLine(element.outerHTML);
        }
    };
    Form f = new Form();
    f.Controls.Add(wb);
    Application.Run(f);
}
The above is just an example. I'm not really interested in finding a workaround for figuring out the name of the town where I am located. I simply need to understand how to retrieve that kind of dynamically generated data programmatically.
(Call new System.Net.WebClient().DownloadString("https://www.google.com/#q=where+am+i"), save the resulting text somewhere, search for the name of the town where you are currently located, and let me know whether you were able to find it.)
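In other words, something like the following sketch of that experiment (the town-name string is a placeholder you would replace with whatever your browser actually shows):

// Download the raw HTML and check whether the dynamically generated text is in it.
using System;
using System.Net;

class RawHtmlCheck
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            string raw = client.DownloadString("https://www.google.com/#q=where+am+i");
            // "MyTown" is a placeholder - substitute the town name your browser displays
            Console.WriteLine(raw.Contains("MyTown")
                ? "Found it in the raw HTML"
                : "Not there - it is generated by JavaScript after the page loads");
        }
    }
}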
Yet when I access "https://www.google.com/#q=where+am+i" from my web browser (IE or Firefox), I see the name of my town written on the web page. In Firefox, if I right-click on the name of the town and select "Inspect Element (Q)", I clearly see the town name in the HTML, which happens to look quite different from the raw HTML returned by WebClient.
After I got tired of playing with System.Windows.Forms.WebBrowser, I decided to give mshtml.HTMLDocument a shot, only to end up with the same useless raw HTML:
public static void Main()
{
    mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)new mshtml.HTMLDocument();
    doc.write(new System.Net.WebClient().DownloadString("https://www.google.com/#q=where+am+i"));
    foreach (IHTMLElement e in doc.all)
    {
        System.Diagnostics.Debug.WriteLine(e.outerHTML);
    }
}
I suppose there must be an elegant way to obtain this kind of information. Right now all I can think of is adding a WebBrowser control to a form, having it navigate to the URL in question, sending the keys CTRL+A, copying whatever happens to be displayed on the page to the clipboard, and attempting to parse it. That's a horrible solution, though.
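For what it's worth, a minimal sketch of that clipboard hack (my own illustration of the idea described above, not a recommendation) might look roughly like this, called once the page appears fully rendered:

// Illustration only: select everything in the rendered page and copy the visible
// text to the clipboard, then read it back for parsing. Assumes a WinForms context
// where wb has already finished loading and rendering the page.
void CopyRenderedText(WebBrowser wb)
{
    wb.Document.ExecCommand("SelectAll", false, null);
    wb.Document.ExecCommand("Copy", false, null);
    string visibleText = Clipboard.GetText(); // plain text as displayed, not the HTML
    System.Diagnostics.Debug.WriteLine(visibleText);
}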
Solution 1:
I'd like to contribute some code to Alexei's answer. A few points:
- Strictly speaking, it may not always be possible to determine when a page has finished rendering with 100% certainty. Some pages are quite complex and use continuous AJAX updates. But we can get quite close by polling the page's current HTML snapshot for changes and checking the WebBrowser.IsBusy property. That's what LoadDynamicPage does below.
- Some time-out logic has to be present on top of the above, in case page rendering never ends (note the CancellationTokenSource).
- Async/await is a great tool for coding this, as it gives a linear code flow to our asynchronous polling logic, which greatly simplifies it.
- It's important to enable HTML5 rendering using Browser Feature Control, as WebBrowser runs in IE7 emulation mode by default. That's what SetFeatureBrowserEmulation does below.
- This is a WinForms app, but the concept can easily be converted into a console app (see the sketch after the full listing below).
This logic works well on the URL you've specifically mentioned: https://www.google.com/#q=where+am+i.
using Microsoft.Win32;
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace WbFetchPage
{
    public partial class MainForm : Form
    {
        public MainForm()
        {
            SetFeatureBrowserEmulation();
            InitializeComponent();
            this.Load += MainForm_Load;
        }

        // start the task
        async void MainForm_Load(object sender, EventArgs e)
        {
            try
            {
                var cts = new CancellationTokenSource(10000); // cancel in 10s
                var html = await LoadDynamicPage("https://www.google.com/#q=where+am+i", cts.Token);
                MessageBox.Show(html.Substring(0, 1024) + "..."); // it's too long!
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
            }
        }

        // navigate and download
        async Task<string> LoadDynamicPage(string url, CancellationToken token)
        {
            // navigate and await DocumentCompleted
            var tcs = new TaskCompletionSource<bool>();
            WebBrowserDocumentCompletedEventHandler handler = (s, arg) =>
                tcs.TrySetResult(true);

            using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
            {
                this.webBrowser.DocumentCompleted += handler;
                try
                {
                    this.webBrowser.Navigate(url);
                    await tcs.Task; // wait for DocumentCompleted
                }
                finally
                {
                    this.webBrowser.DocumentCompleted -= handler;
                }
            }

            // get the root element
            var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0];

            // poll the current HTML for changes asynchronously
            var html = documentElement.OuterHtml;
            while (true)
            {
                // wait asynchronously, this will throw if cancellation requested
                await Task.Delay(500, token);

                // continue polling if the WebBrowser is still busy
                if (this.webBrowser.IsBusy)
                    continue;

                var htmlNow = documentElement.OuterHtml;
                if (html == htmlNow)
                    break; // no changes detected, end the poll loop

                html = htmlNow;
            }

            // consider the page fully rendered
            token.ThrowIfCancellationRequested();
            return html;
        }

        // enable HTML5 (assuming we're running IE10+)
        // more info: https://stackoverflow.com/a/18333982/1768303
        static void SetFeatureBrowserEmulation()
        {
            if (LicenseManager.UsageMode != LicenseUsageMode.Runtime)
                return;

            var appName = System.IO.Path.GetFileName(System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName);
            Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
                appName, 10000, RegistryValueKind.DWord);
        }
    }
}
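As mentioned in the points above, the same approach can be hosted in a console app. Here is a rough sketch of that conversion (my own untested assumption rather than part of the original answer): the WebBrowser control needs an STA thread and a Windows message loop, which Application.Run provides.

// Minimal sketch: host WebBrowser in a console app by pumping messages ourselves.
// Everything here is illustrative; error handling and the change-polling logic
// from LoadDynamicPage above are omitted for brevity.
using System;
using System.Windows.Forms;

static class Program
{
    [STAThread]
    static void Main()
    {
        var context = new ApplicationContext(); // message pump without a visible window
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        browser.DocumentCompleted += (s, e) =>
        {
            // DocumentCompleted only guarantees the initial document has loaded;
            // the same polling as above would still be needed for AJAX-generated content
            Console.WriteLine(browser.Document.GetElementsByTagName("html")[0].OuterHtml);
            context.ExitThread(); // stop the message loop
        };

        browser.Navigate("https://www.google.com/#q=where+am+i");
        Application.Run(context); // run until ExitThread is called
    }
}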
Solution 2:
Your web-browser code looks reasonable: wait for something, then grab the current content. Unfortunately, there is no official "I'm done executing JavaScript, feel free to steal the content" notification from either the browser or the page's JavaScript.
Some sort of active wait (not Sleep, but a Timer) may be necessary, and it will be page-specific. Even if you use a headless browser (e.g. PhantomJS), you'll have the same issue.
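As a rough illustration of that kind of active wait (my own sketch; the marker string and the 10-second deadline are arbitrary assumptions), a WinForms Timer could poll the rendered document until the content of interest shows up or the deadline passes:

// Poll the WebBrowser's rendered document on a UI timer instead of blocking with Sleep.
using System;
using System.Windows.Forms;

public static class ActiveWait
{
    public static void PollForContent(WebBrowser browser, Action<string> onReady)
    {
        var deadline = DateTime.UtcNow.AddSeconds(10); // page-specific time-out
        var timer = new Timer { Interval = 500 };      // System.Windows.Forms.Timer, fires on the UI thread
        timer.Tick += (s, e) =>
        {
            string html = browser.Document?.Body?.OuterHtml ?? "";
            // "result-text" is a hypothetical marker; use whatever identifies the
            // dynamically generated content on the page you are scraping
            if (html.Contains("result-text") || DateTime.UtcNow > deadline)
            {
                timer.Stop();
                onReady(html);
            }
        };
        timer.Start();
    }
}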