Work-around a StackOverflowException

Solution 1:

I just patched an error that I believe is the same as the one you're describing, and uploaded the patch to the HtmlAgilityPack (HAP) project site:

http://www.codeplex.com/site/users/view/sjdirect (see the patch on 3/8/2012)

Further documentation of the issue and the result is here:

https://code.google.com/p/abot/issues/detail?id=77

The actual fix: I added HtmlDocument.OptionMaxNestedChildNodes, which can be set to prevent StackOverflowExceptions caused by very deeply nested tags. When the limit is exceeded, it throws an ApplicationException with the message "Document has more than X nested tags. This is likely due to the page not closing tags properly."

How I'm using HAP after the patch:

HtmlDocument hapDoc = new HtmlDocument();
hapDoc.OptionMaxNestedChildNodes = 5000; // This is the option the patch added
string rawContent = GETTHECONTENTHERE; // Placeholder: supply the raw HTML here
try
{
    hapDoc.LoadHtml(rawContent);
}
catch (Exception e)
{
    // Instead of a StackOverflowException you should now end up here
    hapDoc.LoadHtml(""); // Fall back to an empty document
    _logger.Error(e);
}

Solution 2:

Ideally, the long-term solution is to patch HtmlAgilityPack to use a heap-allocated stack instead of the call stack, but that would be too big an undertaking for me. I've temporarily lost my CodePlex account details, but when I get them back I'll submit an issue report on the problem. I also note that this issue could expose any site that uses HtmlAgilityPack to sanitize user-submitted HTML to a denial-of-service attack: a crafted, overly nested HTML document would cause the w3wp.exe process to die.
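To illustrate what "a heap-allocated stack instead of the call stack" means, here is a minimal sketch of a non-recursive tree walk. It is not HtmlAgilityPack code: the Node class and the TreeWalker.MaxDepth helper are hypothetical names made up for this example, and a real patch would have to rework the library's own parsing and traversal routines.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal tree node; this is NOT HtmlAgilityPack's HtmlNode.
public sealed class Node
{
    public readonly List<Node> Children = new List<Node>();
}

public static class TreeWalker
{
    // Returns the depth of the deepest node (the root counts as 1)
    // without using recursion.
    public static int MaxDepth(Node root)
    {
        if (root == null) return 0;

        int max = 0;

        // Pending work is kept in a Stack<T> on the heap, so the nesting
        // depth is bounded by available memory, not by the thread's call stack.
        var pending = new Stack<KeyValuePair<Node, int>>();
        pending.Push(new KeyValuePair<Node, int>(root, 1));

        while (pending.Count > 0)
        {
            KeyValuePair<Node, int> item = pending.Pop();
            if (item.Value > max) max = item.Value;

            foreach (Node child in item.Key.Children)
            {
                pending.Push(new KeyValuePair<Node, int>(child, item.Value + 1));
            }
        }

        return max;
    }
}
```

A recursive version of the same walk uses one call-stack frame per nesting level, which is what overflows on pathological input. The loop above uses a constant number of frames no matter how deep the tree is.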

In the meantime, I figured the best way forward is to manually override the maximum thread stack size. I was wrong in my earlier statement that a bigger stack-size means that all threads automatically consume that memory (it seems memory pages are allocated for a thread stack as it grows, not all-at-once).

I made a copy of the <ol><li> page and ran some experiments. I found that my program failed when the stack size was less than 2^21 bytes (2MB) in size, but a maximum size of 2^22 bytes (4MB) succeeded - and 4MB in my book passes as an "acceptable" hack... for now.
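For reference, one way to override the stack size for a single piece of work is the Thread constructor that takes a maxStackSize argument. The sketch below is only an illustration of that approach: BigStack.Run is a hypothetical helper name, and the 4MB figure is simply the value that worked in my experiment, not a general recommendation.

```csharp
using System;
using System.Runtime.ExceptionServices;
using System.Threading;

public static class BigStack
{
    // Hypothetical helper: runs 'work' on a dedicated thread whose maximum
    // stack size is 'maxStackSizeBytes', then hands the result back.
    public static T Run<T>(Func<T> work, int maxStackSizeBytes)
    {
        T result = default(T);
        ExceptionDispatchInfo failure = null;

        // Thread(ThreadStart, int) sets the maximum stack size for this
        // thread only; other threads keep their default size.
        var worker = new Thread(() =>
        {
            try { result = work(); }
            catch (Exception e) { failure = ExceptionDispatchInfo.Capture(e); }
        }, maxStackSizeBytes);

        worker.Start();
        worker.Join();

        // Rethrow ordinary exceptions on the calling thread.
        if (failure != null) failure.Throw();
        return result;
    }
}

// Possible usage with the parser (assumes rawContent holds the HTML):
//
//   HtmlDocument doc = BigStack.Run(() =>
//   {
//       var d = new HtmlDocument();
//       d.LoadHtml(rawContent);
//       return d;
//   }, 1 << 22); // 2^22 bytes = 4MB
```

Note that this only raises the ceiling. A document nested deeply enough will still overflow the larger stack, and a StackOverflowException cannot be caught by the try/catch above; it still terminates the process.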