Stack Overflow loading html document #499

johnkarbonhq · 2023-05-31T03:04:35Z

Here is what to include in your request to make sure we implement a solution as quickly as possible.

1. Description

Parsing html emails (which are often not well formed), I got a stack overflow. Setting HtmlDocument.MaxDepthLevel did not prevent the stack overflow.

2. Exception

If you are seeing an exception, include the full exception details (message and stack trace).

Html.SetChanged... called recursively 5,000 times

3. Fiddle or Project

Repro is here: https://dotnetfiddle.net/mKe6jk

JonathanMagnan · 2023-05-31T14:41:46Z

Hello @johnkarbonhq ,

Thank you for reporting,

My developer will look at it.

Best Regards,

Jon

JonathanMagnan · 2023-06-07T00:10:56Z

Hello @johnkarbonhq ,

Which version are you using?

When my developer tried it, he successfully got the error message raised on this line:

internal void CloseNode(HtmlNode endnode, int level = 0)
{
	if (level > HtmlDocument.MaxDepthLevel)
	{
		throw new ArgumentException(HtmlNode.DepthLevelExceptionMessage);
	}

Perharp is there something we are missing to reproduce it?

Best Regards,

Jon

elgonzo · 2023-06-07T11:43:33Z

I suspect the real code of @johnkarbonhq already being quite deep in the call stack when calling HtmlDocument.LoadHtml, thus resulting in the call stack becoming exhausted. With a simple console application that more or less directly calls HtmlDocument.LoadHtml this is probably not observable simply because the call stack is pretty much mostly empty and doesn't overflow.

However, it's still just my speculation i am unable to either prove or disprove because i can't analyse the call stack at the time of the observed Stackoverflow exception (only @johnkarbonhq can do that).

If my suspicion is correct, then reducing the value HtmlDocument.MaxDepthLevel further (perhaps like HtmlDocument.MaxDepthLevel = 50;) could avoid the observed stack overflow. And if that's the case, the suggested patch/fix isn't going to really fix the problem.

Why? Because if and at which internal HtmlAgilityPack method call a stack overflow happens is dependent on how full the call stack already is. Quite possibly, right now with the suggested patch @johnkarbonhq's application code will avoid the stack overflow using HtmlDocument.MaxDepthLevel = 100;. But it is possible that a (later) change to the application code or simply a different control flow in the application (for example when the control flow depends on external inputs/data/other factors) would lead to the call stack being filled just a little more before calling HtmlDocument.LoadHtml and where setting HtmlDocument.MaxDepthLevel = 100; would then not prevent a stackoverflow exception anymore. Due to its reliance on a very specific and particular assumption about the call stack usage of the application that's calling HtmlDocument.LoadHtml, the suggested fix/patch appears to me as a very brittle band-aid for the observed problem

Ideal solution -- and also most expensive, sadly -- would be for the library to avoid deep recursion and converting potentially deep recursions into iterations or loops wherever practically feasible. For the SetChanged() method, this could for example look like this without losing much readability and maintainability (Warning! Untested code!):

internal void SetChanged()
{
	var currentNode = this;
	do
	{
		currentNode._changed = true;
		currentNode = currentNode.ParentNode;
	}
	while (currentNode is not null);
}

And a HtmlDocument.MaxDepthLevel check should then apply only to the remaining potentially deep recursions related to document traversal and to other functions which have to enforce HtmlDocument.MaxDepthLevel.

And in case the application code itself keeps eating up the call stack like there is no tomorrow where even the best and most thorough recursion optimization in HtmlAgilityPack wouldn't completely prevent call stack exhaustion, then the solution lies with the application: Either reduce its call stack usage or (as a workaround) execute call stack-intensive code in individual background thread (as each thread has its own dedicated call stack).

johnkarbonhq mentioned this issue May 31, 2023

When traversing the DOM during SetChanged track the depth to avoid StackOverflowException #500

Open

JonathanMagnan self-assigned this May 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stack Overflow loading html document #499

Stack Overflow loading html document #499

johnkarbonhq commented May 31, 2023

JonathanMagnan commented May 31, 2023

JonathanMagnan commented Jun 7, 2023

elgonzo commented Jun 7, 2023 •

edited

Stack Overflow loading html document #499

Stack Overflow loading html document #499

Comments

johnkarbonhq commented May 31, 2023

1. Description

2. Exception

3. Fiddle or Project

JonathanMagnan commented May 31, 2023

JonathanMagnan commented Jun 7, 2023

elgonzo commented Jun 7, 2023 • edited

elgonzo commented Jun 7, 2023 •

edited