Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack Overflow loading html document #499

Open
johnkarbonhq opened this issue May 31, 2023 · 3 comments
Open

Stack Overflow loading html document #499

johnkarbonhq opened this issue May 31, 2023 · 3 comments
Assignees

Comments

@johnkarbonhq
Copy link

Here is what to include in your request to make sure we implement a solution as quickly as possible.

1. Description

Parsing html emails (which are often not well formed), I got a stack overflow. Setting HtmlDocument.MaxDepthLevel did not prevent the stack overflow.

2. Exception

If you are seeing an exception, include the full exception details (message and stack trace).

Html.SetChanged... called recursively 5,000 times

3. Fiddle or Project

Repro is here: https://dotnetfiddle.net/mKe6jk

@JonathanMagnan
Copy link
Member

Hello @johnkarbonhq ,

Thank you for reporting,

My developer will look at it.

Best Regards,

Jon

@JonathanMagnan
Copy link
Member

Hello @johnkarbonhq ,

Which version are you using?

When my developer tried it, he successfully got the error message raised on this line:

internal void CloseNode(HtmlNode endnode, int level = 0)
{
	if (level > HtmlDocument.MaxDepthLevel)
	{
		throw new ArgumentException(HtmlNode.DepthLevelExceptionMessage);
	}

Perharp is there something we are missing to reproduce it?

Best Regards,

Jon

@elgonzo
Copy link
Contributor

elgonzo commented Jun 7, 2023

I suspect the real code of @johnkarbonhq already being quite deep in the call stack when calling HtmlDocument.LoadHtml, thus resulting in the call stack becoming exhausted. With a simple console application that more or less directly calls HtmlDocument.LoadHtml this is probably not observable simply because the call stack is pretty much mostly empty and doesn't overflow.

However, it's still just my speculation i am unable to either prove or disprove because i can't analyse the call stack at the time of the observed Stackoverflow exception (only @johnkarbonhq can do that).

If my suspicion is correct, then reducing the value HtmlDocument.MaxDepthLevel further (perhaps like HtmlDocument.MaxDepthLevel = 50;) could avoid the observed stack overflow. And if that's the case, the suggested patch/fix isn't going to really fix the problem.

Why? Because if and at which internal HtmlAgilityPack method call a stack overflow happens is dependent on how full the call stack already is. Quite possibly, right now with the suggested patch @johnkarbonhq's application code will avoid the stack overflow using HtmlDocument.MaxDepthLevel = 100;. But it is possible that a (later) change to the application code or simply a different control flow in the application (for example when the control flow depends on external inputs/data/other factors) would lead to the call stack being filled just a little more before calling HtmlDocument.LoadHtml and where setting HtmlDocument.MaxDepthLevel = 100; would then not prevent a stackoverflow exception anymore. Due to its reliance on a very specific and particular assumption about the call stack usage of the application that's calling HtmlDocument.LoadHtml, the suggested fix/patch appears to me as a very brittle band-aid for the observed problem

Ideal solution -- and also most expensive, sadly -- would be for the library to avoid deep recursion and converting potentially deep recursions into iterations or loops wherever practically feasible. For the SetChanged() method, this could for example look like this without losing much readability and maintainability (Warning! Untested code!):

internal void SetChanged()
{
	var currentNode = this;
	do
	{
		currentNode._changed = true;
		currentNode = currentNode.ParentNode;
	}
	while (currentNode is not null);
}

And a HtmlDocument.MaxDepthLevel check should then apply only to the remaining potentially deep recursions related to document traversal and to other functions which have to enforce HtmlDocument.MaxDepthLevel.

And in case the application code itself keeps eating up the call stack like there is no tomorrow where even the best and most thorough recursion optimization in HtmlAgilityPack wouldn't completely prevent call stack exhaustion, then the solution lies with the application: Either reduce its call stack usage or (as a workaround) execute call stack-intensive code in individual background thread (as each thread has its own dedicated call stack).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants