Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nonlinear performance with just a few bad tags, among a sea of valid HTML. #161

Closed
GilesDMiddleton opened this issue Mar 27, 2019 · 5 comments

Comments

@GilesDMiddleton
Copy link

Great component, loving it, but...
while using HtmlSanitizer, I noticed some poor, non-linear performance figures. I understand my data is not representative of HTML pages, but obviously when data has come from external systems, which may have not been guarded or sanitized, it might not be small. My fear is that the poor performance could expose this component to a denial of service attack. I understand this could all be AngleSharp's problem, but wondered if this component could mitigate/prevent these issues...

To test: I created HTML simply containing <br/> tags, and injected one or two bad <br> tags at position 196 and 1736. And in the extreme case 100 bad tags scattered randomly.

Here are the figures for performance.

HTML Size Number of bad BRs Time taken Added Memory
800k 1 6s 20mb
800k 2 7.3s 20mb
8mb 1 612s 145mb
8mb 2 650s 145mb
8mb 100 633s 210mb
80mb 1 ?s 1.6Gb
100mb 1 ?s 2Gb

?=Didn't wait to find out, but longer than I can be bothered to wait.

Is this what you'd expect? All we do that's special is whitelist 60 tags, allow "face" attributes and disallow "src" attributes.

Breaking the debugger usually shows that all the work is being done in AngleSharp.Dom.Node.RemoveChild, but I haven't run a perfview to find out more.

If there are size, speed, memory limits to this module, can they be published?

At the moment, my own plan for mitigation is that once any embeded img data tags are stripped, if the size of the HTML is over 1mb, I won't bother sanitizing it, and may reject it.

@mganss
Copy link
Owner

mganss commented Mar 28, 2019

Can you post a zip with the test HTML files or a code snippet to generate the test files?

@GilesDMiddleton
Copy link
Author

lotsofBRs8mb.zip

@mganss mganss closed this as completed in 0a9c2df Mar 31, 2019
@mganss
Copy link
Owner

mganss commented Apr 1, 2019

This was due to inefficient parsing. Improved in 4.0.210 and I'm seeing linear performance now.

@GilesDMiddleton
Copy link
Author

This was due to inefficient parsing. Improved in 4.0.210 and I'm seeing linear performance now.

Thanks for the quick turnaround! We'll be testing that shortly.

@GilesDMiddleton
Copy link
Author

Fantastic - 100mb in 52s, 8mb in~4s. Brilliant stuff. Thank you, bravo!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants