Add a new SanitizeDom method taking an IHtmlDocument as a param #235

Merged
merged 7 commits into from
Aug 3, 2020

Conversation

The-Nutty
Contributor

Why

In my use case I have already used AngleSharp to parse the document before calling .SanitizeDocument, and I need to do more work on it after calling .SanitizeDocument. This means that in this section AngleSharp parses the document 3 times, which can be quite slow on a large document with CSS support.

What

I have added an extra external API with the same SanitizeDom name, as it seemed to fit (not sure if it's technically the correct place, however).

Potential issues

  • If the document is parsed without CSS support, that will cause issues. I have noted this in the method's summary docs, but I'm not sure whether it's possible to tell at runtime if it has been parsed with CSS support and throw a reasonable exception.
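
For illustration, the intended usage could look roughly like the sketch below: parse once with AngleSharp, sanitize the already parsed document, then keep working on the same DOM. This is a minimal sketch, not code from the PR. It assumes the AngleSharp and AngleSharp.Css packages are referenced, that the sanitizer lives in the Ganss.XSS namespace, and that the new overload has the signature shown later in this review; the sample HTML is made up.

```csharp
using System;
using AngleSharp;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;
using Ganss.XSS; // assumed namespace of HtmlSanitizer at the time of this PR

class SanitizeDomSketch
{
    static void Main()
    {
        // Parse once, with CSS support enabled so style information is available.
        var browsingContext = BrowsingContext.New(Configuration.Default.WithCss());
        var parser = new HtmlParser(new HtmlParserOptions(), browsingContext);

        // Hypothetical input, for illustration only.
        IHtmlDocument document = parser.ParseDocument(
            "<html><body><p style=\"color: red\" onclick=\"x()\">hi</p></body></html>");

        // ... work on the parsed document before sanitizing ...

        // Sanitize the existing DOM instead of re-parsing a string.
        var sanitizer = new HtmlSanitizer();
        IHtmlDocument sanitized = sanitizer.SanitizeDom(document);

        // ... continue working on the sanitized DOM ...
        Console.WriteLine(sanitized.Body.InnerHtml);
    }
}
```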

@codecov

codecov bot commented Jul 31, 2020

Codecov Report

Merging #235 into master will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #235      +/-   ##
==========================================
+ Coverage   91.61%   91.68%   +0.06%     
==========================================
  Files           4        4              
  Lines         620      625       +5     
  Branches       80       81       +1     
==========================================
+ Hits          568      573       +5     
  Misses         42       42              
  Partials       10       10              
Impacted Files | Coverage Δ
src/HtmlSanitizer/HtmlSanitizer.cs | 94.80% <100.00%> (+0.04%) ⬆️

Last update 8302a0a...2ad943b.

@mganss
Owner

mganss commented Jul 31, 2020

Thanks. I think it would cover more use cases if we also surfaced the context parameter, perhaps null by default meaning the whole document will be sanitized. Also please add a few basic unit tests.

/// <summary>
/// Sanitizes the specified parsed HTML body fragment.
/// The Document Must have been pared with CSS support and the following options enabled
/// "IsIncludingUnknownDeclarations", "IsIncludingUnknownRules" and "IsToleratingInvalidSelectors"
Owner

What happens if you pass a document that was not parsed using these options?

Contributor Author

If you are not using WithCss then IElement.GetStyle() will always return null, meaning all style attributes are stripped. I think it will likewise affect parsing style sheets, although I have not tested that.

It looks like document.Context.GetCssStyling() will be null if CSS has not been configured, so it might be worth throwing if it's null?

As for the individual options, I believe AngleSharp will strip the unrecognised part when parsing them, in which case this could be removed? It should not affect the sanitizer's operation.

Owner

It's the caller's problem if all style will be stripped unless specific parsing options were used. Does something throw in HtmlSanitizer if these options are left out?

Contributor Author

For the style attribute it will fail silently; I'm not sure about style sheets.

Owner

The unit tests run fine w/o these options. It seems they are only there for legacy reasons. If you leave out CSS support a number of tests fail of course but none throw (except the threads test but that's expected as it's just a meta test). So I think we can remove the comment.

Contributor Author

OK, I have removed the last line of the comment.

/// The Document Must have been pared with CSS support and the following options enabled
/// "IsIncludingUnknownDeclarations", "IsIncludingUnknownRules" and "IsToleratingInvalidSelectors"
/// </summary>
/// <param name="document">The pared HTML Document.</param>
Owner

-> parsed

Contributor Author

Fixed

Owner

Still showing "pared".

Contributor Author

Fixed it in one place, not the other 4...

@The-Nutty
Contributor Author

I have made what I believe is the context change you discussed. Can you confirm that's what you were after?

On the unit tests, I'm happy to add some. What did you have in mind? All the existing ones seem to test that it sanitizes things correctly, so should I just copy one or a few existing tests and pre-parse the HTML? If we are throwing when CSS has not been configured, then that's an easy one. And perhaps some that verify that things are or are not sanitized depending on the context passed in?

else
{
    DoSanitize(document, context, baseUrl);
}

Owner

This can be simplified to DoSanitize(document, context ?? document, baseUrl);

Contributor Author

Good shout

@mganss
Owner

mganss commented Jul 31, 2020

Just add a simple test with minimal HTML so that coverage is maintained. One call with context = null, one with a context.

/// <returns>The sanitized HTML Document.</returns>
public IHtmlDocument SanitizeDom(IHtmlDocument document, IHtmlElement context = null, string baseUrl = "")
{
    var styling = document.Context.GetCssStyling();
Owner

This can be removed.

Contributor Author

Yup, did not mean to leave that in.

@The-Nutty
Contributor Author

I have added a few basic tests to maintain coverage and addressed all the comments. Let me know if there is anything else you spot.

@@ -578,6 +578,20 @@ public IHtmlDocument SanitizeDom(string html, string baseUrl = "")
return dom;
}

/// <summary>
/// Sanitizes the specified parsed HTML body fragment.
/// The Document must have been parsed with CSS support.
Owner

mganss, Aug 3, 2020

The Document must have been parsed with CSS support.

This is not strictly necessary, I believe. If the document was not parsed with CSS support and thus does not contain style information, HtmlSanitizer of course can't magically re-add it. If you want the sanitization process to drop all style, then leaving out CSS support might actually be a good thing.

Contributor Author

How about "If the document has not been parsed with CSS support then all styles will be removed." instead?

Contributor Author

Pushed that up.

@mganss mganss merged commit 987b5cb into mganss:master Aug 3, 2020