-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a new SanitizeDom method taking an IHtmlDocument as a param #235
Conversation
Codecov Report
@@ Coverage Diff @@
## master #235 +/- ##
==========================================
+ Coverage 91.61% 91.68% +0.06%
==========================================
Files 4 4
Lines 620 625 +5
Branches 80 81 +1
==========================================
+ Hits 568 573 +5
Misses 42 42
Partials 10 10
Continue to review full report at Codecov.
|
Thanks. I think it would cover more use cases if we also surfaced the context parameter, perhaps null by default meaning the whole document will be sanitized. Also please add a few basic unit tests. |
src/HtmlSanitizer/HtmlSanitizer.cs
Outdated
/// <summary> | ||
/// Sanitizes the specified parsed HTML body fragment. | ||
/// The Document Must have been pared with CSS support and the following options enabled | ||
/// "IsIncludingUnknownDeclarations", "IsIncludingUnknownRules" and "IsToleratingInvalidSelectors" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if you pass a document that was not parsed using these options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are not using WithCss
then IElement.GetStyle() will always return null meaning all style attributes are stripped. I think it will likewise effect parsing style sheets although i have not tested that.
Looks like document.Context.GetCssStyling();
well be null if css has not been configured so that might be worth throwing if its null?
As for the individual options i believe angle sharp will strip the unrecognised part when parsing them, in which case this could be removed? As it should not effect the Sanitizer's operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the caller's problem if all style will be stripped unless specific parsing options were used. Does something throw in HtmlSanitizer if these options are left out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the style attribute it will fail silently, im not sure about style sheets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unit tests run fine w/o these options. It seems they are only there for legacy reasons. If you leave out CSS support a number of tests fail of course but none throw (except the threads test but that's expected as it's just a meta test). So I think we can remove the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok i have removed the last line of the comment
src/HtmlSanitizer/HtmlSanitizer.cs
Outdated
/// The Document Must have been pared with CSS support and the following options enabled | ||
/// "IsIncludingUnknownDeclarations", "IsIncludingUnknownRules" and "IsToleratingInvalidSelectors" | ||
/// </summary> | ||
/// <param name="document">The pared HTML Document.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> parsed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still showing "pared".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it in one place not the other 4...
I have made what i believe is the context change as you discussed, can you confirm thats what you where after? On the unit tests, im happy to add some, what did you have in mind as all the existing ones seem to be testing that it santizes things correctly, should i just copy one/a few existing tests and pre parse the html? If we are throwing if Css has not been configured then thats an easy one. And some that based on the context passed in verifies that things are/are not santized depending on the context? |
src/HtmlSanitizer/HtmlSanitizer.cs
Outdated
else | ||
{ | ||
DoSanitize(document, context, baseUrl); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified to DoSanitize(document, context ?? document, baseUrl);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good shout
Just add a simple test with minimal HTML so that coverage is maintained. One call with |
src/HtmlSanitizer/HtmlSanitizer.cs
Outdated
/// <returns>The sanitized HTML Document.</returns> | ||
public IHtmlDocument SanitizeDom(IHtmlDocument document, IHtmlElement context = null, string baseUrl = "") | ||
{ | ||
var styling = document.Context.GetCssStyling(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup did not mean to leave that in
I have added a few basic tests to maintain coverage and addressed all the comments. Let me know if there is anything else you spot |
src/HtmlSanitizer/HtmlSanitizer.cs
Outdated
@@ -578,6 +578,20 @@ public IHtmlDocument SanitizeDom(string html, string baseUrl = "") | |||
return dom; | |||
} | |||
|
|||
/// <summary> | |||
/// Sanitizes the specified parsed HTML body fragment. | |||
/// The Document must have been parsed with CSS support. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Document must have been parsed with CSS support.
This is not strictly necessary I believe. If the document was not parsed with CSS support and thus does not contain style information of course HtmlSanitizer can't magically re-add it. If you want the sanitization process to drop all style then leaving out CSS support might actually be a good thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about If the document has not been parsed with CSS support then all styles will be removed
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed that up
Why
In my use case I have already used anglesharp to parse the document before calling
.SanitizeDocument
and i need to do more work after calling.SanitizeDocument
. This means that in this section anglesharp is parsing the document 3 times, on a large document with CSS support that can be quite slow.What
I have added an extra external API with the same
SanitizeDom
name as it seemed to fit (not sure ifs its technically the correct place however).Potential issues