Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

provide an easy way to validate an entire document and not only its body #2044

Open
robstoll opened this issue Nov 9, 2023 · 7 comments
Open

Comments

@robstoll
Copy link

robstoll commented Nov 9, 2023

consider the following:

Document document = Parser.htmlParser().parseInput("<html><head><script>alert('xss');</script></head><body>all good</body></html>", "");
Cleaner(Safelist.none().addTags("title")).isValid(document);

returns false but not because we added script to head but because there is this check in isValid:

&& dirtyDocument.head().childNodes().isEmpty(); // because we only look at the body, but we start from a shell, make sure there's nothing in the head

I would have expected isValid analyses everything as it a) takes a Document and b) is not named isValidBodyHtml in contrast to the other available method.
Since CleaningVisitor is private, I cannot pass html element instead of the body. Suggestions how I can achieve what I want?

(I suggest you deprecate isValid and add another overload isValidBodyHtml which takes a Document)

@jhy
Copy link
Owner

jhy commented Nov 10, 2023

Yes, the Cleaner only takes content from the body. Per the class documentation:

It is assumed that the input HTML is a body fragment; the clean methods only pull from the source's body, and the
canned safe-lists only allow body contained tags.

I'm not sure what you mean by

returns false but not because we added script to head

It is returning false because there is content in the head.

(I suggest you deprecate isValid and add another overload isValidBodyHtml which takes a Document)

But in that case the name would imply it is vetting the HTML which it would not be. The isValidBodyHtml checks both that there are not parser errors and that nothing was removed by the Cleaner. The isValid(Document) can only test that the Cleaner has not removed anything.

Taking a step back, can you tell me more about what you're trying to achieve, what your input HTML is, and what you consider valid / invalid?

In a world where the Cleaner does support full documents (which I am happy to consider supporting at some point), it seems that your configured Cleaner (just supporting "title") tags would always be invalid, as it will always contain html, head, body.

@jhy
Copy link
Owner

jhy commented Nov 10, 2023

Ref also #1520

@jhy
Copy link
Owner

jhy commented Nov 10, 2023

(BTW, thank you for the feedback!)

Maybe an approach would be that the Cleaner checks what tags are in the safelist. And if it sees {html | head | body}, act on the whole document, and if not, act as now in body fragment mode.

That way there's not another method or configurator, and there's no deprecation. Existing implementations keep the same behavior (body fragment) as adding non-body tags doesn't work currently.

@robstoll
Copy link
Author

robstoll commented Nov 10, 2023

Sorry, should have wrote down the way I came here which was kind of via three corners. Because, in fact, I don't need whole document validation for my current use case I was only quite confused just looking at the method isValid(Document) to find out it only checks the body.

how I came here:

  1. I need to parse HTML to process it further
  2. I still want to check that it is valid in terms of used nodes, attributes
  3. I don't want to parse twice => I create the Parser by my own instead via Jsoup...
  4. parser.parseInput("<!-- will be before HTML --> will be in body") => first figured Safelist.none() does not report on comments correctly but it does as long as it is in the body
  5. now realised that Cleaner.isValid(Document) only validates the body

IMO we can close this issue. Maybe we should create a new one which improves the check in Cleaner.isValid
so that it also catches if comments are define before the domlike:

// because we only look at the body, but we start from a shell, make sure there's nothing in the head...
&& dirtyDocument.head().childNodes().isEmpty()
// and neither before the HTML 
&& dirtyDocument.childNodes().size == 1

We might want to allow DocumentType before HTML? But maybe it would be better to adjust parseInput so that the comment lands in the body instead of before the <html> tag?

In any case, I use now parseFragmentInput instead and no longer suffer this problem.
I hope this gives more insight why I created the issue and somewhat sorry for the noise, feel free to close the issue.

@jhy
Copy link
Owner

jhy commented Nov 15, 2023

Thanks for the detail -- makes sense and helps understand the requirement.

I do want to overhaul the Cleaner interface to enable optionally cleaning whole documents vs body fragments. Thinking that might simply be enabled by adding head elements to the safelist - if those are present, flip to document vs body fragment mode. And also adding more control of what happens to dropped elements - unwrapped, removed, escaped, etc. (See #2050).

Will leave this as a tracker bug for now while planning the revision.

@jhy
Copy link
Owner

jhy commented Nov 16, 2023

I think another requirement here is that in one parse you can get both the cleaned Document and also the isValid response (the combination of if the parser saw any errors, and if any elements / attributes were dropped during cleaning).

Another ask I've seen is for a report of what elements / attributes were dropped.

Thinking then that there's a more detailed response object required from the cleaner that can provide that level of detail along with the response document.

@wnm3

This comment was marked as off-topic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants