Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Put W3C DOM elements in HTML namespace by default for jhy/jsoup#1837. #1848

Merged
merged 5 commits into from
May 6, 2023
Merged

Put W3C DOM elements in HTML namespace by default for jhy/jsoup#1837. #1848

merged 5 commits into from
May 6, 2023

Conversation

garretwilson
Copy link
Contributor

Fixes #1837.

Element el = namespace == null && tagName.contains(":") ?
doc.createElementNS("", tagName) : // doesn't have a real namespace defined
doc.createElementNS(namespace, tagName);
// use an empty namespace if none is present but the tag name has a prefix
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't change any functionality here. The modification makes it clear that the only thing happening is that you're "imputing" a namespace of the empty string under certain conditions — you're not calling a separate method. The original code duplicated the call to doc.createElementNS(), which primarily was confusing and obscured the purpose of the logic.

@garretwilson
Copy link
Contributor Author

I don't know why some of the integration tests are failing on macOS, but from the output it doesn't appear to be related to any changes here. My changes shouldn't have anything to do with integration-level issues, and the unit tests are all working.

Copy link
Owner

@jhy jhy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Please see a couple notes inline and let me know if you have any thoughts.

LMK if you want to proceed with the changes, otherwise I am happy to apply them also.

@@ -348,6 +387,7 @@ protected static class W3CBuilder implements NodeVisitor {
public W3CBuilder(Document doc) {
this.doc = doc;
namespacesStack.push(new HashMap<>());
namespacesStack.peek().put("", "http://www.w3.org/1999/xhtml"); // TODO document
Copy link
Owner

@jhy jhy Jan 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that we should only be setting this default namespace if the input document is HTML. If it's XML, then the HTML -> XML compat note doesn't apply, and we shouldn't be applying it.

The impl could be something like

if (inDoc.parser().getTreeBuilder() instanceof HtmlTreeBuilder)

Would need to move this to the appropriate convert() method. Or, add that flag as user data to the W3C Document.

I'm not sure if checking the treebuilder is the best way to see if it's HTML vs XML. The output syntax is how we normally check that, but that doesn't tell us particularly if the input was HTML, and people may be setting the syntax to XML before calling these methods. So am leaning towards checking treebuilder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that we should only be setting this default namespace if the input document is HTML.

That makes sense. I don't know enough about your library's API to know when it thinks it's parsing XML. I had assumed it only parsed things considered to be non-XML HTML. (Otherwise I would think you could just use an off-the-shelf XML parser because the input would not be "soup".)

So am leaning towards checking treebuilder.

You know best here. All I care about is when I parse an HTML document, the HTML namespace gets imputed as per the space. If the tree builder is always set to HtmlTreeBuilder in those cases, that works. Whatever you want to do.

Would need to move this to the appropriate convert() method. Or, add that flag as user data to the W3C Document.

That sounds like a lot of shuffling. I note that the W3CBuilder does seem to get passed the org.jsoup.nodes.Element being converted via the W3C document user data context property. In normal cases that element would know its owner org.jsoup.nodes.Document would it not? (If not, can we assume this is HTML input? I would imagine it to be a rare case anyway for the jsoup element not to have an owner document.)

So what about this inside the W3CBuilder constructor (where I made the change already)?

final org.jsoup.nodes.Document inDoc = contextElement.ownerDocument();
if (inDoc == null || inDoc.parser().getTreeBuilder() instanceof HtmlTreeBuilder) {
  namespacesStack.peek().put("", "http://www.w3.org/1999/xhtml"); // TODO document
}

(I see that I had forgotten to document the actual namespace imputation line. I'll add that once I get an OK from you on how to proceed.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jhy could you respond here and let me know which way to go, so that I can finish this PR and get it accepted?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a lot of shuffling. I note that the W3CBuilder does seem to get passed the org.jsoup.nodes.Element being converted via the W3C document user data context property. In normal cases that element would know its owner org.jsoup.nodes.Document would it not? (If not, can we assume this is HTML input? I would imagine it to be a rare case anyway for the jsoup element not to have an owner document.)

OK, using the contextElement is a good idea. And yes let's go with checking if the builder was HtmlTreeBuilder. That will work for all cases in the core library (vs anyone's own extensions, which is kind of up them).

I would not assume that an Element without an ownerDocument is HTML though, so we should not add the namespace if it is null. There will be no ownerDoc for Elements constructed via new and not attached to another doc, which is a supported use case. And in that case we don't know either way.

I plan on making this namespace inference a defaulted-on option in the W3C DOM. This will allow existing uses who have working code without a namespace (for e.g. simple xpath queries) a migration path, or for new uses that prefer it otherwise.

static String removeDefaultHtmlNamespaceDeclaration(String html) {
Matcher matcher = HTML_DEFAULT_NAMESPACE_PATTERN.matcher(html);
if (matcher.find()) {
html = html.substring(0, matcher.start(1)) + html.substring(matcher.end(1));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to leave the introduced namespace in the output. It makes the output XML clearer as to what's happened, and may help other downstream parses. So the regex components etc can be removed.

We can leave it like that or play tricks to get rid of it, but it depends on the purpose and usage of W3CDom.asString(). Is it just used for testing? Or is it part of the jsoup API for turning DOM into a string?

It's part of the API.

jsoup is about parsing, not pretty-printing

I consider jsoup to be also about pretty-printing. And we do introduce other tokens throughout jsoup as a convenience (e.g. xml declarations, meta charsets, the HTML document structure, etc).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to leave the introduced namespace in the output.

For my purposes I don't care one way or another, and it's certainly easier just to leave it in. At the time, due to lack of response to the ticket, I had to make a choice and do what I thought would raise the chance of the PR being requested. (You can see all the time and painful research on Stack Overflow to remove the declaration.) I'll remove the code that removed the declaration in the output.

@jhy
Copy link
Owner

jhy commented Jan 24, 2023

I don't know why some of the integration tests are failing on macOS, but from the output it doesn't appear to be related to any changes here. My changes shouldn't have anything to do with integration-level issues, and the unit tests are all working.

Yep don't worry about those - likely to have just been GitHub not having enough Mac resources on hand when the tests ran to complete some of the timing tests within the limit.

@garretwilson
Copy link
Contributor Author

OK, using the contextElement is a good idea. And yes let's go with checking if the builder was HtmlTreeBuilder. That will work for all cases in the core library (vs anyone's own extensions, which is kind of up them).

@jhy I've made the changes you've requested. Sorry for the delay. I merged in your latest changes from master so this pull request should now be completely up to date.

Can you can review this and merge it soon? I really don't want for it to get stale, and each time I come back to it after a couple of months it's harder to re-orient myself to where I left it. 😅

Let me know if you need further changes.

@garretwilson garretwilson requested a review from jhy April 13, 2023 21:33
@jhy jhy merged commit 4a278e9 into jhy:master May 6, 2023
12 checks passed
jhy added a commit that referenced this pull request May 6, 2023
@jhy jhy added the improvement label May 6, 2023
@jhy jhy added this to the 1.16.2 milestone May 6, 2023
@jhy
Copy link
Owner

jhy commented May 6, 2023

Thanks, merged! Appreciate your perseverance in getting it completed.

I added a test that the converter is in namespace aware mode before applying it, so that it can be optionally disabled.

@garretwilson
Copy link
Contributor Author

Woohoo!! Thanks; this is exciting.

@garretwilson garretwilson deleted the issues/jhy/jsoup#1837 branch May 6, 2023 02:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: DOM elements not being placed in (X)HTML namespace.
2 participants