Put W3C DOM elements in HTML namespace by default for jhy/jsoup#1837. #1848

garretwilson · 2022-10-01T17:30:34Z

garretwilson · 2022-10-01T17:32:58Z

src/main/java/org/jsoup/helper/W3CDom.java

-                    Element el = namespace == null && tagName.contains(":") ?
-                        doc.createElementNS("", tagName) : // doesn't have a real namespace defined
-                        doc.createElementNS(namespace, tagName);
+                    // use an empty namespace if none is present but the tag name has a prefix


I didn't change any functionality here. The modification makes it clear that the only thing happening is that you're "imputing" a namespace of the empty string under certain conditions; you're not calling a separate method. The original code duplicated the call to doc.createElementNS(), which was confusing and obscured the purpose of the logic.
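The refactoring described above can be sketched in isolation as follows. This is a minimal illustration, not the code from the PR: `NamespaceImputation` and `effectiveNamespace` are hypothetical names, and the real logic lives inline in `W3CDom`.

```java
// Hypothetical helper; the names are illustrative and not part of jsoup.
final class NamespaceImputation {
    /**
     * Chooses the namespace to pass to Document.createElementNS():
     * the empty string when no namespace is known but the tag name has a prefix,
     * otherwise the given namespace unchanged (which may be null).
     */
    static String effectiveNamespace(String namespace, String tagName) {
        if (namespace == null && tagName.contains(":")) {
            return ""; // prefixed name without a declared namespace
        }
        return namespace;
    }
}
```

With a helper like this, a single `doc.createElementNS(effectiveNamespace(namespace, tagName), tagName)` call would replace the two duplicated branches.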

garretwilson · 2022-10-01T17:46:11Z

I don't know why some of the integration tests are failing on macOS, but from the output it doesn't appear to be related to any changes here. My changes shouldn't have anything to do with integration-level issues, and the unit tests are all working.

jhy

Thanks for the PR. Please see a couple notes inline and let me know if you have any thoughts.

LMK if you want to proceed with the changes, otherwise I am happy to apply them also.

jhy · 2023-01-24T05:32:27Z

src/main/java/org/jsoup/helper/W3CDom.java

@@ -348,6 +387,7 @@ protected static class W3CBuilder implements NodeVisitor {
        public W3CBuilder(Document doc) {
            this.doc = doc;
            namespacesStack.push(new HashMap<>());
+            namespacesStack.peek().put("", "http://www.w3.org/1999/xhtml"); // TODO document


I feel that we should only be setting this default namespace if the input document is HTML. If it's XML, then the HTML -> XML compat note doesn't apply, and we shouldn't be applying it.

The impl could be something like

if (inDoc.parser().getTreeBuilder() instanceof HtmlTreeBuilder)

Would need to move this to the appropriate convert() method. Or, add that flag as user data to the W3C Document.

I'm not sure if checking the treebuilder is the best way to see if it's HTML vs XML. The output syntax is how we normally check that, but that doesn't tell us particularly if the input was HTML, and people may be setting the syntax to XML before calling these methods. So am leaning towards checking treebuilder.
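To make the effect of the seeded default namespace concrete, here is a small stand-alone sketch of a namespace stack like the one in the diff above. It assumes a simplified model: `NamespaceStackSketch`, `resolve`, and the `inputIsHtml` flag are hypothetical and only approximate how `W3CBuilder` tracks prefixes.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone model; not the actual W3CBuilder implementation.
final class NamespaceStackSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    private final Deque<Map<String, String>> namespacesStack = new ArrayDeque<>();

    NamespaceStackSketch(boolean inputIsHtml) {
        namespacesStack.push(new HashMap<>());
        if (inputIsHtml) {
            // Seed the default (empty-prefix) namespace only for HTML input.
            namespacesStack.peek().put("", XHTML_NS);
        }
    }

    /** Looks up the namespace bound to the tag name's prefix in the current scope; null if none. */
    String resolve(String tagName) {
        int colon = tagName.indexOf(':');
        String prefix = colon > 0 ? tagName.substring(0, colon) : "";
        return namespacesStack.peek().get(prefix);
    }
}
```

In this model an unprefixed tag such as `div` resolves to the XHTML namespace only when the input is flagged as HTML, which is the distinction the review comment asks for.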

garretwilson · 2023-01-31

> I feel that we should only be setting this default namespace if the input document is HTML.

That makes sense. I don't know enough about your library's API to know when it thinks it's parsing XML. I had assumed it only parsed things considered to be non-XML HTML. (Otherwise I would think you could just use an off-the-shelf XML parser because the input would not be "soup".)

> So am leaning towards checking treebuilder.

You know best here. All I care about is that when I parse an HTML document, the HTML namespace gets imputed as per the spec. If the tree builder is always set to HtmlTreeBuilder in those cases, that works. Whatever you want to do.

> Would need to move this to the appropriate convert() method. Or, add that flag as user data to the W3C Document.

That sounds like a lot of shuffling. I note that the W3CBuilder does seem to get passed the org.jsoup.nodes.Element being converted via the W3C document user data context property. In normal cases that element would know its owner org.jsoup.nodes.Document would it not? (If not, can we assume this is HTML input? I would imagine it to be a rare case anyway for the jsoup element not to have an owner document.)

So what about this inside the W3CBuilder constructor (where I made the change already)?

    final org.jsoup.nodes.Document inDoc = contextElement.ownerDocument();
    if (inDoc == null || inDoc.parser().getTreeBuilder() instanceof HtmlTreeBuilder) {
        namespacesStack.peek().put("", "http://www.w3.org/1999/xhtml"); // TODO document
    }

(I see that I had forgotten to document the actual namespace imputation line. I'll add that once I get an OK from you on how to proceed.)

garretwilson · 2023-02-09

@jhy could you respond here and let me know which way to go, so that I can finish this PR and get it accepted?

jhy · 2023-02-18

> That sounds like a lot of shuffling. I note that the W3CBuilder does seem to get passed the org.jsoup.nodes.Element being converted via the W3C document user data context property. In normal cases that element would know its owner org.jsoup.nodes.Document would it not? (If not, can we assume this is HTML input? I would imagine it to be a rare case anyway for the jsoup element not to have an owner document.)

OK, using the contextElement is a good idea. And yes, let's go with checking if the builder was HtmlTreeBuilder. That will work for all cases in the core library (vs anyone's own extensions, which is kind of up to them).

I would not assume that an Element without an ownerDocument is HTML though, so we should not add the namespace if it is null. There will be no ownerDoc for Elements constructed via new and not attached to another doc, which is a supported use case. And in that case we don't know either way.

I plan on making this namespace inference a defaulted-on option in the W3C DOM. This will give existing users who have working code without a namespace (e.g. simple XPath queries) a migration path, and accommodate new users who prefer it otherwise.
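The migration concern for XPath users can be illustrated with the JDK's own DOM and XPath APIs. This sketch builds a tiny namespaced document by hand rather than through jsoup, so it only approximates what a converted document looks like; `XPathNamespaceSketch`, `buildDoc`, and `count` are hypothetical names.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical illustration; the document is hand-built, not produced by jsoup.
final class XPathNamespaceSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Builds <html><p>Hello</p></html> with both elements in the XHTML namespace. */
    static Document buildDoc() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElementNS(XHTML_NS, "html");
            Element p = doc.createElementNS(XHTML_NS, "p");
            p.setTextContent("Hello");
            html.appendChild(p);
            doc.appendChild(html);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Counts the nodes selected by an XPath expression, with no namespace context set. */
    static int count(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Double n = (Double) xpath.evaluate("count(" + expression + ")", doc, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here an unprefixed query such as `//p` selects nothing because XPath 1.0 treats unprefixed names as having no namespace, while `//*[local-name()='p']` still finds the element. Registering a `NamespaceContext` that binds a prefix to the XHTML namespace is the other usual route, and an option to turn the inference off keeps the plain `//p` style working.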

jhy · 2023-01-24T05:36:35Z

src/main/java/org/jsoup/helper/W3CDom.java

+    static String removeDefaultHtmlNamespaceDeclaration(String html) {
+        Matcher matcher = HTML_DEFAULT_NAMESPACE_PATTERN.matcher(html);
+        if (matcher.find()) {
+          html = html.substring(0, matcher.start(1)) + html.substring(matcher.end(1));


I think it is better to leave the introduced namespace in the output. It makes the output XML clearer as to what's happened, and may help other downstream parses. So the regex components etc can be removed.

> We can leave it like that or play tricks to get rid of it, but it depends on the purpose and usage of W3CDom.asString(). Is it just used for testing? Or is it part of the jsoup API for turning DOM into a string?

It's part of the API.

> jsoup is about parsing, not pretty-printing

I consider jsoup to be also about pretty-printing. And we do introduce other tokens throughout jsoup as a convenience (e.g. xml declarations, meta charsets, the HTML document structure, etc).

garretwilson · 2023-01-31

> I think it is better to leave the introduced namespace in the output.

For my purposes I don't care one way or another, and it's certainly easier just to leave it in. At the time, due to lack of response to the ticket, I had to make a choice and do what I thought would raise the chance of the PR being accepted. (You can see all the time and painful research on Stack Overflow to remove the declaration.) I'll remove the code that removed the declaration in the output.
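As a rough illustration of what "leaving the introduced namespace in the output" means, the following sketch serializes a hand-built namespaced DOM with the JDK's identity transformer. It does not call `W3CDom.asString()`; `SerializedNamespaceSketch` and `serialize` are hypothetical names, and the exact output formatting may vary between transformer implementations.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical illustration using only JDK APIs; not the jsoup serialization code.
final class SerializedNamespaceSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Serializes a single <html> element that was created in the XHTML namespace. */
    static String serialize() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElementNS(XHTML_NS, "html");
            doc.appendChild(html);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The serializer writes an `xmlns` declaration on the root element even though no such attribute was added explicitly, which is the declaration the removed regex code had been stripping.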

jhy · 2023-01-24T05:40:37Z

> I don't know why some of the integration tests are failing on macOS, but from the output it doesn't appear to be related to any changes here. My changes shouldn't have anything to do with integration-level issues, and the unit tests are all working.

Yep don't worry about those - likely to have just been GitHub not having enough Mac resources on hand when the tests ran to complete some of the timing tests within the limit.

garretwilson · 2023-04-13T21:29:32Z

> OK, using the contextElement is a good idea. And yes, let's go with checking if the builder was HtmlTreeBuilder. That will work for all cases in the core library (vs anyone's own extensions, which is kind of up to them).

@jhy I've made the changes you've requested. Sorry for the delay. I merged in your latest changes from master so this pull request should now be completely up to date.

Can you review this and merge it soon? I really don't want it to get stale, and each time I come back to it after a couple of months it's harder to re-orient myself to where I left it. 😅

Let me know if you need further changes.

jhy · 2023-05-06T02:30:09Z

Thanks, merged! Appreciate your perseverance in getting it completed.

I added a test that the converter is in namespace aware mode before applying it, so that it can be optionally disabled.
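Putting the conditions from this thread together, the decision to impute the HTML namespace might be summarized as below. This is a sketch under stated assumptions: `NamespaceDecisionSketch` and the `InputKind` enum are hypothetical stand-ins (the merged code checks the converter's namespace-aware setting, the context element's owner document, and its tree builder rather than an enum).

```java
// Hypothetical summary of the agreed rules; not the merged jsoup code.
final class NamespaceDecisionSketch {
    /** Stand-in for what the converter can learn from the context element. */
    enum InputKind {
        HTML,               // owner document was parsed with the HTML tree builder
        XML,                // owner document was parsed with another tree builder
        NO_OWNER_DOCUMENT   // element created via new and never attached to a document
    }

    /**
     * The default HTML namespace is applied only when the converter is namespace aware
     * and the input is known to be HTML; an unknown origin is not assumed to be HTML.
     */
    static boolean shouldImputeHtmlNamespace(boolean namespaceAware, InputKind input) {
        return namespaceAware && input == InputKind.HTML;
    }
}
```

Treating a missing owner document as "unknown" rather than "HTML" follows the review feedback above, and the namespace-aware flag gives callers a way to opt out.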

garretwilson · 2023-05-06T02:31:28Z

Woohoo!! Thanks; this is exciting.

Put W3C DOM elements in HTML namespace by default for #1837.

350aadf

garretwilson commented Oct 1, 2022


jhy requested changes Jan 24, 2023


garretwilson added 3 commits January 31, 2023 11:27

Included imputed HTML namespace in serialization for #1837.

89d022d

Impute HTML namespace only if input document is HTML for #1837.

859b8aa

Merge branch 'master' into issues/#1837

d45b6e9

garretwilson requested a review from jhy April 13, 2023 21:33

Use namespaceAware to choose if to set XHTML namespace

a7f9236

jhy merged commit 4a278e9 into jhy:master May 6, 2023
12 checks passed

jhy added a commit that referenced this pull request May 6, 2023

Changelog for #1848

f284d35

jhy added the improvement label May 6, 2023

jhy added this to the 1.16.2 milestone May 6, 2023

garretwilson deleted the issues/jhy/jsoup#1837 branch May 6, 2023 02:31

garretwilson commented Oct 1, 2022

garretwilson Oct 1, 2022 •

edited

garretwilson commented Oct 1, 2022

jhy left a comment

jhy Jan 24, 2023 •

edited

garretwilson Jan 31, 2023

garretwilson Feb 9, 2023

jhy Feb 18, 2023

jhy Jan 24, 2023

garretwilson Jan 31, 2023

jhy commented Jan 24, 2023

garretwilson commented Apr 13, 2023

jhy commented May 6, 2023

garretwilson commented May 6, 2023
