Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: relink_namespaces handles attributes properly #2310

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

flavorjones
Copy link
Member

@flavorjones flavorjones commented Aug 14, 2021

What problem is this PR intended to solve?

See #1790 for context.

Skip relinking namespaces in HTML4/5 documents

Also fix both implementations of relink_namespaces to not skip
attributes when the node itself doesn't have a namespace.

Finally, the private method
XML::Node#add_child_node_and_reparent_attrs is no longer needed after
fixing relink_namespaces, so this commit essentially reverts 9fd03d8.

Have you included adequate test coverage?

I've added tests for HTML reparenting to ensure namespaces aren't linked.

Does this change affect the behavior of either the C or the Java implementations?

Both implementations are kept in sync with each other here.

@flavorjones
Copy link
Member Author

I've added a commit that introduces test coverage for the HTML5 behavior from 7b3c972. This currently fails, will make it pass shortly.

@flavorjones flavorjones force-pushed the 1790-better-attr-renamespacing branch from 764a1f9 to 1837c0e Compare August 14, 2021 16:46
@flavorjones
Copy link
Member Author

Should be green now, I'm curious what you think @stevecheckoway

@flavorjones flavorjones added this to the v1.13.0 milestone Aug 14, 2021
@flavorjones flavorjones force-pushed the 1790-better-attr-renamespacing branch from 1837c0e to cf0392e Compare August 14, 2021 17:00
@stevecheckoway
Copy link
Contributor

Hey @flavorjones, first let me apologize if I've missed other PRs or issues in which you've mentioned me. I need to get my GitHub settings squared away. Every time one of my students creates a repository (so once per assignment), GitHub adds it to my watch list. I'm apparently watching 1300 repos right now! The emails are overwhelming.

Namespaces in HTML documents is an area that I'm really pretty weak on. I found the HTML standard confusing and I haven't taken the time to dig in to what browsers do. HTML documents have nodes that live in one of six namespaces. HTML elements live in the HTML namespace, MathML elements live in the MathML namespace, and SVG elements live in the SVG namespace. The XLink, XML, and XMLNS namespaces are for attributes.

During parsing, math and svg start tags cause math and svg elements to be inserted in the MathML and SVG namespaces, respectively. Several places require attributes on foreign (i.e., not HTML) elements to adjust attributes such that some end up in a namespace. Normally, HTML attributes should not have namespaces. (I have no idea if we handle that correctly or not currently.)

Here's an example showing what I think should happen.

<!DOCTYPE html>
<svg viewBox="0 0 200 100" xmlns="http://www.w3.org/2000/svg">
  <title><span xml:lang="en-US">title</span></title>
  <text xml:lang="en-US" foo:bar="blah">This is some English text</text>
</svg>
<p xml:lang="en-US">More text.
  • The (parser-inserted) html, head, and body elements are in the HTML namespace with null prefixes.
  • The svg element is in the SVG namespace with null prefix.
  • The viewBox attribute has no namespace.
  • The xmlns attribute is in the XMLNS namespace with null prefix.
  • The title element is in the SVG namespace with null prefix.
  • The span element is in the HTML namespace with null prefix
  • The span's xml:lang attribute has local name xml:lang and no namespace.
  • The text element is in the SVG namespace with null prefix.
  • The text's xml:lang attribute is in the XML namespace with prefix xml and local name lang.
  • The text's foo:bar attribute has local name foo:bar and no namespace.
  • The p element is in the HTML namespace with null prefix.
  • The p's xml:lang attribute has local name xml:lang and no namespace.

I've bolded the parts that might be surprising.

As you can see, it's quite complicated. Some foreign elements, and only in some circumstances (namely when one of these conditions is satisfied), can contain HTML elements whose attributes never have namespaces and can contain colons. And only some attributes with colons in the name in foreign elements end up with namespaces.

(It's probably worth noting that this impacts XPath and XSLT.)

So the question becomes what behavior should the Nokogiri API have when adding nodes whose attributes contain colons?

As a point of comparison, I played around with JavaScript a little bit. If I create a 'svg' element, it puts it in the HTML namespace unless I explicitly tell it the namespace to use. If I create an attribute with the name xml:lang, it treats it as an attribute with no namespace. If I add that attribute to the svg element in the SVG namespace, it doesn't adjust the attributes.

element = document.createElementNS("http://www.w3.org/2000/svg", "svg");
element.namespaceURI; // "http://www.w3.org/2000/svg"
attr = document.createAttribute("xml:lang");
attr.localName; // "xml:lang"
attr.prefix; // null
attr.namespaceURI; // null
element.setAttributeNode(attr)
element.attributes[0].localName; // "xml:lang"
element.attributes[0].prefix; // null
element.attributes[0].namespaceURI; // null

I'd be inclined to follow that behavior. If you want an element/attribute in a namespace, you gotta construct it in that namespace.

@flavorjones
Copy link
Member Author

@stevecheckoway Thanks for the context dump, and no need to apologize for notification fatigue.

I'm going to take some time to read everything you wrote and make sure I understand it, because my lack of knowledge of HTML5 is showing.

@flavorjones flavorjones modified the milestones: v1.13.0, v1.14.0 Jan 6, 2022
@aditiagarw4722

This comment was marked as off-topic.

This corresponds to the code added in 7b3c972

Note also that this commit adds the same test for HTML4, commented out
because it currently fails, that we will make pass in the next few
commits.
Also fix both implementations of relink_namespaces to not skip
attributes when the node itself doesn't have a namespace.

Finally, the private method
XML::Node#add_child_node_and_reparent_attrs is no longer needed after
fixing relink_namespaces, so this commit essentially reverts 9fd03d8.
@flavorjones flavorjones force-pushed the 1790-better-attr-renamespacing branch from cf0392e to 8dde633 Compare August 30, 2022 16:23
@flavorjones
Copy link
Member Author

Rebased onto currrent main

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants