Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some entities in text content are not converted/escaped when serializing #58

Closed
cburatto opened this issue Jun 24, 2020 · 15 comments · Fixed by #395
Closed

Some entities in text content are not converted/escaped when serializing #58

cburatto opened this issue Jun 24, 2020 · 15 comments · Fixed by #395
Labels
bug Something isn't working spec:DOM-Parsing xml:valid https://www.w3.org/TR/xml11/#dt-valid
Milestone

Comments

@cburatto
Copy link

cburatto commented Jun 24, 2020

I have noticed that node values/strings containing > are not converted to > when serializing the DOM. Is this due to any flag I should be using?

@karfau
Copy link
Member

karfau commented Jun 24, 2020

Is the second > in your question written as >? (Surround it with back ticks to prevent github from rendering it as >.)
If so:
I'm not aware of any "flags" that would change behavior of handling > in those cases.
I'm aware that xmldom is not doing that currently but (because I'm working on restoring the tests) I'm confident that the parsing is still correctly done.

I think having something that looks like a closing tag bracket in text or values is "uncritical" for XML parsers, which can not be said about < (these need to be converted for the parsing to work).

I have also seen that other parsers (at least one of the abandoned domjs or libxmljs) treat those things in a more consistent manner.

If we would decide to change that (after having a stable test suite running on every change), even if it's not really a difference (from my current understanding) it might be considered a breaking change.

So my guess is such a change might not come very soon.

@cburatto
Copy link
Author

Thanks -- I do mean a tag bracket character. For example, if a node value contains HTML text where entities have been converted. You load it with the DOMParser, then when you XMLSerialize it the entities are reconverted, except the closing bracket >

Example:
<xmlelement>&lt;b&gt;This is bold&lt;/b&gt;</xmlelement>

gets serialized as
<xmlelement>&lt;b>This is bold&lt;/b></xmlelement>

@karfau
Copy link
Member

karfau commented Jun 25, 2020

Yes. That's what I was also referring to. But another short search in the old repo shows that there are already multiple issues around it there and also PRs.
I linked the "main thread" above.

@sarod
Copy link

sarod commented Dec 11, 2020

According to XML spec https://www.w3.org/TR/xml/#NT-CharData
CharData cannot contain the string "]]>" which is reserved to mark the end of a CDATA section.

Quote from spec

In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, " ]]> ". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, " ]]> ".

So > needs to be serialized as &gt; at least when following ]] so that ]]> is serialized as ]]&gt; to avoid this issue

@karfau
Copy link
Member

karfau commented Jan 19, 2021

@sarod Do I understand you correctly that you agree to a "subset" of the original question, namely for the case of ]]> that needs to be converted to ]]%gt;?

Do you think we should just always convert it, even thought he specification you are quoting also says (emphasis mine)

The right angle bracket (>) may be represented using the string " > ", and must, for compatibility, be escaped using either " > " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.

?

I think I'm convinced that we should take care of the ]]> case.
But for anything else I would love to see a failing test (either in a comment or as a PR or as link to any repo).

@karfau
Copy link
Member

karfau commented Jan 21, 2021

The only thing left to do here is to add (non standard) options to the serializer / toString methods to configure which characters to encode or not encode in different contexts. Which might need some discussions on how to implement it in a flexible and non breaking/disruptive manner.

But I don't consider this a very important feature compared to other topics.
(PRs with tests are of course welcome.)

@karfau
Copy link
Member

karfau commented Jan 21, 2021

There is some more related information in #22 which is the older one but I will still mark that one duplicate.

@karfau karfau changed the title 'Larger Than' (>) not converted to entity when XMLSerialized Some entities in text content are not converted/escaped when serializing Jan 21, 2021
@karfau karfau added the xml:valid https://www.w3.org/TR/xml11/#dt-valid label Jan 21, 2021
@sarod
Copy link

sarod commented Jan 21, 2021

Sorry for the late answer

@sarod Do I understand you correctly that you agree to a "subset" of the original question, namely for the case of ]]> that needs to be converted to ]]%gt;?

Yes.

Do you think we should just always convert it?
The way I understand the xml spec is that outside of sequence ']]>' in a cdata content both '>,' and '>' are valid.

Limiting the conversion to the ']]>' case as the advantage of minimizing the changes in generated xml and so is likely to cause less breaking changes for consumers of the library. So that would be my recommendation but the choice is yours.

@karfau
Copy link
Member

karfau commented Jan 23, 2021

@sarod Thx for the clarification.
This change already landed on master: #181
and we are planning the next release https://github.com/xmldom/xmldom/milestone/3

This issue is left open just for the more general topic of controlling the general behavior of serializing entities, hence the changed title.

@SheetJSDev
Copy link

@karfau the spec seems to suggest that > must be encoded in XMLSerializer#serializeToString:

https://www.w3.org/TR/2016/WD-DOM-Parsing-20160517/ is the spec covering the method in question.

XMLSerializer#serializeToString: "produce an XML serialization"

step 5: Return the result of running the XML serialization algorithm

step 14: Append to markup the result of the XML serialization of node's attributes

step 9.3: The result of serializing an attribute value given attr's value attribute

step 3: Text:

"""

Otherwise, attribute value is a string. Return the value of attribute value, first replacing any occurrences of the following:

  1. " with &quot;
  2. & with &amp;
  3. < with &lt;
  4. > with &gt;

NOTE

This matches behavior present in browsers, and goes above and beyond the grammar requirement in the XML specification's AttValue production [XML10] by also replacing ">" characters.
"""

The correct parsing of the spec is that XML10 did not require > in attribute values to be escaped, but XMLSerializer#serializeToString does.

@karfau
Copy link
Member

karfau commented Apr 4, 2022

@SheetJSDev Thank you for digging deeper.

<tagname attribute_key="attribute value">text content</tagname>

But the lines you quote are talking about attribute values (where xmldom already implements it), not about text content.
But I also looked up those, and you are also right regarding >:
image
(It's part of this section, but requires some scrolling.)

I will update the labels accordingly.

@karfau karfau added bug Something isn't working spec:DOM-Parsing and removed spec:no standard labels Apr 4, 2022
@karfau karfau modified the milestones: planning 1.0.0, before 1.0.0 Apr 4, 2022
@SheetJSDev
Copy link

A bunch of issues related to > link here, so the conversation is confusion. The deep dive concerns the following:

new XMLSerializer().serializeToString(new DOMParser().parseFromString('<foo bar="&gt;"/>', 'text/xml').documentElement)

In Chrome:

> new XMLSerializer().serializeToString(new DOMParser().parseFromString('<foo bar="&gt;"/>', 'text/xml').documentElement)
< '<foo bar="&gt;"/>'

In version 0.8.1:

> const { XMLSerializer, DOMParser } = require("@xmldom/xmldom")
undefined
> new XMLSerializer().serializeToString(new DOMParser().parseFromString('<foo bar="&gt;"/>', 'text/xml').documentElement)
'<foo bar=">"/>'

@karfau
Copy link
Member

karfau commented Apr 5, 2022

You are right, it looks like I confused myself with all the different threads on this issue.
So far we only took care of <, & and whitespace in attributes and the special case of ]]> in text content.

With the links you provided it makes sense to me that his is a bug and it will be fixed soon.

Thank you for insisting.

https://stackblitz.com/edit/js-xmldom58?devToolsHeight=33&file=index.js
https://w3c.github.io/DOM-Parsing/#serializing-an-element-s-attributes
https://w3c.github.io/DOM-Parsing/#xml-serializing-a-text-node

@karfau
Copy link
Member

karfau commented Apr 5, 2022

Using the master branch this issue should be resolved now.
Please let me know if it is not.
I will release 0.8.2 soon.

@SheetJSDev
Copy link

Looks good:

$ node -pe 'const { XMLSerializer, DOMParser } = require("@xmldom/xmldom"); new XMLSerializer().serializeToString(new DOMParser().parseFromString("<foo bar=\"&gt;\"/>", "text/xml").documentElement)'
<foo bar=">"/>
$ git clone --depth=1 https://github.com/xmldom/xmldom
$ cd xmldom
$ node -pe 'const { XMLSerializer, DOMParser } = require("./"); new XMLSerializer().serializeToString(new DOMParser().parseFromString("<foo bar=\"&gt;\"/>", "text/xml").documentElement)'
<foo bar="&gt;"/>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working spec:DOM-Parsing xml:valid https://www.w3.org/TR/xml11/#dt-valid
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants