Some entities in text content are not converted/escaped when serializing #58

cburatto · 2020-06-24T18:37:10Z

I have noticed that node values/strings containing > are not converted to &gt; when serializing the DOM. Is this due to any flag I should be using?


karfau · 2020-06-24T20:22:06Z

Is the second > in your question written as &gt;? (Surround it with backticks to prevent GitHub from rendering it as >.)
If so:
I'm not aware of any "flags" that would change behavior of handling > in those cases.
I'm aware that xmldom is not doing that currently but (because I'm working on restoring the tests) I'm confident that the parsing is still correctly done.

I think having something that looks like a closing tag bracket in text or values is "uncritical" for XML parsers, which can not be said about < (these need to be converted for the parsing to work).

I have also seen that other parsers (at least one of the abandoned domjs or libxmljs) treat those things in a more consistent manner.

If we decided to change that (after having a stable test suite running on every change), it might be considered a breaking change, even if it's not really a difference (from my current understanding).

So my guess is such a change might not come very soon.

cburatto · 2020-06-24T20:51:43Z

Thanks -- I do mean a tag bracket character. For example, a node value may contain HTML text whose entities have been converted. You load it with the DOMParser, then when you XMLSerialize it the entities are reconverted, except the closing bracket >

Example:
<xmlelement>&lt;b&gt;This is bold&lt;/b&gt;</xmlelement>

gets serialized as
<xmlelement>&lt;b>This is bold&lt;/b></xmlelement>

karfau · 2020-06-25T11:38:28Z

Yes. That's what I was also referring to. But another short search in the old repo shows that there are already multiple issues around it there and also PRs.
I linked the "main thread" above.

sarod · 2020-12-11T14:43:15Z

According to XML spec https://www.w3.org/TR/xml/#NT-CharData
CharData cannot contain the string "]]>" which is reserved to mark the end of a CDATA section.

Quote from spec

In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, " ]]> ". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, " ]]> ".

So > needs to be serialized as &gt; at least when following ]], so that ]]> is serialized as ]]&gt; to avoid this issue.
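The minimal rule described above can be sketched as follows. This is only an illustration under stated assumptions: `escapeCdataClose` is a hypothetical helper name, not xmldom's actual implementation, and it assumes that & and < are escaped elsewhere.

```javascript
// Hypothetical helper (not part of xmldom): escape only the ">" that
// completes the sequence "]]>" in character data, leaving every other
// ">" untouched.
function escapeCdataClose(text) {
  // split/join replaces every occurrence of the literal string "]]>"
  return text.split("]]>").join("]]&gt;");
}

console.log(escapeCdataClose("a ]]> b > c")); // a ]]&gt; b > c
```

A standalone > is left as-is, which is the behavior that keeps changes in the generated XML to a minimum.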

karfau · 2021-01-19T04:38:02Z

@sarod Do I understand you correctly that you agree to a "subset" of the original question, namely for the case of ]]> that needs to be converted to ]]&gt;?

Do you think we should just always convert it, even though the specification you are quoting also says (emphasis mine)

The right angle bracket (>) may be represented using the string " &gt; ", and must, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.

?

I think I'm convinced that we should take care of the ]]> case.
But for anything else I would love to see a failing test (either in a comment or as a PR or as link to any repo).


karfau · 2021-01-21T02:53:52Z

The only thing left to do here is to add (non standard) options to the serializer / toString methods to configure which characters to encode or not encode in different contexts. Which might need some discussions on how to implement it in a flexible and non breaking/disruptive manner.

But I don't consider this a very important feature compared to other topics.
(PRs with tests are of course welcome.)

karfau · 2021-01-21T03:46:40Z

There is some more related information in #22 which is the older one but I will still mark that one duplicate.

sarod · 2021-01-21T20:57:45Z

Sorry for the late answer

@sarod Do I understand you correctly that you agree to a "subset" of the original question, namely for the case of ]]> that needs to be converted to ]]&gt;?

Yes.

Do you think we should just always convert it?
The way I understand the XML spec, outside of the sequence ']]>' in character data content, both '>' and '&gt;' are valid.

Limiting the conversion to the ']]>' case has the advantage of minimizing the changes in the generated XML, and so is likely to cause fewer breaking changes for consumers of the library. That would be my recommendation, but the choice is yours.

karfau · 2021-01-23T11:19:15Z

@sarod Thx for the clarification.
This change already landed on master: #181
and we are planning the next release https://github.com/xmldom/xmldom/milestone/3

This issue is left open just for the more general topic of controlling the general behavior of serializing entities, hence the changed title.

SheetJSDev · 2022-04-04T18:22:37Z

@karfau the spec seems to suggest that > must be encoded in XMLSerializer#serializeToString:

https://www.w3.org/TR/2016/WD-DOM-Parsing-20160517/ is the spec covering the method in question.

XMLSerializer#serializeToString: "produce an XML serialization"

step 5: Return the result of running the XML serialization algorithm

step 14: Append to markup the result of the XML serialization of node's attributes

step 9.3: The result of serializing an attribute value given attr's value attribute

step 3: Text:

"""

Otherwise, attribute value is a string. Return the value of attribute value, first replacing any occurrences of the following:

" with &quot;
& with &amp;
< with &lt;
> with &gt;

NOTE

This matches behavior present in browsers, and goes above and beyond the grammar requirement in the XML specification's AttValue production [XML10] by also replacing ">" characters.
"""

The correct parsing of the spec is that XML10 did not require > in attribute values to be escaped, but XMLSerializer#serializeToString does.
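The replacement list quoted above can be sketched in a few lines. This is a simplified illustration, not xmldom's code: `serializeAttributeValue` is a hypothetical name, and the sketch leaves out the spec's well-formedness check and null handling.

```javascript
// Hypothetical helper: apply the four replacements the DOM Parsing
// spec lists for attribute values. "&" is replaced first so that the
// ampersands introduced by the later replacements are not escaped again.
function serializeAttributeValue(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

console.log('<foo bar="' + serializeAttributeValue(">") + '"/>');
// <foo bar="&gt;"/>
```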

karfau · 2022-04-04T18:45:32Z

@SheetJSDev Thank you for digging deeper.

<tagname attribute_key="attribute value">text content</tagname>

But the lines you quote are talking about attribute values (where xmldom already implements it), not about text content.
But I also looked up those, and you are also right regarding >:

(It's part of this section, but requires some scrolling.)

I will update the labels accordingly.

SheetJSDev · 2022-04-04T18:51:13Z

A bunch of issues related to > link here, so the conversation is confusing. The deep dive concerns the following:

new XMLSerializer().serializeToString(new DOMParser().parseFromString('<foo bar="&gt;"/>', 'text/xml').documentElement)

In Chrome:

> new XMLSerializer().serializeToString(new DOMParser().parseFromString('<foo bar="&gt;"/>', 'text/xml').documentElement)
< '<foo bar="&gt;"/>'

In version 0.8.1:

> const { XMLSerializer, DOMParser } = require("@xmldom/xmldom")
undefined
> new XMLSerializer().serializeToString(new DOMParser().parseFromString('<foo bar="&gt;"/>', 'text/xml').documentElement)
'<foo bar=">"/>'

karfau · 2022-04-05T05:44:37Z

You are right, it looks like I confused myself with all the different threads on this issue.
So far we only took care of <, & and whitespace in attributes and the special case of ]]> in text content.

With the links you provided it makes sense to me that this is a bug and it will be fixed soon.

Thank you for insisting.

https://stackblitz.com/edit/js-xmldom58?devToolsHeight=33&file=index.js
https://w3c.github.io/DOM-Parsing/#serializing-an-element-s-attributes
https://w3c.github.io/DOM-Parsing/#xml-serializing-a-text-node
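For text content, the linked "XML serializing a Text node" section lists three replacements (&, < and >). A minimal sketch, assuming a hypothetical helper `serializeTextContent` that is not xmldom's actual code:

```javascript
// Hypothetical helper: escape text node data the way the DOM Parsing
// spec describes for text nodes. Quotes need no escaping in text content.
function serializeTextContent(data) {
  return data
    .replace(/&/g, "&amp;") // first, so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

console.log(serializeTextContent("1 < 2 && x ]]> y"));
// 1 &lt; 2 &amp;&amp; x ]]&gt; y
```

Because every > is escaped here, the special case of ]]> is covered without a separate rule.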


karfau · 2022-04-05T17:06:16Z

Using the master branch this issue should be resolved now.
Please let me know if it is not.
I will release 0.8.2 soon.

SheetJSDev · 2022-04-05T17:13:28Z

Looks good:

$ node -pe 'const { XMLSerializer, DOMParser } = require("@xmldom/xmldom"); new XMLSerializer().serializeToString(new DOMParser().parseFromString("<foo bar=\"&gt;\"/>", "text/xml").documentElement)'
<foo bar=">"/>
$ git clone --depth=1 https://github.com/xmldom/xmldom
$ cd xmldom
$ node -pe 'const { XMLSerializer, DOMParser } = require("./"); new XMLSerializer().serializeToString(new DOMParser().parseFromString("<foo bar=\"&gt;\"/>", "text/xml").documentElement)'
<foo bar="&gt;"/>

karfau mentioned this issue Jun 25, 2020

Both angle braces must be escaped jindw/xmldom#164

Open

bhovhannes mentioned this issue Dec 11, 2020

Merge generates invalid xml when failure message contains "]]>" text bhovhannes/junit-report-merger#59

Closed

karfau added a commit to karfau/xmldom that referenced this issue Jan 19, 2021

fix(dom): Escape ]]> when serializing CharData

169738a

to adhere to the XML specification https://www.w3.org/TR/xml/#NT-CharData xmldom#58 (comment)

karfau mentioned this issue Jan 19, 2021

fix(dom): Escape ]]> when serializing CharData #181

Merged

karfau added enhancement spec:no standard labels Jan 21, 2021

karfau mentioned this issue Jan 21, 2021

Encoding issue when xml contains : "<x>>10</x>" #22

Closed

karfau changed the title ~~'Larger Than' (>) not converted to entity when XMLSerialized~~ Some entities in text content are not converted/escaped when serializing Jan 21, 2021

karfau added the xml:valid https://www.w3.org/TR/xml11/#dt-valid label Jan 21, 2021

karfau mentioned this issue Jan 21, 2021

Forcing full closing tags during serialization? #50

Open

karfau mentioned this issue Mar 13, 2021

'<' and '>' are not escaped in attribute values #198

Closed

SmartLayer mentioned this issue Mar 30, 2021

& parsing wrong when xmlns points to XHTML #203

Closed

cburatto mentioned this issue Jun 12, 2021

Subtemplate rendering causes Unclosed tag error due to unescaped closing tags >> open-xml-templating/docxtemplater#606

Closed

karfau removed the enhancement label Aug 24, 2021

karfau added this to the planning 1.0.0 milestone Aug 28, 2021

karfau mentioned this issue Nov 3, 2021

XML attributes escaping seems to be broken #339

Closed

karfau added bug Something isn't working spec:DOM-Parsing and removed spec:no standard labels Apr 4, 2022

karfau modified the milestones: planning 1.0.0, before 1.0.0 Apr 4, 2022

karfau added a commit that referenced this issue Apr 5, 2022

fix(dom): Serialize > as specified

b79be23

in both attributes and text content Fixes #58 https://w3c.github.io/DOM-Parsing/#xml-serializing-a-text-node https://w3c.github.io/DOM-Parsing/#serializing-an-element-s-attributes

karfau mentioned this issue Apr 5, 2022

fix(dom): Serialize > as specified #395

Merged

karfau modified the milestones: before 1.0.0, 0.8.2 Apr 5, 2022

karfau closed this as completed in #395 Apr 5, 2022

karfau added a commit that referenced this issue Apr 5, 2022

fix(dom): Serialize > as specified (#395)

c234c4d

in both attributes and text content Fixes #58 https://w3c.github.io/DOM-Parsing/#xml-serializing-a-text-node https://w3c.github.io/DOM-Parsing/#serializing-an-element-s-attributes

PanierAvide mentioned this issue Nov 3, 2022

Update xmldom dependency dtc-innovation/anonymisation-document-budgetaire#55

Open

theschitz mentioned this issue Nov 11, 2022

build(deps): bump @xmldom/xmldom from 0.7.5 to 0.7.6 in /extension jwikman/nab-al-tools#413

Closed
