entity en/decoding different between parser and serializer #421

sbresin · 2022-07-21T15:18:01Z

Description

According to the XML-Spec, <, >, & have to be encoded in attributes and text nodes.
In attributes additionally ' and " have to be encoded.

The XMLSerializer does this encoding according to the spec. (except for ' in attributes, which is a bug, but super easily fixed)

The parser, on the other hand, decodes all 5 entities in attributes AND in text nodes.

I have to process XML files where all 5 entities are also encoded in text nodes. Parsing, modifying and then serializing these files changes all the text nodes.

How to replicate

// test.mjs
import { DOMParser, XMLSerializer } from '@xmldom/xmldom';

const testxml =
`<?xml version="1.0" encoding="UTF-8"?>
<rootel xmlns="http://soap.sforce.com/2006/04/metadata">
    <textnode testattribute="&amp; &lt; &gt; &apos; &quot;">
      &amp;
      &lt;
      &gt;
      &apos;
      &quot;
    </textnode>
</rootel>
`;

const xmldoc = new DOMParser().parseFromString(testxml, 'text/xml');

const serializedXml = new XMLSerializer().serializeToString(xmldoc);

console.log(serializedXml);

outputs this:

<?xml version="1.0" encoding="UTF-8"?>
<rootel xmlns="http://soap.sforce.com/2006/04/metadata">
    <textnode testattribute="&amp; &lt; &gt; ' &quot;">
      &amp;
      &lt;
      &gt;
      '
      "
    </textnode>
</rootel>

Solution

I am happy to open a PR for this, but first wanted to clarify the approach:

1. Simplest one: change the serializer to encode all entities for text and attributes
   - It's a very simple 2-line change, but it then encodes more chars than required by the spec
2. OR: change the parser to only decode &, < and > for text nodes
   - should only be limited in XML mode, would need to stay the same for HTML
   - would be spec compliant
   - could be breaking for people who are used to having all 5 entities decoded
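
For illustration, the serializer-side option could look roughly like the sketch below. It is a standalone helper, not xmldom's actual code: the name `escapeAllFive` and the `PREDEFINED` table are hypothetical and only show what "encode all entities" would mean for a text value.

```javascript
// Hypothetical helper, NOT part of xmldom: escapes all five predefined
// XML entities, which is more than the spec requires for text nodes.
const PREDEFINED = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  "'": '&apos;',
  '"': '&quot;',
};

function escapeAllFive(text) {
  // A single pass over the input, so an already produced "&amp;" is never escaped twice.
  return text.replace(/[&<>'"]/g, (ch) => PREDEFINED[ch]);
}

console.log(escapeAllFive(`& < > ' "`));
// → &amp; &lt; &gt; &apos; &quot;
```

The trade-off is the one named above: the output is still well-formed XML, but it differs from what spec-following serializers produce for the same DOM.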


karfau · 2022-07-22T13:10:50Z

Thank you for this awesome bug report.
We generally prefer spec-compliant approaches, so "2." in your case.

Can you please check if the behavior is still present in the version that is upcoming as part of #338 ?
That PR will give us all the options to properly treat XML and HTML differently in all regards.

If it doesn't already solve the issue you are describing, I would ask you to either base your work on that PR (I can take care of rebasing your branch if required) or wait until it has been merged.

I didn't find much time to work on this repo recently, but I want to get back to it in the next weeks/months.

marrus-sh · 2022-07-31T22:24:27Z

apos should NOT be encoded in attributes according to the latest XMLSerializer spec: https://w3c.github.io/DOM-Parsing/#dfn-serializing-an-attribute-value

so this is not a bug, it is expected behaviour
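
A minimal sketch of the attribute-value escaping as described in the linked algorithm, assuming only the four basic replacements (&, ", <, >) and ignoring the well-formedness check. The function name `serializeAttributeValue` is hypothetical and this is not xmldom's implementation.

```javascript
// Hypothetical sketch of the "serializing an attribute value" step.
// Assumption: only the four basic replacements are modelled here.
function serializeAttributeValue(value) {
  if (value == null) return '';
  return value
    .replace(/&/g, '&amp;') // must run first so later entities are not re-escaped
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

console.log(serializeAttributeValue(`& < > ' "`));
// → &amp; &lt; &gt; ' &quot;
```

The apostrophe passes through untouched, which matches the testattribute value in the output reported in the issue description.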

marrus-sh · 2022-07-31T22:30:42Z

as for “decoding entities in text nodes”, this is necessary for things like .textContent to work as expected, and it is also what happens in browsers

const doc = new DOMParser().parseFromString("<root>&amp;&lt;&gt;&apos;&quot;</root>", 'text/xml')
doc.documentElement.textContent;
// should be &<>'"
new XMLSerializer().serializeToString(doc);
// should be <root>&amp;&lt;&gt;'"</root>

you can run these in your browser console and see that this is the expected result
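
The same round trip can be sketched without a DOM, assuming the parser decodes the five predefined entities and the serializer escapes only &, < and > in text nodes. Both helper names are hypothetical and the code is an illustration, not the browser's or xmldom's implementation.

```javascript
// Hypothetical helpers illustrating the decode/encode asymmetry for text nodes.
const ENTITIES = { amp: '&', lt: '<', gt: '>', apos: "'", quot: '"' };

// What a parser does to character data: resolve the predefined entities.
function decodePredefined(raw) {
  return raw.replace(/&(amp|lt|gt|apos|quot);/g, (_, name) => ENTITIES[name]);
}

// What a spec-following serializer does to a text node's data.
function serializeText(data) {
  return data
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

const decoded = decodePredefined('&amp;&lt;&gt;&apos;&quot;');
console.log(decoded);                // → &<>'"
console.log(serializeText(decoded)); // → &amp;&lt;&gt;'"
```

Because the decoded text no longer records which characters were written as entities, the serializer cannot reproduce `&apos;` and `&quot;` from the DOM alone.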

sbresin · 2022-08-01T10:28:18Z

Hey @marrus-sh ,

Thanks for taking a close look. I only looked at the XML spec before .... 😅

Then I guess I'll have to change the HTML mode to de- and encode everything that looks like a known entity, and use this to modify the seemingly non-compliant XMLs that I have to deal with.

I'll check the linked spec and the behaviour in browsers, thanks for pointing it out!

karfau · 2022-10-26T10:36:04Z

@sbresin @marrus-sh do I understand correctly that the current behavior of xmldom is what is expected by the spec and also what happens in browsers and we can close this issue as wontfix?

Ps: in case we need to treat html and xml differently, we are now able to do that quite easily and reliably.

karfau added the labels help-wanted (External contributions welcome), spec:XML (https://www.w3.org/TR/xml11/), breaking change (Some thing that requires a version bump due to breaking changes) and spec:DOM-Parsing on Jul 22, 2022

karfau added this to the next breaking/minor release milestone Jul 22, 2022

karfau added the label wontfix (This will not be worked on) and removed the labels help-wanted (External contributions welcome) and breaking change (Some thing that requires a version bump due to breaking changes) on Jun 11, 2023

karfau closed this as completed Jun 11, 2023

karfau removed this from the next breaking/minor release milestone Jun 11, 2023
