Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

entity en/decoding different between parser and serializer #421

Closed
sbresin opened this issue Jul 21, 2022 · 5 comments
Closed

entity en/decoding different between parser and serializer #421

sbresin opened this issue Jul 21, 2022 · 5 comments
Labels
spec:DOM-Parsing spec:XML https://www.w3.org/TR/xml11/ wontfix This will not be worked on

Comments

@sbresin
Copy link

sbresin commented Jul 21, 2022

Description

According to the XML-Spec, <, >, & have to be encoded in attributes and text nodes.
In attributes additionally ' and " have to be encoded.

The XMLSerializer does this encoding according to the spec. (except for &apos; in attributes, which is a bug, but super easily fixed)

The parser on the other hand, decodes all 5 entities in attributes AND in text nodes.

I have to process XMLs, where all 5 entities are also encoded for text fields. Parsing, modifying and then serializing these XMLs then changes all the text nodes.

How to replicate

// test.mjs
import { DOMParser, XMLSerializer } from '@xmldom/xmldom';

const testxml =
`<?xml version="1.0" encoding="UTF-8"?>
<rootel xmlns="http://soap.sforce.com/2006/04/metadata">
    <textnode testattribute="&amp; &lt; &gt; &apos; &quot;">
      &amp;
      &lt;
      &gt;
      &apos;
      &quot;
    </textnode>
</rootel>
`;

const xmldoc = new DOMParser().parseFromString(testxml, 'text/xml');

const serializedXml = new XMLSerializer().serializeToString(xmldoc);

console.log(serializedXml);

outputs this:

<?xml version="1.0" encoding="UTF-8"?>
<rootel xmlns="http://soap.sforce.com/2006/04/metadata">
    <textnode testattribute="&amp; &lt; &gt; ' &quot;">
      &amp;
      &lt;
      &gt;
      '
      "
    </textnode>
</rootel>

Solution

I am happy to open a PR for this, but first wanted to clarify the approach:

  1. simplest one: change the serializer, to encode all entities for text and attributes
    • It's a very simple 2 lines change, but it then encodes more chars than required by the spec
  2. OR: change parser to only decode &amp;, &lt; and &gt; for text nodes (here)
    • should only limit it in XML mode, would need to stay the same for html
    • would be spec compliant
    • could be breaking for people who are used to have all 5 entities being decoded
@karfau
Copy link
Member

karfau commented Jul 22, 2022

Thank you for this awesome bug report.
We generally prefer spec compliant approaches "2." in your case.

Can you please check if the behavior is still present in the version that is upcoming as part of #338 ?
That PR will give us all the options to properly treat XML and HTML differently in all regards.

If it doesn't already solve the issue you are describing, I would ask you to either base your work on that PR (I can take care of rebasing your branch if required) or wait until it has been merged.

I didn't find much time to work on this repo recently, but I want to get back to it in the next weeks/months.

@karfau karfau added help-wanted External contributions welcome spec:XML https://www.w3.org/TR/xml11/ breaking change Some thing that requires a version bump due to breaking changes spec:DOM-Parsing labels Jul 22, 2022
@karfau karfau added this to the next breaking/minor release milestone Jul 22, 2022
@marrus-sh
Copy link

apos should NOT be encoded in attributes according to the latest XMLSerializer spec: https://w3c.github.io/DOM-Parsing/#dfn-serializing-an-attribute-value

so this is not a bug, it is expected behaviour

@marrus-sh
Copy link

marrus-sh commented Jul 31, 2022

as for “decoding entities in text nodes”, this is necessary for things like .textContent to work as expected, and it is also what happens in browsers

const doc = new DOMParser().parseFromString("<root>&amp;&lt;&gt;&apos;&quot;</root>", 'text/xml')
doc.documentElement.textContent;
// should be &<>'"
new XMLSerializer().serializeToString(doc);
// should be <root>&amp;&lt;&gt;'"</root>

you can run these in your browser console and see that this is the expected result

@sbresin
Copy link
Author

sbresin commented Aug 1, 2022

Hey @marrus-sh ,

Thanks for taking a close look. I only looked at the XML spec before .... 😅

Then I guess I'll have to change the HTML Mode to de- and encode everything that looks like a known entity and use this to modify the seemingly non compliant XMLs, that I have to deal with.

I'll check the linked spec and the behaviour in browsers, thanks for pointing it out!

@karfau
Copy link
Member

karfau commented Oct 26, 2022

@sbresin @marrus-sh do I understand correctly that the current behavior of xmldom is what is expected by the spec and also what happens in browsers and we can close this issue as wontfix?

Ps: in case we need to treat html and xml differently, we are now able to do that quite easily and reliably.

@karfau karfau added wontfix This will not be worked on and removed help-wanted External contributions welcome breaking change Some thing that requires a version bump due to breaking changes labels Jun 11, 2023
@karfau karfau closed this as completed Jun 11, 2023
@karfau karfau removed this from the next breaking/minor release milestone Jun 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:DOM-Parsing spec:XML https://www.w3.org/TR/xml11/ wontfix This will not be worked on
Projects
Status: Done
Development

No branches or pull requests

3 participants