Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML attributes escaping seems to be broken #339

Closed
mykola-mokhnach opened this issue Nov 3, 2021 · 5 comments
Closed

XML attributes escaping seems to be broken #339

mykola-mokhnach opened this issue Nov 3, 2021 · 5 comments
Labels
duplicate This issue or pull request already exists

Comments

@mykola-mokhnach
Copy link

It looks like attributes parsing has been broken in some previous xmldom versions. I'm not sure which one, but here are the details:

import xmldom from '@xmldom/xmldom';

let srcTree = `<?xml version="1.0" encoding="UTF-8"?>
  <element attribute="&lt;&gt;&quot;&amp;"/>
`;

let parser = new xmldom.DOMParser();

let tree = parser.parseFromString(srcTree);

let doc = parser.parseFromString(`<?xml version="1.0" encoding="UTF-8"?><root/>`);
doc.documentElement.appendChild(tree.documentElement);

console.log(new xmldom.XMLSerializer().serializeToString(doc));

The expected output here would be <?xml version="1.0" encoding="UTF-8"?><root><element attribute="&lt;&gt;&quot;&amp;"/></root>, but the current recent module version (0.7.5) outputs
<?xml version="1.0" encoding="UTF-8"?><root><element attribute="&lt;>&quot;&amp;"/></root>, e.g.
the > character does not get escaped properly while it is prohibited to use in attribute values according to the XML spec.

@mykola-mokhnach
Copy link
Author

Probably related to #198

@karfau karfau added the duplicate This issue or pull request already exists label Nov 3, 2021
@karfau
Copy link
Member

karfau commented Nov 3, 2021

Thank you for reporting it, but this is a duplicate of #58.

There are some issues with your report though:

  1. I checked past versions of xmldom with your code, including all deprecated versions that the new maintainers feel responsible for: https://stackblitz.com/edit/js-saacme?file=index.js
    (to see the output you need to expand the console on the bottom right)
    And didn't find a single version in which xmldom serialized > as &gt;.

Just to be extra sure, I also additionally log the value of the attribute after parsing which of course always contains the > as expected, so it's a pure serialization issue, not a parsing issue.

  1. I'm not aware of any XML spec that forbids > in attribute values. As I said before:

Just some details about why I didn't also add escaping for >:

According to the specs I linked in the PR #199, the < not being part of attribute values is a well-formedness constraint (so it's not about validity of the XML).
But this is not true for > in attribute values, or at least I couldn't find anything about it.

Here are those links again:

If you know of any other spec that indicates the limitation you are describing, please provide a link to it.

@karfau karfau closed this as completed Nov 3, 2021
@mykola-mokhnach
Copy link
Author

mykola-mokhnach commented Nov 4, 2021

Thank you for the quick response @karfau

What I could read in the specification that you've attached is:

The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

xmllib2 and other serializers are doing that escaping for ">" character (if you know any lib that does not then please let me know). I was just wondering then what is the reason this library should be different from them?

@karfau
Copy link
Member

karfau commented Nov 4, 2021

Here is how I understand the specs, piece by piece:

The right angle bracket (>) may be represented using the string "&gt;",

not a must, but allowed

and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, [and] when that string is not marking the end of a CDATA section.

(empahsis mine) Only when it is part of content and preceded by ]] and not marking the end of a CDATA section it has to be escaped.
This part is taken care of by xmldom since 0.5.0.

Why is xmldom not doing it like other libaries?
Because "it" didn't have a high enough priority for us so far.
And "it" not just being to convert > in the different places, but in general taking care of "How and to what degree can users of xmldom control what characters should be escaped during serialization?" => If you want to contribute to that discussion (beyond talking about >), please add your ideas/comments to #58

@mykola-mokhnach
Copy link
Author

Sure, thanks for the detailed explanation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duplicate This issue or pull request already exists
Projects
None yet
Development

No branches or pull requests

2 participants