Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect serialization of XML attributes containing white space #284

Closed
marrus-sh opened this issue Aug 20, 2021 · 7 comments · Fixed by #310
Closed

Incorrect serialization of XML attributes containing white space #284

marrus-sh opened this issue Aug 20, 2021 · 7 comments · Fixed by #310
Labels
breaking change Some thing that requires a version bump due to breaking changes spec:DOM-Parsing spec:XML https://www.w3.org/TR/xml11/
Milestone

Comments

@marrus-sh
Copy link

The code for serializing attributes (here:

xmldom/lib/dom.js

Lines 1138 to 1140 in 6ce4700

function addSerializedAttribute(buf, qualifiedName, value) {
buf.push(' ', qualifiedName, '="', value.replace(/[<&"]/g,_xmlEncoder), '"')
}
) does not treat newlines in any special way.

This is a problem, because XML parses literal newlines as spaces. However, it allows newlines in attributes when they are provided as character references. (See: https://www.w3.org/TR/xml/#AVNormalize.)

So:

<element attribute="with&#xA;newline"></element>

serializes as

<element attribute="with
newline"></element>

which then gets parsed as

<element attribute="with newline"></element>

which is not equivalent.

@marrus-sh
Copy link
Author

(for reference, the affected characters which need to be escaped in attributes are &#x9;, &#xA;, and &#xD;; see, for example, https://www.w3.org/TR/xml-c14n11/#ProcessingModel)

@marrus-sh
Copy link
Author

update: this is an open specification bug in DOM Parsing: w3c/DOM-Parsing#59

@karfau karfau added spec:XML https://www.w3.org/TR/xml11/ spec:DOM-Parsing help-wanted External contributions welcome needs investigation Information is missing and needs to be researched labels Aug 20, 2021
@karfau karfau added this to the after next release milestone Aug 20, 2021
@karfau
Copy link
Member

karfau commented Aug 20, 2021

@marrus-sh Thank you for reporting this, especially for finding and referencing the specification bug.

I believe that the fix needs to be applied at least to the parsing side. But maybe also to the serialization part that you linked, but I'm not sure about it.

I'm not sure how soon I will be able to address it, even though I'm very curious to see the failing tests.

Is there any workaround for this that you are aware of?

@karfau
Copy link
Member

karfau commented Aug 21, 2021

Just out of curiosity I started digging a bit.

I finally understood why you are pointing at the serializer: By checking the (Chrome and Firefox) browser implementations I found out that the value we are currently having in the DOM is correct (\n) in the example you provided, but it's not converted back to the character reference when it's serialized.

There is the following problem: Since xmldom is currently not taking care of normalizing line endings and is only taking care of replacing character references with their literal values (instead of doing everything described in https://www.w3.org/TR/xml/#AVNormalize), just converting literal characters in the DOM back to the character references will also convert literal input to character references, even though they should be converted to spaces.

I'm not sure if this is a huge issue, the snapshot tests that need to be updated when adding the fix look better then before and are closer to the "expected" values:

master...karfau:284-serialize-whitespace-literals-to-character-references

Any opinions?

@karfau karfau changed the title Error serializing XML attributes with newlines Incorrect serialization of XML attributes containing whitespace literals Aug 21, 2021
@karfau karfau added awaiting response Maintainers are waiting for information question Contains open questions that need to be answered and removed needs investigation Information is missing and needs to be researched labels Aug 21, 2021
@marrus-sh
Copy link
Author

this definitely looks closer to expected output!

i might not be understanding the code correctly, but i think if you add a .replace(/[\t\n\r]/g, " ") to

value = source.slice(start,p).replace(/&#?\w+;/g,entityReplacer);
before the entityReplacer, it should handle the attribute normalization on parsing well enough? [you want to replace literal tabs/newlines/etc with spaces, but not character references.]

@karfau karfau changed the title Incorrect serialization of XML attributes containing whitespace literals Incorrect serialization of XML attributes containing white space Aug 21, 2021
@karfau
Copy link
Member

karfau commented Aug 21, 2021

Yes, I already looked at that place in the code before, it's not the only line where this replacement took place, but I moved it into one place and added the replacement of white space literals.

I still need to figure out if this needs to be considered a breaking change, that's why the PR is only a draft for now.
I would consider it a breaking change, so I will not land it right away.

@karfau karfau added breaking change Some thing that requires a version bump due to breaking changes and removed awaiting response Maintainers are waiting for information question Contains open questions that need to be answered help-wanted External contributions welcome labels Aug 21, 2021
@marrus-sh
Copy link
Author

this looks perfect, thank you!! no worries about it not landing right away; i’m just glad it’s getting fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change Some thing that requires a version bump due to breaking changes spec:DOM-Parsing spec:XML https://www.w3.org/TR/xml11/
Projects
None yet
2 participants