XMLWriter does not escape supplementary unicode characters correctly #38
Comments
Here is why <foo>��</foo> is not a well-formed xml document:
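One way to check this claim is to hand both forms to an XML parser. XML 1.0 excludes the surrogate range U+D800 to U+DFFF from its `Char` production, so a numeric character reference to a surrogate violates the "Legal Character" well-formedness constraint. The sketch below uses the JDK's built-in parser and assumes U+1F600 as the sample character (its surrogate pair is 55357 and 56832 in decimal). The class and method names are illustrative and not part of dom4j.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class SurrogateReferenceCheck {

    /** Returns true if the JDK's default parser accepts the string as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser reports well-formedness violations as SAX exceptions.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two references, one per UTF-16 surrogate (assumed sample: U+1F600).
        System.out.println(isWellFormed("<foo>&#55357;&#56832;</foo>"));
        // One reference to the whole code point.
        System.out.println(isWellFormed("<foo>&#128512;</foo>"));
    }
}
```

The first document should be rejected and the second accepted. The parser may also log its own error message to standard error for the rejected input.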
I think these two functions are the culprits: dom4j/src/main/java/org/dom4j/io/XMLWriter.java Lines 1626 to 1699 in 9b14152
dom4j/src/main/java/org/dom4j/io/XMLWriter.java Lines 1718 to 1805 in 9b14152
They encode one Java `char` (a UTF-16 code unit) at a time rather than one Unicode code point at a time.
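The difference between the two iteration styles can be sketched with plain JDK calls. This is only an illustration, not the dom4j code: the class and method names are made up, and U+1F600 is an assumed sample character.

```java
import java.util.ArrayList;
import java.util.List;

public class CodeUnitsVsCodePoints {

    /** Walks the string one Java char (UTF-16 code unit) at a time. */
    static List<String> codeUnits(String text) {
        List<String> units = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            units.add(String.format("U+%04X", (int) text.charAt(i)));
        }
        return units;
    }

    /** Walks the string one Unicode code point at a time. */
    static List<String> codePoints(String text) {
        List<String> points = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            points.add(String.format("U+%04X", codePoint));
            // A supplementary code point occupies two chars, so advance by 1 or 2.
            i += Character.charCount(codePoint);
        }
        return points;
    }

    public static void main(String[] args) {
        String text = "a" + new String(Character.toChars(0x1F600));
        System.out.println(codeUnits(text));  // [U+0061, U+D83D, U+DE00]
        System.out.println(codePoints(text)); // [U+0061, U+1F600]
    }
}
```

The char-based loop sees the two surrogates as separate characters, which is how each of them ends up with its own numeric character reference.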
Fixed.
(cherry picked from commit 75e59b1)
(cherry picked from commit b408f43)
When the maximum allowed character is set to a positive value, an XMLWriter is supposed to encode any character with a Unicode code point higher than the maximum allowed character as a numeric character reference. However, for supplementary Unicode characters (code points above U+FFFF), the current implementation seems to generate a sequence of two invalid numeric character references, one per UTF-16 surrogate, instead of a single valid one.
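The intended behaviour can be sketched as follows. This is a minimal illustration, not the actual dom4j fix: `MaxCharEscaper` and `escape` are hypothetical names, and escaping of markup characters such as `<` and `&` is deliberately left out.

```java
public class MaxCharEscaper {

    /**
     * Replaces every code point above maximumAllowedCharacter with a single
     * decimal numeric character reference. Hypothetical helper for illustration.
     */
    static String escape(String text, int maximumAllowedCharacter) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // Read a whole code point, joining a surrogate pair if present.
            int codePoint = text.codePointAt(i);
            if (codePoint > maximumAllowedCharacter) {
                out.append("&#").append(codePoint).append(';');
            } else {
                out.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample: U+1F600, written as its UTF-16 surrogate pair.
        System.out.println(escape("a\uD83D\uDE00", 127)); // a&#128512;
    }
}
```

Because the loop reads whole code points, a surrogate pair yields one reference to the supplementary character rather than two references to its halves.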
To reproduce run:
Expected result:
Actual result:
Notes:
The actual result isn't even well-formed XML: