XMLWriter does not escape supplementary unicode characters correctly #38

Closed
abenkovskii opened this issue Jan 31, 2018 · 3 comments

@abenkovskii

When the maximum allowed character is set to a positive value, an XMLWriter is supposed to encode any character with a Unicode code point higher than the maximum allowed character as a numeric character reference. However, for supplementary Unicode characters the current implementation seems to generate a sequence of two invalid numeric character references instead of one valid reference.

To reproduce, run:

import org.dom4j.io.XMLWriter;
import org.dom4j.io.OutputFormat;
import org.dom4j.tree.DefaultElement;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

class XmlBugDemo {
	public static void main(String[] arg) throws IOException {
		ByteArrayOutputStream stream = new ByteArrayOutputStream();
		OutputFormat format = OutputFormat.createPrettyPrint();
		format.setEncoding("US-ASCII");
		XMLWriter writer = new XMLWriter(stream, format);

		// this string contains a single unicode code point:
		// U+1F427 PENGUIN
		String penguin = "\ud83d\udc27";
		DefaultElement foo = new DefaultElement("foo");
		foo.addText(penguin);

		writer.write(foo);
		
		System.out.println(stream.toString("US-ASCII"));
	}
}

Expected result:

<foo>&#128039;</foo>

Actual result:

<foo>&#55357;&#56359;</foo>

Notes:
The actual result isn't even well-formed XML:

$ xmllint bad.xml 
bad.xml:2: parser error : xmlParseCharRef: invalid xmlChar value 55357
<foo>&#55357;&#56359;</foo>
             ^
bad.xml:2: parser error : xmlParseCharRef: invalid xmlChar value 56359
<foo>&#55357;&#56359;</foo>
                     ^
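
For reference, the expected value 128039 comes from combining the two UTF-16 surrogates into a single code point. A minimal sketch using only java.lang.Character (the class name SurrogateMath is made up for illustration and is not part of dom4j):

```java
class SurrogateMath {
    // Combine a high/low surrogate pair into a single Unicode code point.
    static int combine(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        String penguin = "\ud83d\udc27";
        // charAt sees two separate UTF-16 code units ...
        System.out.println((int) penguin.charAt(0)); // 55357 (0xD83D)
        System.out.println((int) penguin.charAt(1)); // 56359 (0xDC27)
        // ... while codePointAt sees the single code point U+1F427.
        System.out.println(penguin.codePointAt(0));  // 128039
        System.out.println(combine(penguin.charAt(0), penguin.charAt(1))); // 128039
    }
}
```

The two numbers in the actual result are exactly the values charAt returns for the two halves of the pair.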
@abenkovskii
Author

Here is why

<foo>&#55357;&#56359;</foo>

is not a well-formed XML document:

  1. XML specification section 4.1 states:

Well-formedness constraint: Legal Character

Characters referred to using character references MUST match the production for Char.

  2. Char is defined in XML specification section 2.2:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

  3. Both 55357 (0xD83D) and 56359 (0xDC27) are in the surrogate blocks.
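
The check can be reproduced in a few lines of Java. This is only a sketch that transcribes the Char production quoted above; XmlCharCheck and isXmlChar are illustrative names, not dom4j API:

```java
class XmlCharCheck {
    // Direct transcription of the Char production from XML 1.0, section 2.2.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXmlChar(55357));  // false: high surrogate 0xD83D
        System.out.println(isXmlChar(56359));  // false: low surrogate 0xDC27
        System.out.println(isXmlChar(128039)); // true: U+1F427 PENGUIN
    }
}
```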

@abenkovskii
Author

I think these two functions are the culprit:

protected String escapeElementEntities(String text) {
	char[] block = null;
	int i;
	int last = 0;
	int size = text.length();
	for (i = 0; i < size; i++) {
		String entity = null;
		char c = text.charAt(i);
		switch (c) {
			case '<':
				entity = "&lt;";
				break;
			case '>':
				entity = "&gt;";
				break;
			case '&':
				entity = "&amp;";
				break;
			case '\t':
			case '\n':
			case '\r':
				// don't encode standard whitespace characters
				if (preserve) {
					entity = String.valueOf(c);
				}
				break;
			default:
				if ((c < 32) || shouldEncodeChar(c)) {
					entity = "&#" + (int) c + ";";
				}
				break;
		}
		if (entity != null) {
			if (block == null) {
				block = text.toCharArray();
			}
			buffer.append(block, last, i - last);
			buffer.append(entity);
			last = i + 1;
		}
	}
	if (last == 0) {
		return text;
	}
	if (last < size) {
		if (block == null) {
			block = text.toCharArray();
		}
		buffer.append(block, last, i - last);
	}
	String answer = buffer.toString();
	buffer.setLength(0);
	return answer;
}

protected String escapeAttributeEntities(String text) {
	char quote = format.getAttributeQuoteCharacter();
	char[] block = null;
	int i;
	int last = 0;
	int size = text.length();
	for (i = 0; i < size; i++) {
		String entity = null;
		char c = text.charAt(i);
		switch (c) {
			case '<':
				entity = "&lt;";
				break;
			case '>':
				entity = "&gt;";
				break;
			case '\'':
				if (quote == '\'') {
					entity = "&apos;";
				}
				break;
			case '\"':
				if (quote == '\"') {
					entity = "&quot;";
				}
				break;
			case '&':
				entity = "&amp;";
				break;
			case '\t':
			case '\n':
			case '\r':
				// don't encode standard whitespace characters
				break;
			default:
				if ((c < 32) || shouldEncodeChar(c)) {
					entity = "&#" + (int) c + ";";
				}
				break;
		}
		if (entity != null) {
			if (block == null) {
				block = text.toCharArray();
			}
			buffer.append(block, last, i - last);
			buffer.append(entity);
			last = i + 1;
		}
	}
	if (last == 0) {
		return text;
	}
	if (last < size) {
		if (block == null) {
			block = text.toCharArray();
		}
		buffer.append(block, last, i - last);
	}
	String answer = buffer.toString();
	buffer.setLength(0);
	return answer;
}

They encode one Java char (a UTF-16 code unit) at a time rather than one Unicode code point at a time, so each half of a surrogate pair gets its own character reference.
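
One possible direction, as a rough standalone sketch rather than a patch: iterate with codePointAt and Character.charCount so that a surrogate pair is escaped as one reference. CodePointEscaper, escape, and shouldEncode are hypothetical names; shouldEncode only approximates what I assume shouldEncodeChar does (compare against the maximum allowed character when it is positive), and the maxAllowed value of 127 used for US-ASCII below is likewise an assumption.

```java
class CodePointEscaper {
    // Hypothetical stand-in for XMLWriter's shouldEncodeChar, taking an int
    // code point instead of a char. Assumed behavior, not copied from dom4j.
    static boolean shouldEncode(int codePoint, int maxAllowed) {
        return maxAllowed > 0 && codePoint > maxAllowed;
    }

    // Escapes element text one code point at a time.
    static String escape(String text, int maxAllowed) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // Advance by 2 for a surrogate pair, otherwise by 1.
            i += Character.charCount(cp);
            switch (cp) {
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '&':
                    out.append("&amp;");
                    break;
                case '\t':
                case '\n':
                case '\r':
                    // don't encode standard whitespace characters
                    out.append((char) cp);
                    break;
                default:
                    if (cp < 32 || shouldEncode(cp, maxAllowed)) {
                        out.append("&#").append(cp).append(';');
                    } else {
                        out.appendCodePoint(cp);
                    }
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F427 PENGUIN with an assumed maximum allowed character of 127
        System.out.println(escape("\ud83d\udc27", 127)); // &#128039;
    }
}
```

This sketch does not handle unpaired surrogates: codePointAt returns a lone surrogate unchanged, so malformed input would still produce an invalid reference and would need separate treatment.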

@FilipJirsak
Contributor

Fixed.


2 participants