XMLWriter does not escape supplementary unicode characters correctly #38

abenkovskii · 2018-01-31T12:08:41Z

When the maximum allowed character is set to a positive value, an XMLWriter is supposed to encode any character with a Unicode code point higher than the maximum allowed character as a numeric character reference. However, for supplementary Unicode characters the current implementation seems to generate a sequence of two invalid numeric character references instead of one valid reference.

To reproduce, run:

import org.dom4j.io.XMLWriter;
import org.dom4j.io.OutputFormat;
import org.dom4j.tree.DefaultElement;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

class XmlBugDemo {
	public static void main(String[] arg) throws IOException {
		ByteArrayOutputStream stream = new ByteArrayOutputStream();
		OutputFormat format = OutputFormat.createPrettyPrint();
		format.setEncoding("US-ASCII");
		XMLWriter writer = new XMLWriter(stream, format);

		// this string contains a single unicode code point:
		// U+1F427 PENGUIN
		String penguin = "\ud83d\udc27";
		DefaultElement foo = new DefaultElement("foo");
		foo.addText(penguin);

		writer.write(foo);
		
		System.out.println(stream.toString("US-ASCII"));
	}
}

Expected result:

<foo>&#128039;</foo>

Actual result:

<foo>&#55357;&#56359;</foo>

Notes:
The actual result isn't even well-formed XML:

$ xmllint bad.xml 
bad.xml:2: parser error : xmlParseCharRef: invalid xmlChar value 55357
<foo>&#55357;&#56359;</foo>
             ^
bad.xml:2: parser error : xmlParseCharRef: invalid xmlChar value 56359
<foo>&#55357;&#56359;</foo>
                     ^
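
For reference, here is a small sketch of how the two values in the actual result relate to the expected one. It only uses standard `java.lang` methods; the class name `SurrogatePairDemo` and the helper `combine` are made up for illustration and are not part of dom4j.

```java
public class SurrogatePairDemo {
    // Hypothetical helper: combines a UTF-16 surrogate pair into one code point by hand.
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        String penguin = "\ud83d\udc27"; // U+1F427 PENGUIN

        // The two UTF-16 code units, which is what the writer currently emits.
        System.out.println((int) penguin.charAt(0)); // 55357 (0xD83D, high surrogate)
        System.out.println((int) penguin.charAt(1)); // 56359 (0xDC27, low surrogate)

        // The single code point that should be emitted instead.
        System.out.println(penguin.codePointAt(0));                          // 128039
        System.out.println(combine(penguin.charAt(0), penguin.charAt(1)));   // 128039

        // The string holds two chars but only one code point.
        System.out.println(penguin.length());                                // 2
        System.out.println(penguin.codePointCount(0, penguin.length()));     // 1
    }
}
```

So the expected reference `&#128039;` is the code point of the pair, while `&#55357;&#56359;` are its two halves written out separately.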


abenkovskii · 2018-01-31T12:28:09Z

Here is why

<foo>&#55357;&#56359;</foo>

is not a well-formed XML document.

Section 4.1 of the XML specification states:

Well-formedness constraint: Legal Character

Characters referred to using character references MUST match the production for Char.

Char is defined in section 2.2 of the XML specification:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Both 55357 (0xD83D) and 56359 (0xDC27) are in surrogate blocks.
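
The Char production quoted above can be written as a simple predicate. This is only a sketch to check the numbers from this issue; `XmlCharCheck` and `isXmlChar` are hypothetical names, not dom4j API.

```java
public class XmlCharCheck {
    // Hypothetical helper: true if the code point matches the XML 1.0 Char production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXmlChar(55357));  // false: 0xD83D is a high surrogate
        System.out.println(isXmlChar(56359));  // false: 0xDC27 is a low surrogate
        System.out.println(isXmlChar(128039)); // true: U+1F427 is in [#x10000-#x10FFFF]
    }
}
```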

abenkovskii · 2018-01-31T12:34:53Z

I think these two functions are the culprit:

dom4j/src/main/java/org/dom4j/io/XMLWriter.java

Lines 1626 to 1699 in 9b14152

    
    protected String escapeElementEntities(String text) {
        char[] block = null;
        int i;
        int last = 0;
        int size = text.length();
        for (i = 0; i < size; i++) {
            String entity = null;
            char c = text.charAt(i);
            switch (c) {
                case '<':
                    entity = "&lt;";
                    break;
                case '>':
                    entity = "&gt;";
                    break;
                case '&':
                    entity = "&amp;";
                    break;
                case '\t':
                case '\n':
                case '\r':
                    // don't encode standard whitespace characters
                    if (preserve) {
                        entity = String.valueOf(c);
                    }
                    break;
                default:
                    if ((c < 32) || shouldEncodeChar(c)) {
                        entity = "&#" + (int) c + ";";
                    }
                    break;
            }
            if (entity != null) {
                if (block == null) {
                    block = text.toCharArray();
                }
                buffer.append(block, last, i - last);
                buffer.append(entity);
                last = i + 1;
            }
        }
        if (last == 0) {
            return text;
        }
        if (last < size) {
            if (block == null) {
                block = text.toCharArray();
            }
            buffer.append(block, last, i - last);
        }
        String answer = buffer.toString();
        buffer.setLength(0);
        return answer;
    }

dom4j/src/main/java/org/dom4j/io/XMLWriter.java

Lines 1718 to 1805 in 9b14152

    
    protected String escapeAttributeEntities(String text) {
        char quote = format.getAttributeQuoteCharacter();
        char[] block = null;
        int i;
        int last = 0;
        int size = text.length();
        for (i = 0; i < size; i++) {
            String entity = null;
            char c = text.charAt(i);
            switch (c) {
                case '<':
                    entity = "&lt;";
                    break;
                case '>':
                    entity = "&gt;";
                    break;
                case '\'':
                    if (quote == '\'') {
                        entity = "&apos;";
                    }
                    break;
                case '\"':
                    if (quote == '\"') {
                        entity = "&quot;";
                    }
                    break;
                case '&':
                    entity = "&amp;";
                    break;
                case '\t':
                case '\n':
                case '\r':
                    // don't encode standard whitespace characters
                    break;
                default:
                    if ((c < 32) || shouldEncodeChar(c)) {
                        entity = "&#" + (int) c + ";";
                    }
                    break;
            }
            if (entity != null) {
                if (block == null) {
                    block = text.toCharArray();
                }
                buffer.append(block, last, i - last);
                buffer.append(entity);
                last = i + 1;
            }
        }
        if (last == 0) {
            return text;
        }
        if (last < size) {
            if (block == null) {
                block = text.toCharArray();
            }
            buffer.append(block, last, i - last);
        }
        String answer = buffer.toString();
        buffer.setLength(0);
        return answer;
    }

They encode one Java char (a UTF-16 code unit) at a time rather than one Unicode code point at a time, so each half of a surrogate pair gets its own numeric character reference.
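
A rough sketch of what iterating by code point could look like is below. This is not a patch for XMLWriter and not necessarily how it should be fixed there: `CodePointEscaper`, `escape` and the `maxAllowedCharacter` parameter are hypothetical stand-ins (the parameter plays the role of the check done by `shouldEncodeChar`), and the sketch ignores the `preserve` flag and the shared `buffer`.

```java
public class CodePointEscaper {
    // Hypothetical stand-alone version of the element-text escaping loop,
    // stepping through the string one code point at a time.
    static String escape(String text, int maxAllowedCharacter) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // reads a full surrogate pair if present
            i += Character.charCount(cp);   // advance by 1 or 2 chars
            switch (cp) {
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '&':
                    out.append("&amp;");
                    break;
                case '\t':
                case '\n':
                case '\r':
                    // standard whitespace is written as-is
                    out.appendCodePoint(cp);
                    break;
                default:
                    // same condition as the quoted code, applied to the code point
                    if (cp < 32 || (maxAllowedCharacter > 0 && cp > maxAllowedCharacter)) {
                        out.append("&#").append(cp).append(';');
                    } else {
                        out.appendCodePoint(cp);
                    }
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 127 is used here as the limit for an ASCII-only output (an assumption).
        System.out.println(escape("\ud83d\udc27", 127)); // &#128039;
        System.out.println(escape("a<b & c", 127));      // a&lt;b &amp; c
    }
}
```

Note that an unpaired surrogate in the input would still come out as an invalid reference with this sketch, since `codePointAt` returns the lone surrogate value in that case; a real fix would have to decide what to do with such input.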

FilipJirsak · 2018-07-01T13:37:55Z

Fixed.

(cherry picked from commit 75e59b1)

(cherry picked from commit b408f43)

FilipJirsak self-assigned this Jan 31, 2018

FilipJirsak added the bug label Jan 31, 2018

FilipJirsak modified the milestones: 2.0.3, 2.1.1 Jan 31, 2018

FilipJirsak added a commit that referenced this issue Jul 1, 2018

#38 Support for supplementary unicode characters in XMLWriter.

75e59b1

FilipJirsak closed this as completed Jul 1, 2018

FilipJirsak added a commit that referenced this issue Jul 1, 2018

Fix bug in encoding whitespaces introduced with bugfix of #38.

b408f43

FilipJirsak added a commit to FilipJirsak/dom4j that referenced this issue Mar 12, 2020

dom4j#38 Support for supplementary unicode characters in XMLWriter.

f14edcd

FilipJirsak added a commit to FilipJirsak/dom4j that referenced this issue Mar 12, 2020

Fix bug in encoding whitespaces introduced with bugfix of dom4j#38.

3852812

FilipJirsak added a commit that referenced this issue Apr 11, 2020

#38 Support for supplementary unicode characters in XMLWriter.

fc48f71

(cherry picked from commit 75e59b1)

FilipJirsak added a commit that referenced this issue Apr 11, 2020

Fix bug in encoding whitespaces introduced with bugfix of #38.

083c0e6

(cherry picked from commit b408f43)

dependabot bot mentioned this issue Mar 13, 2021

Bump dom4j from 2.0.0 to 2.0.3 andyx511/keepLearning#1

Open
