Invalid XML characters in output with Syntax.xml #1743

jorditpuig · 2022-04-07T16:50:26Z

Background

I might be wrong but I believe there is a regression of issue #887. Most probably it was not really solved in the first place. Note that in #1556, which was solved with the same PR, the XML document is in version 1.1. XML 1.1 does permit escaped characters in the range [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] . XML 1.0 on the other hand, does not permit ASCII control characters like . An XML parser should raise an error (and most of them actually do) if those characters appear in a document with an XML declaration of version 1.0. This holds both if the character is included in binary (e.g. "\u000c") or as an escaped sequence (e.g. ""). Furthermore, if I a am not mistaken, XHTML is a reformulation of XML version 1.0 [sic] and not 1.1. Finally, the w3c validator raises an error in all the variants of XHTML if the input includes the escape sequence .

If this is considered out of the scope of jsoup that's fine but developers should know that the output might not be a valid XML document even when OutputSettings.syntax is Syntax.xml. Another option would be to discard those characters when Syntax.xml is used. To be honest it's not clear if Syntax.xml refers to XML version 1.0 or 1.1. In the code of Entities.java though there is a reference to the spec of XML 1.0. Maybe the name of the constant should have been Syntax.xhtml and not Syntax.xml. Nevertheless, if it were to choose between the two versions of XML most probably it is more coherent to go for 1.0 since jsoup is an (X)HTML parser not an XML one.

I can help prepare a PR if someone can decide what the expected behaviour should be.

Desired behaviour

@Test
public void invalidCharactersDiscardedInXml() {
    String invalid = "AAA&#xc;BBB\fCCC";
    Document doc = Jsoup.parseBodyFragment(invalid);
    OutputSettings settings = new OutputSettings().escapeMode(EscapeMode.xhtml).syntax(Syntax.xml).prettyPrint(false);
    String cleaned = doc.outputSettings(settings).toString();
    Assert.assertFalse(cleaned.contains("\f"));
    Assert.assertFalse(cleaned.contains("&#xc;"));
    Assert.assertTrue(cleaned.matches("AAA\\ *BBB\\ *CCC"));
}

A subtle detail in the test above is that if prettyPrint is true  will be replaced with a space but other control characters forbidden in XML documents (e.g. ) will not. The underlying problem is not really solved if pretty printing is enabled.

Observed behaviour

@Test
public void invalidCharactersNotDiscardedInXml() {
    String invalid = "AAA&#xc;BBB\fCCC";
    Document doc = Jsoup.parseBodyFragment(invalid);
    OutputSettings settings = new OutputSettings().escapeMode(EscapeMode.xhtml).syntax(Syntax.xml).prettyPrint(false);
    String cleaned = doc.outputSettings(settings).body().html();
    Assert.assertEquals(cleaned, "AAA&#xc;BBB&#xc;CCC");
}

The text was updated successfully, but these errors were encountered:

jhy · 2022-05-17T08:05:49Z

Thanks for the review. For max compatibility / least surprise within the existing solution, my inclination is that we should align this on XML 1.0 and the rules around it. The EscapeMode xhtml can be the owner of that rule. If we later want to add XML 1.1, we could perhaps make a new escape mode xml1.1 or similar.

The Syntax.xml is not supposed to be tied to HTML / XHTML but just be XML. And not include any of the e.g. HTML ellisions or specific serialization rules.

Potentially we could infer the escape mode to use by the setting of the syntax, or the meta-data in the document, with the ability to override.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Invalid XML characters in output with Syntax.xml #1743

Invalid XML characters in output with Syntax.xml #1743

jorditpuig commented Apr 7, 2022

jhy commented May 17, 2022

Invalid XML characters in output with Syntax.xml #1743

Invalid XML characters in output with Syntax.xml #1743

Comments

jorditpuig commented Apr 7, 2022

Background

Desired behaviour

Observed behaviour

jhy commented May 17, 2022