Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid XML characters in output with Syntax.xml #1743

Open
jorditpuig opened this issue Apr 7, 2022 · 1 comment
Open

Invalid XML characters in output with Syntax.xml #1743

jorditpuig opened this issue Apr 7, 2022 · 1 comment

Comments

@jorditpuig
Copy link

Background

I might be wrong but I believe there is a regression of issue #887. Most probably it was not really solved in the first place. Note that in #1556, which was solved with the same PR, the XML document is in version 1.1. XML 1.1 does permit escaped characters in the range [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] . XML 1.0 on the other hand, does not permit ASCII control characters like . An XML parser should raise an error (and most of them actually do) if those characters appear in a document with an XML declaration of version 1.0. This holds both if the character is included in binary (e.g. "\u000c") or as an escaped sequence (e.g. ""). Furthermore, if I a am not mistaken, XHTML is a reformulation of XML version 1.0 [sic] and not 1.1. Finally, the w3c validator raises an error in all the variants of XHTML if the input includes the escape sequence .

If this is considered out of the scope of jsoup that's fine but developers should know that the output might not be a valid XML document even when OutputSettings.syntax is Syntax.xml. Another option would be to discard those characters when Syntax.xml is used. To be honest it's not clear if Syntax.xml refers to XML version 1.0 or 1.1. In the code of Entities.java though there is a reference to the spec of XML 1.0. Maybe the name of the constant should have been Syntax.xhtml and not Syntax.xml. Nevertheless, if it were to choose between the two versions of XML most probably it is more coherent to go for 1.0 since jsoup is an (X)HTML parser not an XML one.

I can help prepare a PR if someone can decide what the expected behaviour should be.

Desired behaviour

@Test
public void invalidCharactersDiscardedInXml() {
    String invalid = "AAABBB\fCCC";
    Document doc = Jsoup.parseBodyFragment(invalid);
    OutputSettings settings = new OutputSettings().escapeMode(EscapeMode.xhtml).syntax(Syntax.xml).prettyPrint(false);
    String cleaned = doc.outputSettings(settings).toString();
    Assert.assertFalse(cleaned.contains("\f"));
    Assert.assertFalse(cleaned.contains(""));
    Assert.assertTrue(cleaned.matches("AAA\\ *BBB\\ *CCC"));
}

A subtle detail in the test above is that if prettyPrint is true  will be replaced with a space but other control characters forbidden in XML documents (e.g. ) will not. The underlying problem is not really solved if pretty printing is enabled.

Observed behaviour

@Test
public void invalidCharactersNotDiscardedInXml() {
    String invalid = "AAABBB\fCCC";
    Document doc = Jsoup.parseBodyFragment(invalid);
    OutputSettings settings = new OutputSettings().escapeMode(EscapeMode.xhtml).syntax(Syntax.xml).prettyPrint(false);
    String cleaned = doc.outputSettings(settings).body().html();
    Assert.assertEquals(cleaned, "AAABBBCCC");
}
@jhy
Copy link
Owner

jhy commented May 17, 2022

Thanks for the review. For max compatibility / least surprise within the existing solution, my inclination is that we should align this on XML 1.0 and the rules around it. The EscapeMode xhtml can be the owner of that rule. If we later want to add XML 1.1, we could perhaps make a new escape mode xml1.1 or similar.

The Syntax.xml is not supposed to be tied to HTML / XHTML but just be XML. And not include any of the e.g. HTML ellisions or specific serialization rules.

Potentially we could infer the escape mode to use by the setting of the syntax, or the meta-data in the document, with the ability to override.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants