Remove illegal XML characters when converting HTML to XML #887

donalmurtagh · 2017-05-23T15:42:32Z

There are certain Unicode characters that are prohibited by the XML spec. I've written the following method, which should remove these characters from a document:

String cleanHtml(String source) {
    Document document = Jsoup.parse(source);
    document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
    return document.html();
}

If I test this using the following HTML input

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
</head>
<body>

<table>
    <tbody>
    <tr>
        <td>Field Value</td>
        <td>before &#9;&#10;&#12; after</td>
    </tr>
    </tbody>
</table>

</body>
</html>

The XML entities representing illegal Unicode characters are removed, and the resulting document can be parsed by an XML parser. However, if I add &#11; to the cell:

<td>before &#9;&#10;&#12;&#11; after</td>

then parsing the String returned by cleanHtml as XML throws the following exception:

org.xml.sax.SAXParseException; lineNumber: 10; columnNumber: 17; An invalid XML character (Unicode: 0xb) was found in the element content of the document.
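
The rejection itself can be reproduced without jsoup. The following is a minimal sketch, assuming the JDK's built-in XML parser; the class and method names (XmlCharCheck, parsesAsXml) are made up for illustration and are not part of jsoup. It only reports whether a string parses as XML, so it can be used to check the output of cleanHtml.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative helper (hypothetical name, not a jsoup class).
final class XmlCharCheck {

    /** Returns true if the JDK's XML parser accepts the given document text. */
    static boolean parsesAsXml(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps them off stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            // Raised for content the parser rejects, e.g. an invalid XML character.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not what is being tested.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling parsesAsXml with a cell that contains only a tab and a line feed should return true, while a cell that contains the raw character 0xB should return false, matching the exception above.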

lexamxu · 2021-05-08T12:37:42Z

I have tried removing illegal XML characters in the html() method. Before returning the HTML string, it first checks the output syntax and decides whether illegal characters should be removed.
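
To illustrate the idea (this is not the code from the commit, only a rough sketch under stated assumptions): the Syntax enum below is a hypothetical stand-in for the output syntax setting, and XmlCharFilter, isValidXmlChar and clean are invented names. The allowed ranges follow the Char production of the XML 1.0 spec.

```java
// Illustrative sketch (hypothetical names, not jsoup's implementation).
final class XmlCharFilter {

    /** Hypothetical stand-in for the output syntax setting. */
    enum Syntax { HTML, XML }

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops characters that XML 1.0 forbids, but only for XML output. */
    static String clean(String output, Syntax syntax) {
        if (syntax != Syntax.XML) {
            return output; // HTML output is returned unchanged
        }
        StringBuilder sb = new StringBuilder(output.length());
        // Iterate by code point so supplementary characters stay intact.
        output.codePoints()
              .filter(XmlCharFilter::isValidXmlChar)
              .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

Note that this only filters raw characters in the serialized string. A numeric reference such as &#11; written out literally would still be rejected by an XML parser and would need separate handling.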

jhy · 2021-08-12T09:14:56Z

Thanks, and sorry it took so long to get to this! This is the same issue as #1556, now fixed.

lexamxu pushed a commit to lexamxu/jsoup that referenced this issue May 8, 2021

fix issue jhy#887

7f5a784

lexamxu mentioned this issue May 8, 2021

Fix Issue887 #1532

Closed

jhy added the duplicate label (This is a duplicate issue or root-cause of another issue) Aug 12, 2021

jhy closed this as completed Aug 12, 2021

jorditpuig mentioned this issue Apr 7, 2022

Invalid XML characters in output with Syntax.xml #1743

Open
