
Remove illegal XML characters when converting HTML to XML #887

Closed
donalmurtagh opened this issue May 23, 2017 · 2 comments
Labels: duplicate (This is a duplicate issue or root-cause of another issue)

donalmurtagh commented May 23, 2017

Certain Unicode characters are prohibited by the XML spec. I've written the following method, which should strip these characters from a document:

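import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Parse the HTML, then serialize it using XML syntax rules.
String cleanHtml(String source) {
    Document document = Jsoup.parse(source);
    document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
    return document.html();
}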

If I test this using the following HTML input

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
</head>
<body>

<table>
    <tbody>
    <tr>
        <td>Field Value</td>
        <td>before &#9;&#10;&#12; after</td>
    </tr>
    </tbody>
</table>

</body>
</html>

The entities representing illegal Unicode characters are removed, and the resulting document can be parsed by an XML parser. However, if I add &#11; to the cell:

<td>before &#9;&#10;&#12;&#11; after</td>

then parsing the String returned by cleanHtml as XML throws the following exception:

org.xml.sax.SAXParseException; lineNumber: 10; columnNumber: 17; An invalid XML character (Unicode: 0xb) was found in the element content of the document.
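
For diagnosing input like this, it can help to list exactly which code points an XML 1.0 parser will reject. The sketch below is not part of jsoup; the class and method names (`XmlCharCheck`, `isLegalXmlChar`, `findIllegalXmlChars`) are hypothetical, and the only thing it relies on is the `Char` production from the XML 1.0 specification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a jsoup API.
class XmlCharCheck {
    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns, in order of appearance, the code points that fall outside the Char production.
    static List<Integer> findIllegalXmlChars(String text) {
        List<Integer> illegal = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            if (!isLegalXmlChar(cp)) {
                illegal.add(cp);
            }
        });
        return illegal;
    }
}
```

Under these rules, tab (0x9) and line feed (0xA) are legal XML characters, while form feed (0xC) and vertical tab (0xB) are not, so a check like this would flag only the last two of the four characters used in the test input.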

lexamxu pushed a commit to lexamxu/jsoup that referenced this issue May 8, 2021

lexamxu commented May 8, 2021

I have tried removing illegal XML characters in the html() method. Before returning the HTML string, it checks the output syntax and decides whether illegal characters should be removed.
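
The idea described above could look roughly like the following. This is a standalone sketch, not the code from the referenced commit: `XmlSafeOutput`, its `Syntax` enum (a stand-in for jsoup's `Document.OutputSettings.Syntax`), and the `finish` method are all hypothetical names, and the stripping is applied to an already serialized string rather than inside jsoup itself.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing step, not the actual jsoup change.
class XmlSafeOutput {
    // Stand-in for jsoup's Document.OutputSettings.Syntax.
    enum Syntax { html, xml }

    // Matches any code point outside the XML 1.0 Char production.
    private static final Pattern ILLEGAL_XML = Pattern.compile(
            "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    // Strips illegal characters only when the output is meant to be XML;
    // HTML output is returned unchanged.
    static String finish(String serialized, Syntax syntax) {
        if (syntax != Syntax.xml) {
            return serialized;
        }
        return ILLEGAL_XML.matcher(serialized).replaceAll("");
    }
}
```

Gating on the syntax keeps HTML output byte-for-byte as it was, which matters because characters such as form feed are acceptable in HTML but not in XML.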

jhy added the duplicate label Aug 12, 2021

jhy commented Aug 12, 2021

Thanks and sorry it took so long to get to this! Same issue as #1556, fixed
