There are certain Unicode characters which are prohibited by the XML spec. I've written the following method, which should strip these characters from a document.
If I test this using the following HTML input
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head></head><body><table><tbody><tr><td>Field Value</td><td>before 	  after</td></tr></tbody></table></body></html>
The XML entities representing illegal Unicode characters are removed, and the resulting document can be parsed by an XML parser. However, if I add
<td>before 	  after</td>
then the String returned by cleanHtml throws the following exception when parsed as XML:
org.xml.sax.SAXParseException; lineNumber: 10; columnNumber: 17; An invalid XML character (Unicode: 0xb) was found in the element content of the document.
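For reference, 0xB (vertical tab) falls outside the XML 1.0 `Char` production, which allows only #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. The sketch below is not the cleanHtml method from this issue (that code is not shown here); it is a minimal illustration, assuming the goal is to drop raw characters outside that range from an already-serialized string. The class and method names `XmlCharFilter`, `isValidXmlChar` and `stripInvalidXmlChars` are made up for the example.

```java
// Hypothetical helper, not part of jsoup: filters a string down to the
// code points permitted by the XML 1.0 "Char" production.
final class XmlCharFilter {

    /** Returns true if the code point is allowed in an XML 1.0 document. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Copies the input, skipping every code point XML 1.0 forbids. */
    static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // Iterate by code point so surrogate pairs are kept together;
            // an unpaired surrogate comes back as its own value and is dropped.
            int cp = input.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Running the serialized output through `XmlCharFilter.stripInvalidXmlChars(...)` before handing it to the XML parser removes a literal 0xB. It does not touch numeric character references such as `&#xB;` that are still in entity form, so those would need separate handling.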
lexamxu pushed a commit to lexamxu/jsoup that referenced this issue on May 8, 2021
I have tried to remove illegal XML characters in the html() method. Before returning the HTML string, it first checks the syntax and decides whether illegal characters should be removed.
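The commit's code is not shown in this thread, so the following is only a rough sketch of the idea as described, under the assumption that "checks the syntax" means looking at whether the output is meant to be XML or HTML. `SyntaxAwareSerializer`, `OutputSyntax` and `finish` are hypothetical names for illustration, not jsoup API.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing step, not jsoup code: strips characters that
// XML 1.0 forbids, but only when the caller asked for XML output.
final class SyntaxAwareSerializer {

    /** Stand-in for an output-syntax setting (assumed for this sketch). */
    enum OutputSyntax { HTML, XML }

    // Matches any code point outside the XML 1.0 "Char" production.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Returns the serialized markup, cleaned only for XML output. */
    static String finish(String serialized, OutputSyntax syntax) {
        if (syntax != OutputSyntax.XML) {
            return serialized; // HTML output is left untouched
        }
        return INVALID_XML_CHARS.matcher(serialized).replaceAll("");
    }
}
```

With this shape, `finish(markup, OutputSyntax.XML)` drops a raw 0xB while `finish(markup, OutputSyntax.HTML)` returns the string unchanged. Whether HTML output should also be cleaned is a design choice the comment above leaves open.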