JSoup Document.toString() does not generate correct XML-Output #1556

Closed
Wallenstein61 opened this issue May 27, 2021 · 3 comments

@Wallenstein61

We tried to configure JSoup to parse and re-serialize XML. However, JSoup does not seem to generate valid output from XML that contains the escaped entity &#x1b;

Find below a demonstration of the problem.

package at.ac.uibk.jsoup.tests;

import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.jsoup.Jsoup;
import org.jsoup.parser.Parser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class JSoupTest {

	final static String originalXML = "<?xml version=\"1.1\" encoding=\"UTF-8\"?>\r\n"
			+ "<SomeText>This is an escaped escape-character: &#x1b;</SomeText>";

	public static void main(String[] args)
			throws SAXException, IOException, ParserConfigurationException, TransformerException {

		parseXMLWithJSoup();

		parseXMLInternal();

	}

	private static void parseXMLWithJSoup() {
		System.out.println();
		System.out.println("------------------- incorrect unparsing with JSOUP ------------------- ");
		System.out.println();
		System.out.println("original XML with escaped escape-character:\n  " + originalXML);
		org.jsoup.nodes.Document document = Jsoup.parse(originalXML, "", Parser.xmlParser());
		document.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml).indentAmount(2)
				.prettyPrint(true);

		String returnedXMLFromJSoupParser = document.toString();
		System.out.println();
		System.out.println("returned XMLFromJSoupParser No escaped escape character: \n  " + returnedXMLFromJSoupParser);

		org.jsoup.nodes.Document document2 = Jsoup.parse(returnedXMLFromJSoupParser, "", Parser.xmlParser());
		document2.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml).indentAmount(2)
				.prettyPrint(true);

		// Serialize the re-parsed document (document2), not the first one.
		String returned2ndXMLFromJSoupParser = document2.toString();
		System.out.println();
		System.out.println("returned reparsed result XMLFromJSoupParser: " + returned2ndXMLFromJSoupParser);
	}

	public static void parseXMLInternal()
			throws SAXException, IOException, ParserConfigurationException, TransformerException {
		System.out.println();
		System.out.println("----------------------- correct unparsing ---------------------- ");
		System.out.println();
		System.out.println("original XML with escaped escape-character:\n  " + originalXML);
		DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();

		InputSource input = new InputSource(new StringReader(originalXML));
		Document doc = builder.parse(input);

		TransformerFactory tf = TransformerFactory.newInstance();
		Transformer trans = tf.newTransformer();
		StringWriter sw = new StringWriter();
		trans.transform(new DOMSource(doc), new StreamResult(sw));

		String returnedXMLFromSaxParser = sw.toString();

		System.out.println();
		System.out.println("returned XML From SAX Parser/Transformer: \n" + returnedXMLFromSaxParser);

	}
}

Best regards
Michael

@jhy
Owner

jhy commented Jul 6, 2021

What's not correct about it though? Char#27 ESC is valid ASCII and so shouldn't need to be escaped. Right?

@jhy jhy added the needs-more-info More information is needed from the reporter to progress the issue label Jul 6, 2021
@Wallenstein61
Author

Wallenstein61 commented Jul 7, 2021

Hello jhy,

I'm not sure about HTML, but in XML a literal (binary) ESC character, #x1B, is not a valid character (see https://www.w3.org/TR/xml/#charsets).

JSoup parses <?xml version="1.1" encoding="UTF-8"?><SomeText>This is an escaped escape-character: &#x1b;</SomeText> correctly.

However, serializing the parsed result returns <?xml version="1.1" encoding="UTF-8"?><SomeText>This is an escaped escape-character: ???</SomeText>,
where ??? stands for a literal (binary) ESC character.

Other parsers (e.g. behind a web service) may refuse to parse this again :-(
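
To illustrate the point about other parsers, here is a minimal sketch that uses only the JDK's built-in XML parser (the same one used in the demonstration above). The class name EscWellFormednessCheck and the helper isWellFormed are hypothetical names chosen for this example. It feeds the parser one document with the character reference and one with a literal ESC character; the expectation, under the XML 1.1 rules, is that only the first is accepted.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of jsoup.
final class EscWellFormednessCheck {

    static final String PROLOG = "<?xml version=\"1.1\" encoding=\"UTF-8\"?>";

    /** Returns true if the JDK's default XML parser accepts the input. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler keeps the parser from printing to stderr;
            // fatal errors are still thrown as SAXException.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String escaped = PROLOG + "<SomeText>ESC: &#x1b;</SomeText>";
        String literal = PROLOG + "<SomeText>ESC: \u001b</SomeText>";
        System.out.println("character reference accepted: " + isWellFormed(escaped));
        System.out.println("literal ESC accepted: " + isWellFormed(literal));
    }
}
```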

It would therefore be nice if JSoup's Document.toString() returned valid XML. Simply escape everything below #x20 :-) (except tab, LF, and CR, which XML allows as literals).

Michael
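
The escaping rule suggested in this comment could look roughly like the sketch below. This is an illustration of the idea only, not jsoup's actual fix; the class name ControlCharEscaper and the method escapeControlChars are hypothetical. It assumes XML 1.1 output, where control characters may appear as character references. It does not special-case NUL (#x0), which cannot be represented in XML even as a reference, and it does not escape markup characters such as < or &.

```java
// Hypothetical helper for illustration; not jsoup's implementation.
final class ControlCharEscaper {

    /**
     * Replaces every character below #x20, except tab, LF, and CR,
     * with a hexadecimal numeric character reference such as &#x1b;.
     */
    static String escapeControlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean allowedWhitespace = c == '\t' || c == '\n' || c == '\r';
            if (c < 0x20 && !allowedWhitespace) {
                out.append("&#x").append(Integer.toHexString(c)).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "This is an escape-character: \u001b";
        System.out.println(escapeControlChars(raw));
    }
}
```

A serializer would apply a rule like this to text and attribute values while writing output, after its normal escaping of markup characters.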

@jhy jhy closed this as completed in 2a4c9de Aug 12, 2021
@jhy jhy self-assigned this Aug 12, 2021
@jhy jhy added bug Confirmed bug that we should fix and removed needs-more-info More information is needed from the reporter to progress the issue labels Aug 12, 2021
@jhy jhy added this to the 1.14.2 milestone Aug 12, 2021
@jhy
Owner

jhy commented Aug 12, 2021

Thanks, fixed! I implemented the same escapes for both XML, where they are required, and HTML, where they are optional; I think the output will be easier to read and less surprising with these characters escaped.
