Allow attributes valid in html when converting from JSoup to W3C Document #1647

jairamc · 2021-09-29T11:29:14Z

Consider the following html document:

<!DOCTYPE html>
<html>
<head></head>
<body style="color: red" " name">
  <p hành="1" hình="2">unicode attr names</p>
</body></html>

Using v1.14.2 and running the following code:

    import org.jsoup.Jsoup;
    import org.jsoup.helper.W3CDom;
    import org.w3c.dom.Document;

    public class Repro {
        public static void main(String[] args) {
            String html = "<!DOCTYPE html><html><head></head><body style=\"color: red\" \" name\"><p hành=\"1\" hình=\"2\">unicode attr names</p></body></html>";
            // Parse with jsoup, then convert the jsoup tree to a W3C DOM document
            org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html);
            Document w3Doc = W3CDom.convert(jsoupDoc);
            System.out.println(W3CDom.asString(w3Doc, W3CDom.OutputHtml()));
        }
    }

Results in:

<!DOCTYPE html SYSTEM "about:legacy-compat">
<html>
<head><META http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
<body name="" style="color: red">
  <p hnh="2">unicode attr names</p>
</body>
</html>

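This happens because W3CDom.java#L346 hard-codes the syntax to xml, so attribute names that are valid in html (such as the non-ASCII names above) are stripped or dropped during conversion. It can be easily fixed by checking the doctype of the output document and using that as the syntax.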
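
To illustrate the idea, here is a minimal, self-contained sketch of choosing the attribute-name rules from the doctype. It does not use jsoup and is not the library's implementation: the class `AttributeKeySketch`, the helpers `syntaxFor`, `validKey` and `copyAttributes`, and both regular expressions are assumptions made for this example (an ASCII-only rule for xml, and a more permissive rule for html that only rejects control characters, spaces, quotes, `/`, `=` and `>`).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AttributeKeySketch {
    enum Syntax { html, xml }

    // Hypothetical helper: pick the syntax from the output document's doctype name.
    static Syntax syntaxFor(String doctypeName) {
        return "html".equalsIgnoreCase(doctypeName) ? Syntax.html : Syntax.xml;
    }

    // Returns a usable attribute name for the given syntax, or null if nothing valid remains.
    static String validKey(String key, Syntax syntax) {
        if (syntax == Syntax.xml) {
            // Assumed conservative xml rule: keep ASCII name characters only.
            String cleaned = key.replaceAll("[^-a-zA-Z0-9_:.]", "");
            return cleaned.matches("[a-zA-Z_:][-a-zA-Z0-9_:.]*") ? cleaned : null;
        }
        // Assumed html rule: strip control chars, space, quotes, '/', '=' and '>'.
        String cleaned = key.replaceAll("[\\x00-\\x1f\\x7f-\\x9f \"'/=>]", "");
        return cleaned.isEmpty() ? null : cleaned;
    }

    // Copies attributes, sanitizing each name under the chosen syntax.
    // Names that sanitize to the same key overwrite each other.
    static Map<String, String> copyAttributes(Map<String, String> source, Syntax syntax) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : source.entrySet()) {
            String key = validKey(e.getKey(), syntax);
            if (key != null) out.put(key, e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("h\u00e0nh", "1"); // hành
        attrs.put("h\u00ecnh", "2"); // hình
        System.out.println(copyAttributes(attrs, Syntax.xml));
        System.out.println(copyAttributes(attrs, syntaxFor("html")));
    }
}
```

Under the assumed xml rule both names lose their non-ASCII letter and collapse to `hnh`, so only the last value survives, which matches the `hnh="2"` seen in the output above. Under the assumed html rule both attributes are kept unchanged.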


jhy · 2021-10-04T22:49:21Z

Hi -- this looks like a good change. I made a comment on the PR (#1648) on the implementation, just wanted to check you'd seen it.

jairamc · 2021-10-05T12:45:02Z

Hi @jhy. My sincere apologies. I have seen your response. I'm at a company conference this week, so I have been struggling to find time to work on your suggestions. I'll try to push an update this week; if not, next week for sure.

When parsing and converting an html document, the syntax was hard-coded to xml. This PR checks the document type of the output document and uses that to determine which attributes are valid.

Co-authored-by: jhy <jonathan@hedley.net>

Fixes #1647

jhy · 2021-10-06T11:02:07Z

Thanks! I have merged this with a couple tweaks. And certainly nothing to apologize for, I was just checking in.

jairamc mentioned this issue Sep 29, 2021

Allow attributes valid in html when converting #1648

Merged

jhy linked a pull request Sep 30, 2021 that will close this issue

Allow attributes valid in html when converting #1648

Merged

jhy closed this as completed in #1648 Oct 6, 2021

jhy added the improvement label Oct 6, 2021

jhy added this to the 1.15.1 milestone Oct 6, 2021
