Allow attributes valid in html when converting from JSoup to W3C Document #1647

Closed
jairamc opened this issue Sep 29, 2021 · 3 comments · Fixed by #1648
@jairamc
Contributor

jairamc commented Sep 29, 2021

Consider the following html document:

<!DOCTYPE html>
<html>
<head></head>
<body style="color: red" " name">
  <p hành="1" hình="2">unicode attr names</p>
</body></html>

Using v1.14.2 and running the following code:

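    import org.jsoup.Jsoup;
    import org.jsoup.helper.W3CDom;
    import org.w3c.dom.Document;

    // Repro is a placeholder class name so the snippet compiles on its own.
    public class Repro {
        public static void main(String[] args) {
            String html = "<!DOCTYPE html><html><head></head><body style=\"color: red\" \" name\"><p hành=\"1\" hình=\"2\">unicode attr names</p></body></html>";
            // Parse with jsoup, convert to a W3C DOM, then serialize it as HTML.
            org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html);
            Document w3Doc = W3CDom.convert(jsoupDoc);
            System.out.println(W3CDom.asString(w3Doc, W3CDom.OutputHtml()));
        }
    }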

Results in:

<!DOCTYPE html SYSTEM "about:legacy-compat">
<html>
<head><META http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
<body name="" style="color: red">
  <p hnh="2">unicode attr names</p>
</body>
</html>

This happens because W3CDom.java#L346 hard-codes the syntax to xml. It can be easily fixed by checking the doctype of the output document and using that as the syntax.
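
A minimal sketch of the proposed idea, not jsoup's actual code: pick the attribute-name rule from the doctype, then strip only the characters that rule forbids. The class `AttrNameSketch`, the `Syntax` enum, the methods `syntaxFor` and `sanitize`, and both regular expressions are assumptions made for illustration. The non-ASCII letters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

public class AttrNameSketch {
    public enum Syntax { HTML, XML }

    // Assumed XML-style rule: remove everything outside a conservative ASCII name set.
    private static final Pattern XML_INVALID = Pattern.compile("[^-a-zA-Z0-9_:.]");
    // Assumed HTML-style rule: remove only control characters, space, quotes, '/' and '='.
    private static final Pattern HTML_INVALID =
            Pattern.compile("[\\x00-\\x1f\\x7f-\\x9f \"'/=]");

    // Hypothetical helper: a doctype named "html" selects the HTML rule, anything else XML.
    public static Syntax syntaxFor(String doctypeName) {
        return "html".equalsIgnoreCase(doctypeName) ? Syntax.HTML : Syntax.XML;
    }

    // Returns the cleaned attribute name, or null when nothing valid is left.
    public static String sanitize(String name, Syntax syntax) {
        Pattern invalid = (syntax == Syntax.HTML) ? HTML_INVALID : XML_INVALID;
        String cleaned = invalid.matcher(name).replaceAll("");
        return cleaned.isEmpty() ? null : cleaned;
    }

    public static void main(String[] args) {
        // The four questionable attribute names from the report above.
        String[] names = { "h\u00e0nh", "h\u00ecnh", "\"", "name\"" };
        for (Syntax syntax : Syntax.values()) {
            for (String name : names) {
                System.out.println(syntax + ": [" + name + "] -> " + sanitize(name, syntax));
            }
        }
    }
}
```

Under the XML-style rule in this sketch, both accented names reduce to `hnh`, which matches the collision visible in the reported output. Under the HTML-style rule they are kept unchanged, and only the stray quote is dropped.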

@jhy
Owner

jhy commented Oct 4, 2021

Hi -- this looks like a good change. I made a comment on the PR (#1648) on the implementation, just wanted to check you'd seen it.

@jairamc
Contributor Author

jairamc commented Oct 5, 2021

Hi @jhy. My sincere apologies. I have seen your response. I'm at a company conference this week, so I've been struggling to find time to work on your suggestions. I'll try to push an update this week; if not, next week for sure.

jhy closed this as completed in #1648 on Oct 6, 2021
jhy pushed a commit that referenced this issue on Oct 6, 2021
When parsing and converting an html document, the syntax was hard-coded to xml. This PR checks the document type of the output document and uses that to determine which attributes are valid.

Co-authored-by: jhy <jonathan@hedley.net>

Fixes #1647
jhy added the improvement label on Oct 6, 2021
jhy added this to the 1.15.1 milestone on Oct 6, 2021
@jhy
Owner

jhy commented Oct 6, 2021

Thanks! I have merged this with a couple tweaks. And certainly nothing to apologize for, I was just checking in.
