Invalid parsing of processing instructions #770

Closed
chw-1 opened this issue Oct 12, 2016 · 4 comments
Labels: bug (Confirmed bug that we should fix), fixed

chw-1 commented Oct 12, 2016

Hello,

In version 1.9.2, processing instructions are no longer parsed correctly.
Here is sample code that reproduces the issue.

package jsoupbug;

import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

public class JsoupBug {

    private static final String XML = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<?myProcessingInstruction My Processing instruction.?>";

    public static void main(String[] args) {
        Document document = Jsoup.parse(XML, "", Parser.xmlParser());
        document.outputSettings().prettyPrint(false);
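        // With the XML parser, child 0 is the XML declaration, child 1 is the
        // newline text node, and child 2 is the custom processing instruction.
        List<Node> nodes = document.childNodes();
        Node node = nodes.get(2);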
        String outerHtml = node.outerHtml();
        System.out.println(outerHtml);
    }

}

If I understand the spec correctly (https://www.w3.org/TR/REC-xml/#sec-pi), spaces are valid characters in processing instructions, but jsoup mangles them.

With version 1.9.2 this prints:
<?myprocessingInstruction my="" processing="" instruction.=""?>
However, in 1.9.1 the behavior is what I would expect:
<?myProcessingInstruction My Processing instruction.?>
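To make the spec reading above concrete, here is a minimal sketch that splits a processing instruction into its target and its opaque data. It is not jsoup code: the class `PiSplitSketch` and its `split` method are hypothetical names, and the target pattern is a simplification (the XML grammar restricts the target to a Name, which this regex does not fully check).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: treats everything after the
// target as one opaque string, in the spirit of the XML 1.0 PI production.
class PiSplitSketch {
    // "<?" target, then optionally whitespace plus data, then "?>".
    // Simplification: the target is any run of non-space, non-'?' characters.
    private static final Pattern PI =
        Pattern.compile("<\\?([^\\s?]+)(?:\\s+(.*?))?\\?>", Pattern.DOTALL);

    // Returns {target, data}; data is "" when the PI has no content.
    static String[] split(String pi) {
        Matcher m = PI.matcher(pi);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a processing instruction: " + pi);
        }
        String data = m.group(2) == null ? "" : m.group(2);
        return new String[] { m.group(1), data };
    }

    public static void main(String[] args) {
        String[] parts = split("<?myProcessingInstruction My Processing instruction.?>");
        System.out.println(parts[0]); // myProcessingInstruction
        System.out.println(parts[1]); // My Processing instruction.
    }
}
```

Under this reading the data keeps its spaces and letter case, which matches the 1.9.1 output shown above.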

@jinojohnd

Were you able to fix this issue? I can reproduce it on 1.12.1.

chw-1 (Author) commented Jan 6, 2021

@jinojohnd No, unfortunately I was not able to fix this.

LIKP0 commented May 8, 2021

I have reproduced the bug and would like to try fixing it.

jhy added the bug label (Confirmed bug that we should fix) on Jul 10, 2021

jhy (Owner) commented Jul 10, 2021

Reviewed and agree that this is a bug. The root cause is that we treat XML processing instructions as an odd hybrid of a comment and a tag with attributes. Sometimes we want the attributes (e.g. to understand encoding options), and other times it would be better to treat the instruction as an opaque string (as in this example).

I would suggest that the fix is to treat these as boolean attributes, so that they are emitted without the empty ="" component.
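The boolean-attribute idea can be sketched as follows. This is a standalone illustration under stated assumptions, not the change made in jsoup: `BooleanAttrSketch` and `render` are hypothetical names, attributes are modelled as a plain ordered map, and values are written without escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not jsoup's implementation: serializes a processing
// instruction and omits ="" for attributes that have no value.
class BooleanAttrSketch {
    static String render(String target, Map<String, String> attrs) {
        StringBuilder sb = new StringBuilder("<?").append(target);
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            sb.append(' ').append(e.getKey());
            String value = e.getValue();
            // Only emit ="value" when there is a value to emit.
            if (value != null && !value.isEmpty()) {
                sb.append("=\"").append(value).append('"'); // no escaping in this sketch
            }
        }
        return sb.append("?>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>(); // keeps insertion order
        attrs.put("My", "");
        attrs.put("Processing", "");
        attrs.put("instruction.", "");
        // prints <?myProcessingInstruction My Processing instruction.?>
        System.out.println(render("myProcessingInstruction", attrs));
    }
}
```

The sketch only round-trips the original text when names keep their case and are separated by single spaces; content with other whitespace or an explicitly empty value such as a="" would still need the opaque-string treatment.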

jhy added the fixed label on Jul 14, 2021
jhy closed this as completed in fce241b on Jul 14, 2021
jhy modified the milestone: 1.14.2, Jul 14, 2021