Negative source range start/end positions for first text node #2106

Open
KennyWongPFPT opened this issue Jan 19, 2024 · 1 comment


Hello,

import org.jsoup.nodes.*;
import org.jsoup.parser.*;
import org.jsoup.select.*;

public class Test {
    public static void main(String[] args) {
        HtmlTreeBuilder treeBuilder = new HtmlTreeBuilder();
        Parser parser = new Parser(treeBuilder);
        parser.setTrackPosition(true);
        Document document = parser.parseInput("foo<p></p>bar<p></p><div><b>baz</b></div>", "");
        NodeTraversor.traverse((Node node, int depth) -> {
            if (node instanceof TextNode textNode) {
                Range sourceRange = textNode.sourceRange();
                System.out.printf("text=%s start=%d end=%d%n",
                    textNode.text(),
                    sourceRange.start().pos(),
                    sourceRange.end().pos());
            }
        }, document);
    }
}

We are seeing negative start/end positions for the source range of the first text node, foo. For example, using release 1.16.1:

java -cp ~/.m2/repository/org/jsoup/jsoup/1.16.1/jsoup-1.16.1.jar Test.java
text=foo start=-1 end=-1
text=bar start=10 end=13
text=baz start=28 end=31

Release 1.17.2 reports the correct end position, but the start is still -1:

java -cp ~/.m2/repository/org/jsoup/jsoup/1.17.2/jsoup-1.17.2.jar Test.java
text=foo start=-1 end=3
text=bar start=10 end=13
text=baz start=28 end=31
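
For comparison, here is a rough sketch of the offsets we would expect for each text node. It assumes the reported positions are zero-based character offsets into the input string, which is consistent with the bar and baz values above. ExpectedOffsets and find are hypothetical names used only for illustration; they are not part of jsoup, and the lookup is a plain String.indexOf.

```java
public class ExpectedOffsets {
    // Hypothetical helper, not part of jsoup: returns {start, end} for the first
    // occurrence of text in html, or {-1, -1} when the text is not present.
    // Assumes positions are zero-based character offsets with an exclusive end.
    static int[] find(String html, String text) {
        int start = html.indexOf(text);
        if (start < 0) {
            return new int[] {-1, -1};
        }
        return new int[] {start, start + text.length()};
    }

    public static void main(String[] args) {
        String html = "foo<p></p>bar<p></p><div><b>baz</b></div>";
        for (String text : new String[] {"foo", "bar", "baz"}) {
            int[] range = find(html, text);
            System.out.printf("text=%s start=%d end=%d%n", text, range[0], range[1]);
        }
    }
}
```

Under that assumption, foo would be expected to report start=0 end=3, while the bar and baz values already match what both releases print.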

MasterChiefNemo commented Feb 6, 2024

@KennyWongPFPT Is it possible that in Parser.java, the following might be causing the issue?

public static Document parseBodyFragment(String bodyHtml, String baseUri) {
    Document doc = Document.createShell(baseUri);
    Element body = doc.body();
    List<Node> nodeList = parseFragment(bodyHtml, body, baseUri);
    Node[] nodes = nodeList.toArray(new Node[0]); // the node list gets modified when re-parented

    for (int i = nodes.length - 1; i > 0; i--) {
        nodes[i].remove();
    }
    for (Node node : nodes) {
        body.appendChild(node);
    }
    return doc;
}

I'm not trying to take a wild stab in the dark, but the HTML string you're passing doesn't start with a tag, so the start position is potentially being set to -1. If a check were put in place for that case, I wonder whether it would rectify the issue.
