Native XPath support in jsoup #1629

Merged
merged 5 commits into from
Sep 13, 2021

Conversation

jhy
Owner

@jhy jhy commented Sep 8, 2021

This is a first draft of adding native XPath support to jsoup. It uses the Java-provided XPath parser and evaluator, so it should fully support all valid XPath expressions.

As a starting point, I've added the method:

Elements el.selectXpath(String expression)

For example:

String html = "<body><div><p>One</div><div><p>Two</div><div>Three</div>";
Document doc = Jsoup.parse(html);

Elements els = doc.selectXpath("//div/p");
assertEquals(2, els.size());
assertEquals("One", els.get(0).text());
assertEquals("Two", els.get(1).text());

I'm planning on bringing this in as a beta feature in the next release (1.14.3 -- I'll add a few more tests first) and then hopefully finalizing it in a release shortly after that. I'm looking for any feedback on how to make this feature most useful and feel native to jsoup.

Particularly:

  • does the name selectXpath make sense? (I had also thought about just xpath() or selectX.)
  • should we have a selectXpathFirst method, like selectFirst?
  • this current implementation is namespace aware. From looking at common XPath questions, it seems that namespaces can be a source of confusion and query difficulty. Should we make it default to namespace unaware, and add another method (selectXpathNS or similar) with it on? Or is it best to leave it as-is?
  • Does this match what you were expecting when you heard "jsoup has xpath support now"? What's missing? (Please provide suggested API signatures and how you'd use them)
  • Does it work for your data and queries? Can you break it?
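
To illustrate the namespace question above, here is a minimal sketch that uses only the JDK's own DOM parser and javax.xml.xpath evaluator, not jsoup's selectXpath. The class name, method name, and sample markup are hypothetical. It shows how the same unprefixed query behaves against a namespace-aware and a namespace-unaware parse of XHTML that declares a default namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceAwarenessSketch {
    // Hypothetical sample input: XHTML with a default namespace declaration.
    static final String XHTML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<div><p>One</p></div><div><p>Two</p></div><div>Three</div>"
        + "</body></html>";

    // Parses XHTML with the given namespace setting and counts the nodes matched by xpath.
    static int count(String xpath, boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XHTML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Namespace aware: unprefixed names only match elements in no namespace.
        System.out.println(count("//div/p", true));
        // Namespace unaware: the same query matches by plain element name.
        System.out.println(count("//div/p", false));
        // Namespace aware workaround: compare local names explicitly.
        System.out.println(count("//*[local-name()='div']/*[local-name()='p']", true));
    }
}
```

With a namespace-aware parse, the short query only starts matching once the caller binds a prefix or falls back to local-name(), which is the kind of query difficulty the bullet refers to.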

@jhy jhy changed the title Native xpath support in jsoup Native XPath support in jsoup Sep 8, 2021

(I thought that widening the input types would be compatible but apparently not, so reintroducing those original methods)
@jhy jhy added the feature label Sep 8, 2021
@jhy jhy added this to the 1.14.3 milestone Sep 8, 2021
@jhy jhy linked an issue Sep 8, 2021 that may be closed by this pull request
@jhy jhy mentioned this pull request Sep 8, 2021
jhy added 3 commits September 12, 2021 14:35

@jhy jhy merged commit c283a8d into master Sep 13, 2021
@jhy jhy deleted the xpath branch September 13, 2021 04:28
@jhy jhy self-assigned this Sep 13, 2021
@manticore-projects

manticore-projects commented Apr 15, 2022

Greetings!
First and most important: Thank you so much for your work and providing those great tools! Much appreciated.

Now for some of the requested feedback, from practice. While parsing an XHTML file in order to insert a TOC, I got everything working eventually, but it was not straightforward and took a lot of trial and error.

The main confusion is about XPath. With my limited knowledge, I would have expected this to work (it works with the jEdit XML/XSLT plugin):

//xhtml:a[not(@href)]/@name

But in jsoup, I had to use:

//*[local-name()='a' and not(@href) and @name]

I somewhat understand why I have to work around the namespaces, but I am totally lost as to why this did not work:

//*[local-name()='a' and not(@href)]/@name

Xsoup did not work at all, both because of the namespace and because I was looking for elements without the href attribute.
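
One possible explanation for the failing /@name query, which this thread does not confirm: selectXpath returns Elements, while a path that ends in /@name selects attribute nodes rather than elements. The sketch below uses only the JDK's javax.xml.xpath API (not jsoup) on hypothetical sample markup to show the node types each query yields, and how to read the attribute from the selected elements instead. All class and method names are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AttributeResultSketch {
    // Hypothetical sample input: two named anchors and one link.
    static final String XML =
        "<body><a name=\"intro\">Intro</a><a href=\"#intro\">Back</a><a name=\"end\"/></body>";

    // Evaluates xpath against XML and returns the raw node set.
    static NodeList select(String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts how many result nodes have the given DOM node type.
    static int countOfType(String xpath, short nodeType) {
        NodeList nodes = select(xpath);
        int count = 0;
        for (int i = 0; i < nodes.getLength(); i++) {
            if (nodes.item(i).getNodeType() == nodeType) count++;
        }
        return count;
    }

    // Selects the anchor elements, then reads the name attribute from each one.
    static List<String> anchorNames() {
        NodeList anchors = select("//a[not(@href) and @name]");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            names.add(((Element) anchors.item(i)).getAttribute("name"));
        }
        return names;
    }

    public static void main(String[] args) {
        // The /@name step yields attribute nodes, not elements.
        System.out.println(countOfType("//a[not(@href)]/@name", Node.ATTRIBUTE_NODE));
        System.out.println(countOfType("//a[not(@href)]/@name", Node.ELEMENT_NODE));
        System.out.println(anchorNames());
    }
}
```

If that explanation holds, an API that only returns elements has nothing to hand back for the /@name form, which would match the workaround of moving @name into the predicate.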

@jhy
Owner Author

jhy commented Jul 4, 2022

@manticore-projects thanks for the feedback. Can you take a look at the current release (1.15.2; you can use try.jsoup) and see if your issues are fixed? Otherwise, could you open a new issue and include sample HTML / XML and the XPath query? I have disabled namespace awareness in the XPath evaluator, which will simplify things.

@GitHubDaniel

Works even better than xsoup. Thanks a lot. However, the only problem I experienced was "Could not find function: lower-case". That's not a problem for me, since [contains(attr, val)] does ignore case in jsoup.

It's technically 2 errors which even out for me ;-)
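
For context on the lower-case error: the XPath evaluator bundled with the JDK implements XPath 1.0, and lower-case() is an XPath 2.0 function. A common XPath 1.0 workaround is translate(). The sketch below uses only the JDK's javax.xml.xpath API (not jsoup); the class name, method names, and sample data are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LowerCaseSketch {
    // Hypothetical sample input with mixed-case attribute values.
    static final String XML =
        "<root><item kind=\"News\"/><item kind=\"NEWS\"/><item kind=\"sport\"/></root>";

    // XPath 2.0 style query; not expected to work with the JDK's XPath 1.0 evaluator.
    static final String WITH_LOWER_CASE = "//item[lower-case(@kind)='news']";

    // XPath 1.0 workaround: map ASCII upper-case letters to lower case before comparing.
    static final String WITH_TRANSLATE =
        "//item[translate(@kind,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='news']";

    static Document parse() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the number of matched nodes, or -1 if the evaluator rejects the expression.
    static int countOrMinusOne(String xpath) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, parse(), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(countOrMinusOne(WITH_LOWER_CASE));
        System.out.println(countOrMinusOne(WITH_TRANSLATE));
    }
}
```

Note that translate() with these alphabets only folds ASCII letters, so it is a narrower tool than a true lower-case().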

Successfully merging this pull request may close these issues.

Support xpath 2.0 or greater
3 participants