CSS identifier escapes are not supported #838

Closed
hoogenbj opened this issue Feb 23, 2017 · 5 comments

@hoogenbj

Hi,
when trying to do a select on a document using an id containing a hyphen, I get the following error:
Exception in thread "main" org.jsoup.select.Selector$SelectorParseException: Could not parse query '#ProductSummary-/productsummary/legalEntityId': unexpected token at '/productsummary/legalEntityId'
at org.jsoup.select.QueryParser.findElements(QueryParser.java:198)
at org.jsoup.select.QueryParser.parse(QueryParser.java:65)
at org.jsoup.select.QueryParser.parse(QueryParser.java:39)
at org.jsoup.select.Selector.<init>(Selector.java:86)
at org.jsoup.select.Selector.select(Selector.java:108)
at org.jsoup.nodes.Element.select(Element.java:296)
This seems to have been fixed before: see issue #15.
I am using version 1.10.2.
Using getElementById() works fine, though.

@krystiangorecki
Contributor

krystiangorecki commented Feb 23, 2017

It's not about the hyphen; the slashes cause this exception. You need a valid CSS selector.
Your selector is invalid because the id contains slashes, and an unescaped slash is not allowed in a CSS identifier. The parser stops at the first slash and reports the rest as an unexpected token.
https://www.w3.org/TR/CSS2/syndata.html#value-def-identifier

In CSS, identifiers (including element names, classes, and IDs in selectors) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F".

Using getElementById() works only because it does not parse its argument as a CSS identifier; it compares the raw id string.
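The escaping rule quoted above can be sketched as a small helper. This is only an illustration: `CssIdentifierEscaper` and `escape` are hypothetical names, not part of jsoup, and the logic follows the identifier serialization rules that browsers expose as `CSS.escape`, written from memory rather than taken from any jsoup commit.

```java
/** Hypothetical helper (not a jsoup API): escapes a raw id or class name for use in a CSS selector. */
final class CssIdentifierEscaper {
    private CssIdentifierEscaper() {}

    static String escape(String ident) {
        StringBuilder out = new StringBuilder();
        int length = ident.length();
        for (int i = 0; i < length; ) {
            int cp = ident.codePointAt(i);
            boolean first = i == 0;
            boolean secondAfterHyphen = i == 1 && ident.charAt(0) == '-';
            if (cp == 0) {
                // NULL cannot be represented; substitute the replacement character
                out.append('\uFFFD');
            } else if (cp <= 0x1F || cp == 0x7F || (isDigit(cp) && (first || secondAfterHyphen))) {
                // control characters and leading digits need a numeric escape: \HEX + space
                out.append('\\').append(Integer.toHexString(cp)).append(' ');
            } else if (first && cp == '-' && length == 1) {
                // a lone hyphen is not a valid identifier on its own
                out.append("\\-");
            } else if (cp >= 0x80 || cp == '-' || cp == '_' || isDigit(cp) || isLetter(cp)) {
                // characters that are legal in an identifier pass through unchanged
                out.appendCodePoint(cp);
            } else {
                // everything else (/, &, ?, ., space, ...) gets a backslash escape
                out.append('\\').appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    private static boolean isDigit(int cp) {
        return cp >= '0' && cp <= '9';
    }

    private static boolean isLetter(int cp) {
        return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z');
    }
}
```

With this helper, the id from the report becomes `ProductSummary-\/productsummary\/legalEntityId`. Whether `"#" + escape(id)` is then accepted by `select()` depends on the jsoup version; per the milestone on this issue, escape support is associated with 1.15.4.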

@hoogenbj
Author

hoogenbj commented Feb 24, 2017 via email

@cketti
Contributor

cketti commented Feb 24, 2017

Looks like the method to parse CSS identifiers is incomplete.

/**
 * Consume a CSS identifier (ID or class) off the queue (letter, digit, -, _)
 * http://www.w3.org/TR/CSS2/syndata.html#value-def-identifier
 * @return identifier
 */
public String consumeCssIdentifier() {
    int start = pos;
    while (!isEmpty() && (matchesWord() || matchesAny('-', '_')))
        pos++;
    return queue.substring(start, pos);
}

The linked version of the CSS specification (CSS2) contains this:

In CSS, identifiers (including element names, classes, and IDs in selectors) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F".
Note that Unicode is code-by-code equivalent to ISO 10646 (see [UNICODE] and [ISO10646]).

So the / character needs to be escaped in CSS identifiers. Unfortunately, the current code doesn't support that.
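For illustration, an escape-aware version of that loop might look like the sketch below. It is not the fix that later landed in jsoup: `CssIdentifierReader` is a hypothetical standalone class with its own `queue` and `pos` fields, and the escape handling follows the CSS rules as I understand them (a backslash followed by up to six hex digits and one optional whitespace character, or a backslash followed by any other non-newline character).

```java
/** Hypothetical standalone reader (not jsoup's TokenQueue): consumes a CSS identifier, decoding escapes. */
final class CssIdentifierReader {
    private final String queue;
    private int pos = 0;

    CssIdentifierReader(String input) {
        this.queue = input;
    }

    /** Consumes an identifier from the current position and returns it with escapes decoded. */
    String consumeCssIdentifier() {
        StringBuilder out = new StringBuilder();
        while (pos < queue.length()) {
            char c = queue.charAt(pos);
            if (isIdentChar(c)) {
                out.append(c);
                pos++;
            } else if (c == '\\' && pos + 1 < queue.length() && queue.charAt(pos + 1) != '\n') {
                pos++; // skip the backslash
                consumeEscapeInto(out);
            } else {
                break; // first character that cannot be part of the identifier
            }
        }
        return out.toString();
    }

    /** Returns the unconsumed rest of the input. */
    String remainder() {
        return queue.substring(pos);
    }

    private void consumeEscapeInto(StringBuilder out) {
        int start = pos;
        while (pos < queue.length() && pos - start < 6 && isHex(queue.charAt(pos)))
            pos++;
        if (pos == start) {
            // literal escape such as \/ or \&: take the next code point as-is
            int cp = queue.codePointAt(pos);
            out.appendCodePoint(cp);
            pos += Character.charCount(cp);
            return;
        }
        // numeric escape such as \26 or \3F
        int cp = Integer.parseInt(queue.substring(start, pos), 16);
        boolean invalid = cp == 0 || cp > Character.MAX_CODE_POINT || (cp >= 0xD800 && cp <= 0xDFFF);
        out.appendCodePoint(invalid ? 0xFFFD : cp);
        // a single whitespace character after the hex digits terminates the escape
        if (pos < queue.length() && Character.isWhitespace(queue.charAt(pos)))
            pos++;
    }

    private static boolean isIdentChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
                || c == '-' || c == '_' || c >= 0x80;
    }

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }
}
```

The sketch simplifies one detail: a `\r\n` pair after a numeric escape is treated as two characters, so only the `\r` is skipped. It also does not enforce the rules about how an identifier may start.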

@DulithaRanatunga

I'm not sure if it should be its own issue or not, but the incompleteness of consumeCssIdentifier() also causes Element.cssSelector() to fail if any ancestor nodes have an escaped or unicode character. For example:

String html = "<html><body><div class=\"B\\&W\\?\"><div class=\"test\">Parsed HTML into a doc.</div></div></body></html>";
Jsoup.parse(html).select(".test").get(0).cssSelector();

Throws Exception in thread "main" org.jsoup.select.Selector$SelectorParseException: Could not parse query 'div.B\&W\?': unexpected token at '\&W\?'

This happens because cssSelector() builds each selector in its chain from the raw class names, without escaping them, and then executes the resulting invalid selector:

String classes = StringUtil.join(classNames(), ".");
if (classes.length() > 0)
    selector.append('.').append(classes);
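One way to avoid generating an invalid selector here is to escape each class name before joining. The sketch below is an assumption about how that could look, not the change that was eventually committed; `ClassSelectorBuilder`, `selectorFor`, and `escape` are hypothetical names, and the escaping is deliberately minimal (it backslash-escapes ASCII characters that are not letters, digits, `-`, or `_`, and does not handle leading digits or newlines).

```java
/** Hypothetical sketch (not jsoup code): builds "tag.class1.class2" with each class name escaped. */
final class ClassSelectorBuilder {
    private ClassSelectorBuilder() {}

    static String selectorFor(String tagName, String... classNames) {
        StringBuilder selector = new StringBuilder(tagName);
        for (String name : classNames)
            selector.append('.').append(escape(name));
        return selector.toString();
    }

    /** Minimal escaping: prefixes a backslash to ASCII characters that are not legal in an identifier. */
    static String escape(String ident) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < ident.length(); i++) {
            char c = ident.charAt(i);
            boolean legal = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
                    || c == '-' || c == '_' || c >= 0x80;
            if (!legal)
                out.append('\\');
            out.append(c);
        }
        return out.toString();
    }
}
```

Note that in the HTML above the class attribute literally contains backslashes (`B\&W\?`), because HTML does not interpret CSS escapes. The escaped selector for that element is therefore `div.B\\\&W\\\?`, with the backslashes themselves escaped.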

jhy changed the title from "jsoup again unable to parse an id with a hyphen in it" to "CSS identifier escapes are not supported" on Jan 3, 2022
jhy closed this as completed in e61f688 on Jan 19, 2023
jhy self-assigned this on Jan 19, 2023
jhy added this to the 1.15.4 milestone on Jan 19, 2023
jhy added a commit that referenced this issue on Jan 19, 2023
@jhy
Owner

jhy commented Jan 19, 2023

I'm not sure if it should be its own issue or not, but the incompleteness of consumeCssIdentifier() also causes Element.cssSelector() to fail if any ancestor nodes have an escaped or unicode character.

This works now with bc2181d and preceding commit. Thanks!
