
Attribute selectors vs \n in values #233


Closed
zverok opened this issue Nov 10, 2021 · 5 comments · Fixed by #234
Labels
S: confirmed Confirmed bug report or approved feature request. T: bug Bug.

Comments

@zverok

zverok commented Nov 10, 2021

Hi! Thanks for the powerful library.

I use it via BeautifulSoup, and I noticed this behavior:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><span title='foo bar'>foo1</span><span title='foo\nbar'>foo1</span></p>", 'html.parser')
print(*soup.select('span[title*="bar"]'))

I expected this to print both spans, but the actual output is

<span title="foo bar">foo1</span>

It seems that *= considers only the first line of a multi-line attribute value:

print(*soup.select('span[title*="foo"]'))

prints this:

<span title="foo bar">foo1</span> <span title="foo
bar">foo1</span>

Is this a bug, a conscious limitation, or is \n in attribute values against the standard?
Thanks!

@gir-bot gir-bot added the S: triage Issue needs triage. label Nov 10, 2021
@facelessuser
Owner

Compared to actual browser implementations, this appears to be a bug in SoupSieve: https://codepen.io/facelessuser/pen/MWvBoJm.

This fails because of the matching pattern: it uses .*, and the dot does not match newlines because the appropriate flag is not enabled.
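The dot behavior described above can be reproduced with the standard `re` module alone. This is a minimal sketch: the pattern `.*?bar.*` is an illustrative stand-in for a "contains" match and is not claimed to be Soup Sieve's exact internal pattern.

```python
import re

# Attribute value from the report: "bar" sits on the second line.
value = "foo\nbar"

# Without re.DOTALL, "." stops at the newline, so the match anchored at the
# start of the string can never reach "bar".
plain = re.compile(r".*?bar.*")

# With re.DOTALL, "." also matches "\n".
dotall = re.compile(r".*?bar.*", re.DOTALL)

no_flag_hit = plain.match(value) is not None
dotall_hit = dotall.match(value) is not None

# "foo" is on the first line, so it is found even without the flag, which is
# why the *="foo" selector in the report returned both spans.
first_line_hit = re.match(r".*?foo.*", value) is not None

print(no_flag_hit, dotall_hit, first_line_hit)
```

This mirrors the symptom in the report: text on the first line is found, text after the newline is not.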

This was simply a case (dealing with newlines) we did not specifically test. I should probably take a look at all the attribute-related patterns and compare them against browser behavior when including newlines.

@facelessuser
Owner

@gir-bot remove S: triage
@gir-bot add T: bug, S: confirmed

@gir-bot gir-bot added S: confirmed Confirmed bug report or approved feature request. T: bug Bug. and removed S: triage Issue needs triage. labels Nov 10, 2021
@facelessuser
Owner

Looks like it is a pretty simple fix. We just needed to enable re.DOTALL on our patterns for attribute selectors.
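As a rough illustration of what enabling `re.DOTALL` on attribute patterns could look like, here is a sketch using the standard `re` module. The helpers `compile_attr_pattern` and `attr_matches` are hypothetical names invented for this example, and the patterns are assumptions rather than Soup Sieve's actual implementation.

```python
import re


def compile_attr_pattern(op, value, flags=re.DOTALL):
    """Hypothetical helper: build a regex for a CSS attribute operator.

    Passing flags=0 reproduces the behavior before the fix.
    """
    escaped = re.escape(value)
    if op == "*=":  # value appears anywhere in the attribute
        return re.compile(r".*?" + escaped + r".*", flags)
    if op == "^=":  # attribute starts with value
        return re.compile(r"^" + escaped + r".*", flags)
    if op == "$=":  # attribute ends with value
        return re.compile(r".*?" + escaped + r"$", flags)
    raise ValueError("unsupported operator: " + op)


def attr_matches(op, value, attr, flags=re.DOTALL):
    """Return True if the attribute text satisfies the operator."""
    return compile_attr_pattern(op, value, flags).match(attr) is not None


multi_line = "foo\nbar"
print(attr_matches("*=", "bar", multi_line))           # DOTALL enabled
print(attr_matches("*=", "bar", multi_line, flags=0))  # flag disabled
```

In this sketch the "contains" and "ends with" operators depend on the flag for multi-line values, while "starts with" does not, because its match never needs to cross the newline.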

facelessuser added a commit that referenced this issue Nov 10, 2021

…tr (#234)

Fixes #233
@zverok
Author

zverok commented Nov 11, 2021

Wow. I don't know what to say. To handle such a peculiar request so quickly and graciously is absolutely awesome 😍
Thank you!

@facelessuser
Owner

> Wow. I don't know what to say. To handle such a peculiar request so quickly and graciously is absolutely awesome 😍
> Thank you!

It's actually not so peculiar. Yes, it is a bit odd to use newlines in attributes, but Soup Sieve's goal is to match real-world CSS selector behavior, as much as is practical and possible in the scraping environment. Ideally, we'd like to limit surprises and have things operate as close as possible to what people experience using selectors in real browsers. Real-world browsers handle such cases, so we should too 🙂 .

Before I wrote Soup Sieve, BeautifulSoup's selector behavior was quite limited and very quirky. Now, you can copy in most selectors and they should work pretty much as expected, meaning you don't have to think so hard about what this selector implementation supports, what it doesn't, or what it does differently.

I plan on cutting a release later today, so you should be able to pick the fix up soon.
