
:valid and :invalid support #34

Open
facelessuser opened this issue Dec 22, 2018 · 3 comments
Labels
C: css-level-4 (CSS level 4 selectors), P: maybe (pending approval of low priority request), T: feature

Comments

@facelessuser (Owner) commented Dec 22, 2018

Support for this would require quite a bit of work. We would need to write proper validators for each kind of input type. I am not sure when this will get done, but it is large enough to be a case unto itself.

This would include :in-range and :out-of-range, though since :in-range and :out-of-range are simpler, it is possible they could get implemented first. Moved to a separate issue.
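To give a sense of the shape of that work, here is a rough, purely hypothetical sketch of dispatching validation by input type (none of these names exist in Soup Sieve, and the element is assumed to behave like a Beautiful Soup Tag):

```python
def to_float(text):
    """Parse a floating point number, returning None on failure."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

def validate_number(el):
    """Very rough numeric check for <input type="number"> (hypothetical)."""
    num = to_float(el.get('value', ''))
    if num is None:
        return False
    mn, mx = to_float(el.get('min')), to_float(el.get('max'))
    if mn is not None and num < mn:
        return False
    if mx is not None and num > mx:
        return False
    return True

# One validator per input type; email, url, pattern, etc. would slot in here.
VALIDATORS = {
    'number': validate_number,
}

def matches_valid(el):
    """Would the element match :valid? Unhandled types pass by default."""
    itype = el.get('type', 'text').lower()
    validator = VALIDATORS.get(itype)
    return validator(el) if validator else True
```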

@facelessuser added the T: feature, selectors, C: css-level-4, and P: maybe labels on Dec 30, 2018
@facelessuser (Owner, Author) commented Jan 4, 2019

Some thoughts on this, and why I've added the maybe label.

In general, there is a lot of work to do here; none of it is impossible, but some of it is an excessive amount of work.

Validating emails is easy. Validating URLs is a bit more work, and validating patterns is a ton.
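For the email case, for example, one could lean on the regular expression the WHATWG HTML spec gives for what a conforming `<input type="email">` value looks like. A minimal sketch (the names here are made up, and handling of `multiple` is omitted):

```python
import re

# Pattern for a valid email address as given in the WHATWG HTML spec
# for <input type="email"> (single address only).
EMAIL_RE = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9]"
    r"(?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)

def email_is_valid(value):
    """Return True if the value would satisfy <input type="email">."""
    return bool(EMAIL_RE.match(value))
```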

Why is validating patterns a ton of work? Well, HTML basically uses the JavaScript regular expression engine to evaluate the patterns. We use Python, and Python's re != JavaScript's RegExp. JavaScript adds \cX escapes and \u{xxxx} escapes, it doesn't have lookbehinds, etc.

So how do you get over this hurdle?

  1. Option 1 would be to preprocess the patterns: invalidate a pattern if it contains unsupported JS regexp syntax, escape anything unsupported by JS regexp that Python would trigger on, and translate things like \cX and \u{xxxx} to the equivalent Python re syntax (see the sketch after this list). I've done similar things in https://github.com/facelessuser/backrefs. The work wouldn't be as big as it was in backrefs, as I would not need to have Unicode properties implemented, just exclude certain syntax via failure or escaping, and translate a few other syntax tokens.

  2. Option 2 would be to (optionally) require some library that provides bindings to something like the V8 JavaScript engine, to tap into a real JavaScript regexp implementation.
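As a rough, hypothetical sketch of option 1 (the helper name is made up; it only handles a couple of escapes and punts on everything else):

```python
import re

def js_pattern_to_python(pattern):
    r"""Best-effort translation of a JS regexp pattern into Python re syntax.

    Hypothetical helper: it only translates \cX control escapes and
    \u{xxxx} code point escapes, and rejects lookbehind outright because
    Python's re only allows fixed-width lookbehind. Escaped backslashes
    and the rest of the JS-only syntax are not handled in this sketch.
    """
    if re.search(r'\(\?<[=!]', pattern):
        return None  # punt rather than mistranslate

    def control(m):
        # \cJ -> control character 0x0A, re-expressed as a \xNN escape.
        return '\\x%02x' % ((ord(m.group(1).upper()) - ord('A')) + 1)

    def codepoint(m):
        # \u{1F600} -> \U0001F600 (Python's 8-digit escape, accepted by re).
        return '\\U%08x' % int(m.group(1), 16)

    pattern = re.sub(r'\\c([A-Za-z])', control, pattern)
    pattern = re.sub(r'\\u\{([0-9A-Fa-f]{1,6})\}', codepoint, pattern)
    return pattern
```

For example, `js_pattern_to_python(r'\u{1F600}+')` returns `\U0001f600+`, which `re.compile` will accept, while any pattern using lookbehind simply gets rejected.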

Anyways, it's a lot of work. Some validation (like the work done for :in-range and :out-of-range) is easy, and some is quite involved. The question is whether the payoff is worth the work required. All of it is doable, and all of it is well within my skill set to implement, but we'll have to see if the motivation/payoff ratio aligns with the workload required.

@facelessuser (Owner, Author) commented

JavaScript does allow lookbehinds. It didn't use to, but now it does in some browsers.

If this were going to be done in Python, it would have to be done with the regex library, as the re library doesn't support variable-width lookbehinds but JavaScript does.
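A minimal illustration of that difference (not anything Soup Sieve does today; regex here is the third-party library from PyPI):

```python
import re
import regex  # third-party regex module from PyPI

pattern = r'(?<=ab+)c'  # variable-width lookbehind, legal in modern JavaScript

try:
    re.compile(pattern)
except re.error as exc:
    print('re rejects it:', exc)  # re only allows fixed-width lookbehind

# The regex module accepts it.
print(regex.findall(pattern, 'abbc abc'))  # ['c', 'c']
```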

@facelessuser (Owner, Author) commented

After doing some work to address a bug in :placeholder-shown, it has come to my attention that browsers often do a bit of normalization before they compare things like the length of a value. Things like carriage returns may actually get normalized in a real browser environment if they are raw, but maybe not if they are inserted as an entity. I guess I kind of already knew this, but it's not something I really put much thought into until I started having to deal with it directly by coding logic around it.

Soup Sieve doesn't control such things; this is all handled by Beautiful Soup and the parsers. By the time Soup Sieve gets to look at the content of an input, it has already had entities turned into Unicode characters and other characters normalized (html5lib) or not normalized (html.parser and lxml). It may be difficult to mimic exactly what a browser would do in all cases because of this. html5lib is probably the closest in terms of how characters are handled, assuming it is doing what browsers do and not some generalized approximation.
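Roughly, the normalization in question looks like this (a sketch of what a browser-like comparison might need, not something Soup Sieve currently does):

```python
def normalize_newlines(value):
    """Collapse CRLF and lone CR into LF, the way browsers normalize line
    breaks in form control values before things like length comparisons.
    Sketch only: whether the parser has already done this depends on
    html5lib vs. html.parser/lxml, as noted above."""
    return value.replace('\r\n', '\n').replace('\r', '\n')
```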

In some implementations, we may just have to accept that we can only approximate how some selectors work, given the limitations of the environment.

@gir-bot removed the selectors label Nov 1, 2019