New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More strict parsing of hostname (authority) part of URLs #43
Conversation
Came up in a couple of places: #41, #29, #38, #28. Hopefully we can fix all of these with these changes. Not done yet, still want to have domain checking for URLs with certain schemes (https) but allow everything for others. If we do that, we may be able to unify the email and plain domain parsing with the scheme one too.
Just need to think about how to handle port at the end, and optional trailing dot.
This is what I was aiming for :). It's a pretty complex function but I think it makes sense that the logic is unified. If we wanted to separate out something it would probably be the require_host = false case. Then we'd just have to duplicate the userinfo parsing.
Simplifies the code a bit, no external change.
The check was necessary before because we didn't check host names properly. Now that we reject a `@` as part of a hostname, it won't be recognized as a plain domain link anymore, so we don't need to check for emails. Makes the code nicer and faster - we'd do the email scan for every `.` trigger before.
As a nice side effect of this change, the benches have improved as well, 15% for some (comparing 9a6ce39 with 5bfb516):
|
//! authority = [ userinfo "@" ] host [ ":" port ] | ||
//! | ||
//! | ||
//! userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor detail, but you never defined pct-encoded
in this block while unreserved
and sub-delims
were defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll add that. Percent-encoding in the authority is not actually handled at the moment, but I'm not sure it's necessary. Have you seen any URLs using percent-encoding in the authority part in the wild? I think we can add it later if anyone requests it.
.take_while(|c| c.is_ascii_alphabetic()) | ||
.take(2) | ||
.count() | ||
>= 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confused by that. How can this be bigger than 2 if we take
2 elements from the iterator? Perhaps you also want to add some unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good eye. You're right, it can't, using >= 2
and == 2
have exactly the same result here. But I still prefer to write it this way because it communicates the intent of the code more clearly, which is that we need at least 2 letters (not exactly 2 letters). I also like it because you could take away the .take(2)
optimization and it would still be correct, whereas the == 2
variant would be incorrect without the take.
Thanks for your work and congratulations to the new release. |
Applies to emails, plain domains URLs (e.g.
example.com/foo
) and URLs with schemes where a host is expected (e.g.https
).This fixes a few problems that have been reported over time, namely:
https://www.example..com
is no longer parsed as an URL (Matches URLs with consecutive periods #41)foo@v1.1.1
is no longer parsed as an email address (Github action package versions categorized as mail address #29)https://*.example.org
is no longer parsed as an URL (Add support for skipping wildcard URLs #38)It's a tricky change and hopefully this solves some problems while not introducing too many new ones. If anything unexpectedly changed for you, please let us know!