More strict parsing of hostname (authority) part of URLs #43

Merged
merged 13 commits into from Jul 11, 2022

robinst
Owner

@robinst robinst commented Jun 27, 2022

Applies to emails, plain domain URLs (e.g. example.com/foo), and URLs with schemes where a host is expected (e.g. https).

This fixes a few problems that have been reported over time, namely:

It's a tricky change and hopefully this solves some problems while not introducing too many new ones. If anything unexpectedly changed for you, please let us know!

Came up in a couple of places: #41, #29, #38, #28. Hopefully we can fix
all of these with these changes.

Not done yet, still want to have domain checking for URLs with certain
schemes (https) but allow everything for others.

If we do that, we may be able to unify the email and plain domain
parsing with the scheme one too.
Just need to think about how to handle port at the end, and optional
trailing dot.

This is what I was aiming for :). It's a pretty complex function but I
think it makes sense that the logic is unified.

If we wanted to separate out something it would probably be the
require_host = false case. Then we'd just have to duplicate the userinfo
parsing.

Simplifies the code a bit, no external change.

The check was necessary before because we didn't check host names
properly. Now that we reject a `@` as part of a hostname, it won't be
recognized as a plain domain link anymore, so we don't need to check for
emails.

Makes the code nicer and faster - we'd do the email scan for every `.`
trigger before.
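
To illustrate why rejecting `@` in a hostname makes the separate email check unnecessary, here is a minimal sketch of a strict host check. It is not the code from this PR: `looks_like_domain` is a hypothetical helper, and it assumes ASCII-only labels and at least two labels, which is narrower than what the library may accept.

```rust
/// Hypothetical helper (not the library's implementation): returns true
/// if `s` looks like a plain domain name made of ASCII labels.
fn looks_like_domain(s: &str) -> bool {
    // Allow a single optional trailing dot, e.g. "example.com."
    let s = s.strip_suffix('.').unwrap_or(s);
    if s.is_empty() {
        return false;
    }
    let mut labels = 0;
    for label in s.split('.') {
        // Each label must be non-empty, consist of ASCII letters, digits
        // or '-', and must not start or end with '-'.
        let ok = !label.is_empty()
            && !label.starts_with('-')
            && !label.ends_with('-')
            && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-');
        if !ok {
            return false;
        }
        labels += 1;
    }
    // Assumption for this sketch: a plain domain needs at least two labels.
    labels >= 2
}

fn main() {
    assert!(looks_like_domain("example.com"));
    assert!(looks_like_domain("example.com."));
    // '@' is not a valid label character, so an email address is never
    // mistaken for a plain domain and no extra email scan is needed.
    assert!(!looks_like_domain("foo@example.com"));
    assert!(!looks_like_domain("example..com"));
    println!("ok");
}
```

Because the `@` is rejected by the label check itself, the caller does not have to look backwards for an email address every time a `.` triggers a scan.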
Owner Author

robinst commented Jun 27, 2022

As a nice side effect of this change, the benches have improved as well, by around 15% for some (comparing 9a6ce39 with 5bfb516):

no_links                time:   [25.885 ns 25.892 ns 25.901 ns]
                        change: [-2.2092% -2.0791% -1.9485%] (p = 0.00 < 0.05)
                        Performance has improved.

some_links              time:   [314.23 ns 314.37 ns 314.54 ns]
                        change: [-2.6413% -2.1750% -1.7947%] (p = 0.00 < 0.05)
                        Performance has improved.

heaps_of_links          time:   [1.0488 us 1.0502 us 1.0515 us]
                        change: [-14.292% -14.133% -13.997%] (p = 0.00 < 0.05)
                        Performance has improved.

some_links_without_scheme
                        time:   [389.63 ns 390.09 ns 390.54 ns]
                        change: [-15.163% -15.040% -14.905%] (p = 0.00 < 0.05)
                        Performance has improved.

//! authority = [ userinfo "@" ] host [ ":" port ]
//!
//!
//! userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
Contributor

Minor detail, but you never defined pct-encoded in this block while unreserved and sub-delims were defined.

Owner Author

Thanks, I'll add that. Percent-encoding in the authority is not actually handled at the moment, but I'm not sure it's necessary. Have you seen any URLs using percent-encoding in the authority part in the wild? I think we can add it later if anyone requests it.
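
For readers following the grammar in the doc comment, here is a rough sketch of how `authority = [ userinfo "@" ] host [ ":" port ]` can be split into its parts. This is not the parser from this PR: the `Authority` struct and `split_authority` function are hypothetical names, the sketch does not validate the host or userinfo characters, and it does not handle bracketed IP literals or percent-encoding.

```rust
/// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq)]
struct Authority<'a> {
    userinfo: Option<&'a str>,
    host: &'a str,
    port: Option<&'a str>,
}

/// Splits `s` according to `[ userinfo "@" ] host [ ":" port ]`.
/// Returns None if the host is empty or the port is not all digits.
fn split_authority(s: &str) -> Option<Authority<'_>> {
    // userinfo may itself contain ':', so the '@' has to be handled first.
    let (userinfo, rest) = match s.rsplit_once('@') {
        Some((user, rest)) => (Some(user), rest),
        None => (None, s),
    };
    // Whatever follows the last ':' is the port; it must consist of digits
    // (the RFC 3986 grammar also allows it to be empty).
    let (host, port) = match rest.rsplit_once(':') {
        Some((host, port)) if port.chars().all(|c| c.is_ascii_digit()) => {
            (host, Some(port))
        }
        Some(_) => return None,
        None => (rest, None),
    };
    if host.is_empty() {
        return None;
    }
    Some(Authority { userinfo, host, port })
}

fn main() {
    let parsed = split_authority("user:pw@example.com:8080");
    println!("{:?}", parsed);
    assert_eq!(split_authority("example.com:abc"), None);
}
```

Handling the `@` before the `:` matters because `user:pw@example.com` would otherwise be split at the wrong colon.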

.take_while(|c| c.is_ascii_alphabetic())
.take(2)
.count()
>= 2
Contributor

Confused by that. How can this be bigger than 2 if we take 2 elements from the iterator? Perhaps you also want to add some unit tests.

Owner Author

Good eye. You're right, it can't; using >= 2 and == 2 gives exactly the same result here. But I still prefer to write it this way because it communicates the intent of the code more clearly, which is that we need at least 2 letters (not exactly 2 letters). I also like it because you could take away the .take(2) optimization and it would still be correct, whereas the == 2 variant would be incorrect without the take.
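
The point about `>= 2` versus `== 2` can be checked with a small standalone sketch. The function names are made up for illustration, and `s.chars()` stands in for whatever iterator the real code uses before `.take_while`.

```rust
/// Same shape as the reviewed snippet: stop counting after 2 letters.
fn at_least_two_letters(s: &str) -> bool {
    s.chars()
        .take_while(|c| c.is_ascii_alphabetic())
        .take(2)
        .count()
        >= 2
}

/// The `.take(2)` optimization removed: `>= 2` still gives the same answer.
fn at_least_two_letters_no_take(s: &str) -> bool {
    s.chars().take_while(|c| c.is_ascii_alphabetic()).count() >= 2
}

/// `== 2` without the take: wrong for inputs with more than 2 letters.
fn exactly_two_letters_no_take(s: &str) -> bool {
    s.chars().take_while(|c| c.is_ascii_alphabetic()).count() == 2
}

fn main() {
    for s in ["c1", "co", "com"] {
        println!(
            "{:?}: take+>= {} | >= {} | == {}",
            s,
            at_least_two_letters(s),
            at_least_two_letters_no_take(s),
            exactly_two_letters_no_take(s)
        );
    }
}
```

For `"com"` the first two functions agree, while the `== 2` variant without the take reports false, which is the failure mode described above.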

@robinst robinst merged commit b6ad06e into main Jul 11, 2022
@robinst robinst deleted the check-domains branch July 11, 2022 05:09
Contributor

mre commented Jul 11, 2022

Thanks for your work and congratulations on the new release.
