More strict parsing of hostname (authority) part of URLs #43

robinst · 2022-06-27T07:18:32Z

Applies to emails, plain domain URLs (e.g. example.com/foo), and URLs with schemes where a host is expected (e.g. https).

This fixes a few problems that have been reported over time, namely:

https://www.example..com is no longer parsed as a URL (Matches URLs with consecutive periods #41)
foo@v1.1.1 is no longer parsed as an email address (Github action package versions categorized as mail address #29)
https://*.example.org is no longer parsed as a URL (Add support for skipping wildcard URLs #38)
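To make the three cases above concrete, here is a minimal sketch of a label-based host check. It is not the code from this PR: the function name `is_plausible_domain` is hypothetical, and the rules (ASCII only, at least two labels, final label starting with two letters) are simplifying assumptions chosen to reproduce the listed examples.

```rust
/// Hypothetical helper, not the crate's actual implementation.
/// Returns true if `host` looks like a plausible plain domain name.
fn is_plausible_domain(host: &str) -> bool {
    // Assumption: a single optional trailing dot is allowed ("example.com.").
    let host = host.strip_suffix('.').unwrap_or(host);

    let labels: Vec<&str> = host.split('.').collect();
    // Assumption: a plain domain needs at least two labels.
    if labels.len() < 2 {
        return false;
    }

    // Every label must be non-empty (this rejects consecutive periods) and
    // consist only of ASCII letters, digits, or '-' (this rejects '*').
    let labels_ok = labels.iter().all(|label| {
        !label.is_empty() && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
    });

    // The final label must start with at least two ASCII letters,
    // which rejects version-like strings such as "v1.1.1".
    let last = labels[labels.len() - 1];
    let last_ok = last
        .chars()
        .take_while(|c| c.is_ascii_alphabetic())
        .take(2)
        .count()
        >= 2;

    labels_ok && last_ok
}
```

With these assumptions, `www.example.com` and `example.com.` pass, while `www.example..com`, `v1.1.1`, and `*.example.org` are rejected. A real implementation would also need to decide how to treat internationalized domain names and single-label hosts such as `localhost`.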

It's a tricky change and hopefully this solves some problems while not introducing too many new ones. If anything unexpectedly changed for you, please let us know!

Came up in a couple of places: #41, #29, #38, #28. Hopefully we can fix all of them with these changes. Not done yet: I still want domain checking for URLs with certain schemes (e.g. https) while allowing everything for others. If we do that, we may also be able to unify the email and plain domain parsing with the scheme parsing.

Just need to think about how to handle port at the end, and optional trailing dot.

This is what I was aiming for :). It's a pretty complex function but I think it makes sense that the logic is unified. If we wanted to separate out something it would probably be the require_host = false case. Then we'd just have to duplicate the userinfo parsing.

Simplifies the code a bit, no external change.

The check was necessary before because we didn't check host names properly. Now that we reject a `@` as part of a hostname, such input won't be recognized as a plain domain link anymore, so we don't need to check for emails. This makes the code nicer and faster: previously, we ran the email scan for every `.` trigger.

robinst · 2022-06-27T07:43:29Z

As a nice side effect of this change, the benchmarks have improved as well, by about 15% for some (comparing 9a6ce39 with 5bfb516):

no_links                time:   [25.885 ns 25.892 ns 25.901 ns]
                        change: [-2.2092% -2.0791% -1.9485%] (p = 0.00 < 0.05)
                        Performance has improved.

some_links              time:   [314.23 ns 314.37 ns 314.54 ns]
                        change: [-2.6413% -2.1750% -1.7947%] (p = 0.00 < 0.05)
                        Performance has improved.

heaps_of_links          time:   [1.0488 us 1.0502 us 1.0515 us]
                        change: [-14.292% -14.133% -13.997%] (p = 0.00 < 0.05)
                        Performance has improved.

some_links_without_scheme
                        time:   [389.63 ns 390.09 ns 390.54 ns]
                        change: [-15.163% -15.040% -14.905%] (p = 0.00 < 0.05)
                        Performance has improved.

mre · 2022-07-07T12:31:37Z

src/domains.rs

+//! authority   = [ userinfo "@" ] host [ ":" port ]
+//!
+//!
+//! userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )


Minor detail, but you never defined pct-encoded in this block while unreserved and sub-delims were defined.

Thanks, I'll add that. Percent-encoding in the authority is not actually handled at the moment, but I'm not sure it's necessary. Have you seen any URLs using percent-encoding in the authority part in the wild? I think we can add it later if anyone requests it.
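For readers following the grammar quoted above, here is a rough sketch of how `authority = [ userinfo "@" ] host [ ":" port ]` can be split into its parts. The `Authority` type and `split_authority` function are hypothetical names for illustration only. The sketch does not validate userinfo characters, does not handle percent-encoding, and does not support bracketed IP literals such as `[::1]`.

```rust
/// Hypothetical type for illustration; the parts borrow from the input.
#[derive(Debug, PartialEq)]
struct Authority<'a> {
    userinfo: Option<&'a str>,
    host: &'a str,
    port: Option<&'a str>,
}

/// Splits an authority string into userinfo, host, and port.
/// Returns None if the host is empty, the host contains '@',
/// or the port contains anything other than ASCII digits.
fn split_authority(authority: &str) -> Option<Authority<'_>> {
    // userinfo, if present, ends at the first '@'.
    let (userinfo, rest) = match authority.find('@') {
        Some(i) => (Some(&authority[..i]), &authority[i + 1..]),
        None => (None, authority),
    };

    // A port, if present, follows the last ':' and consists of digits only.
    let (host, port) = match rest.rfind(':') {
        Some(i) => {
            let port = &rest[i + 1..];
            if !port.chars().all(|c| c.is_ascii_digit()) {
                return None;
            }
            (&rest[..i], Some(port))
        }
        None => (rest, None),
    };

    // A second '@' would end up in the host, which is not allowed.
    if host.is_empty() || host.contains('@') {
        return None;
    }

    Some(Authority { userinfo, host, port })
}
```

For example, `split_authority("user:pw@example.com:8080")` yields userinfo `user:pw`, host `example.com`, and port `8080`, while `a@b@c` is rejected because the host would contain `@`.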

mre · 2022-07-07T12:37:31Z

src/domains.rs

+        .take_while(|c| c.is_ascii_alphabetic())
+        .take(2)
+        .count()
+        >= 2


Confused by that. How can this be bigger than 2 if we take 2 elements from the iterator? Perhaps you also want to add some unit tests.

Good eye. You're right, it can't: using >= 2 and == 2 gives exactly the same result here. But I still prefer to write it this way because it communicates the intent of the code more clearly, which is that we need at least 2 letters (not exactly 2 letters). I also like it because you could take away the .take(2) optimization and it would still be correct, whereas the == 2 variant would be incorrect without the take.
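The point about `>= 2` versus `== 2` can be illustrated with a small sketch. The function names are hypothetical, and the snippet assumes the iterator in the reviewed code is `s.chars()`, which the quoted diff does not show.

```rust
/// Same shape as the reviewed code: at least two leading ASCII letters.
/// `.take(2)` only stops the scan early; it does not change the result.
fn at_least_two_letters(s: &str) -> bool {
    s.chars()
        .take_while(|c| c.is_ascii_alphabetic())
        .take(2)
        .count()
        >= 2
}

/// Without the `.take(2)` optimization, `>= 2` is still correct.
fn at_least_two_letters_no_take(s: &str) -> bool {
    s.chars().take_while(|c| c.is_ascii_alphabetic()).count() >= 2
}

/// Without `.take(2)`, `== 2` changes meaning: it now requires exactly
/// two leading letters and wrongly rejects longer runs such as "com".
fn exactly_two_letters_no_take(s: &str) -> bool {
    s.chars().take_while(|c| c.is_ascii_alphabetic()).count() == 2
}
```

The first two functions agree on every input, for example both accept `com` and reject `c0m`. The third accepts `co` but rejects `com`, which is the failure mode described above.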

mre · 2022-07-11T11:14:03Z

Thanks for your work and congratulations on the new release.

robinst added 10 commits June 22, 2022 15:34

Getting close

43dab40

Just need to think about how to handle port at the end, and optional trailing dot.

Move docs around a bit, clean up

50986dc

Split URL and domain scanner classes

c87557d

Simplifies the code a bit, no external change.

Expand schemes that require host

dd6f96c

Add CHANGELOG

d690243

Re-enable deny warnings

6978a90

Fix use of unstable feature on 1.46

b012eff

robinst mentioned this pull request Jun 27, 2022

Matches URLs with consecutive periods #41

Closed

This was referenced Jul 1, 2022

Do not check links that contain wildcards in CSP rules lycheeverse/lychee#604

Closed

String with URL and Email ignores finder.url_must_have_scheme ? #44

Closed

mre reviewed Jul 7, 2022


robinst added 3 commits July 11, 2022 12:12

Add pct-encoded to docs

74e3e39

Add test cases from #44 that now work correctly on this branch

0f5b2e9

Pin dev dep to fix build on 1.46

97152fa

robinst merged commit b6ad06e into main Jul 11, 2022

robinst deleted the check-domains branch July 11, 2022 05:09
