Matches URLs with consecutive periods #41

federicofusco · 2022-06-09T10:07:50Z

Love this library, although I found that it will match URLs with consecutive periods in the domain (e.g. https://www.example..com).
I know that the RFCs are a mess with URLs and (especially) emails, so I was wondering whether this is something to fix or is done on purpose.


mre · 2022-06-09T10:44:14Z

According to https://stackoverflow.com/a/27142527 a "url with many dots is valid. However a domain name with multiple consecutive dots is not valid since the length of each label has to be more than 0."
So I'd say we should exclude them. I'm not a maintainer, though.
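
To make the rule from the quoted answer concrete, here is a minimal sketch of a label check, assuming the host has already been extracted from the URL. The function name `has_valid_labels` is hypothetical and not part of linkify; the 63-byte limit is the DNS label limit.

```rust
/// Returns true if every dot-separated label in `host` is non-empty
/// and at most 63 bytes long (the DNS limit for a single label).
/// Hypothetical helper for illustration, not linkify's API.
fn has_valid_labels(host: &str) -> bool {
    // Allow one trailing dot, as used by fully qualified names.
    let host = host.strip_suffix('.').unwrap_or(host);
    !host.is_empty()
        && host
            .split('.')
            .all(|label| !label.is_empty() && label.len() <= 63)
}
```

With this check, `www.example.com` passes and `www.example..com` fails, because the split yields an empty label between the two periods.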

robinst · 2022-06-13T02:45:03Z

Yeah, we currently don't check much about the domain. There are a few issues that I think could all be solved by properly checking domains. Hopefully I can take a look at that soon.

robinst · 2022-06-13T07:51:56Z

So my initial thinking was: For http and https and similar schemes, require the host part to be valid according to DNS, i.e. reject things such as www.example..com. For other schemes, we'd leave it open. But having read the relevant RFC, https://datatracker.ietf.org/doc/html/rfc3986#page-21 is interesting:

This specification does not mandate a particular registered name
lookup technology and therefore does not restrict the syntax of reg-
name beyond what is necessary for interoperability. Instead, it
delegates the issue of registered name syntax conformance to the
operating system of each application performing URI resolution, and
that operating system decides what it will allow for the purpose of
host identification. A URI resolution implementation might use DNS,
host tables, yellow pages, NetInfo, WINS, or any other system for
lookup of registered names. However, a globally scoped naming
system, such as DNS fully qualified domain names, is necessary for
URIs intended to have global scope. URI producers should use names
that conform to the DNS syntax, even when use of DNS is not
immediately apparent, and should limit these names to no more than
255 characters in length.

Especially given the last sentence, I was thinking we could mandate DNS syntax for the authority part regardless of scheme.

But then I had a look through https://en.wikipedia.org/wiki/List_of_URI_schemes and found this example: facetime://+19995551234 which would then be rejected.

So maybe we do need to distinguish schemes, and only apply strict checking for some schemes (http, https, file, ftp, sftp ...?), and for plain domain names and email addresses.
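
A rough sketch of what scheme-dependent checking could look like, assuming a hand-picked list of strict schemes. The names `STRICT_SCHEMES` and `accept_url` are hypothetical, and this is not how linkify implements it. For brevity it ignores userinfo, ports and IP literals, and it leaves out `file` because `file:///path` has an empty host.

```rust
/// Hypothetical list of schemes whose host must follow DNS label syntax.
const STRICT_SCHEMES: &[&str] = &["http", "https", "ftp", "sftp"];

/// Every dot-separated label must be non-empty and at most 63 bytes.
fn labels_ok(host: &str) -> bool {
    !host.is_empty() && host.split('.').all(|l| !l.is_empty() && l.len() <= 63)
}

/// Accepts `scheme://authority...` links. Strict schemes get the label
/// check; any other scheme only needs a non-empty authority.
fn accept_url(url: &str) -> bool {
    let (scheme, rest) = match url.split_once("://") {
        Some(parts) => parts,
        None => return false,
    };
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| matches!(c, '/' | '?' | '#'))
        .next()
        .unwrap_or("");
    let strict = STRICT_SCHEMES
        .iter()
        .any(|s| s.eq_ignore_ascii_case(scheme));
    if strict {
        labels_ok(authority)
    } else {
        !authority.is_empty()
    }
}
```

Under these assumptions `https://www.example..com` is rejected while `facetime://+19995551234` is still accepted.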

mre · 2022-06-17T13:15:48Z

We could start with a conservative set of schemes (like http and https) and add more if we encounter missing edge cases.

However, false positives are probably worse than false negatives when it comes to link detection. At the moment https://example..com gets detected, and I'd argue that facetime URLs are less common. So maybe rejecting consecutive periods for all schemes and adding facetime as an exception is the way to go?

Came up in a couple of places: #41, #29, #38, #28. Hopefully we can fix all of these with these changes. Not done yet, still want to have domain checking for URLs with certain schemes (https) but allow everything for others. If we do that, we may be able to unify the email and plain domain parsing with the scheme one too.
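
For comparison, the alternative raised in this comment (reject consecutive periods for every scheme and exempt a few) might be sketched as follows. `EXEMPT_SCHEMES` and `rejects_consecutive_periods` are hypothetical names for illustration, not linkify's API.

```rust
/// Hypothetical exemption list: schemes whose authority is not a DNS name.
const EXEMPT_SCHEMES: &[&str] = &["facetime"];

/// Returns true if the link should be rejected because its authority
/// contains consecutive periods and its scheme is not exempt.
fn rejects_consecutive_periods(url: &str) -> bool {
    let (scheme, rest) = match url.split_once("://") {
        Some(parts) => parts,
        // Not a `scheme://authority` link; out of scope for this sketch.
        None => return false,
    };
    // Only look at the authority, not at the path, query or fragment.
    let authority = rest
        .split(|c: char| matches!(c, '/' | '?' | '#'))
        .next()
        .unwrap_or("");
    let exempt = EXEMPT_SCHEMES
        .iter()
        .any(|s| s.eq_ignore_ascii_case(scheme));
    !exempt && authority.contains("..")
}
```

The trade-off is that an unknown scheme with a non-DNS authority is checked by default until someone adds it to the exemption list.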

robinst · 2022-06-27T07:44:59Z

@mre and @federicofusco I pushed a PR overhauling domain parsing, see here: #43

It would be awesome if you could give it a try and check for regressions!

federicofusco · 2022-06-27T08:04:50Z

Thanks for the PR! I'll check out the changes soon.

robinst · 2022-07-11T05:16:26Z

The fix for this has been released as 0.9.0, see here: https://github.com/robinst/linkify/blob/main/CHANGELOG.md#090---2022-07-11

robinst mentioned this issue Jun 27, 2022

More strict parsing of hostname (authority) part of URLs #43

Merged

robinst closed this as completed Jul 11, 2022
