Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matches URLs with consecutive periods #41

Closed
federicofusco opened this issue Jun 9, 2022 · 7 comments
Closed

Matches URLs with consecutive periods #41

federicofusco opened this issue Jun 9, 2022 · 7 comments

Comments

@federicofusco
Copy link

Love this library, although I found that it will match urls with consecutive periods in the domain (e.i https://www.example..com).
I know that the RFCs are a mess with urls and (especially) emails so I was wondering if this is something to fix or done on purpose

@mre
Copy link
Contributor

mre commented Jun 9, 2022

According to https://stackoverflow.com/a/27142527 a "url with many dots is valid. However a domain name with multiple consecutive dots is not valid since the length of each label has to be more than 0."
So I'd say we should exclude them. I'm not a maintainer, though.

@robinst
Copy link
Owner

robinst commented Jun 13, 2022

Yeah, we currently don't check much about the domain. There's a few issues that I think could all be solved by properly checking domains. Hopefully I can take a look at that soon.

@robinst
Copy link
Owner

robinst commented Jun 13, 2022

So my initial thinking was: For http and https and similar schemes, require the host part to be valid according to DNS, i.e. reject things such as www.example..com. For other schemes, we'd leave it open. But having read the relevant RFC, https://datatracker.ietf.org/doc/html/rfc3986#page-21 is interesting:

This specification does not mandate a particular registered name
lookup technology and therefore does not restrict the syntax of reg-
name beyond what is necessary for interoperability. Instead, it
delegates the issue of registered name syntax conformance to the
operating system of each application performing URI resolution, and
that operating system decides what it will allow for the purpose of
host identification. A URI resolution implementation might use DNS,
host tables, yellow pages, NetInfo, WINS, or any other system for
lookup of registered names. However, a globally scoped naming
system, such as DNS fully qualified domain names, is necessary for
URIs intended to have global scope. URI producers should use names
that conform to the DNS syntax, even when use of DNS is not
immediately apparent, and should limit these names to no more than
255 characters in length.

Especially given the last sentence, I was thinking we could mandate DNS syntax for the authority part regardless of scheme.

But then I had a look through https://en.wikipedia.org/wiki/List_of_URI_schemes and found this example: facetime://+19995551234 which would then be rejected.

So maybe we do need to distinguish schemes, and only apply strict checking for some schemes (http, https, file, ftp, sftp ...?), and for plain domain names and email addresses.

@mre
Copy link
Contributor

mre commented Jun 17, 2022

We could start with a conservative set of schemes (like http and https) and add more if we encounter missing edge cases.

However false positives are probably worse than false negatives when it comes to link detection. At the moment https://example..com gets detected and I'd argue that facetime URLs are less common. So maybe rejecting consecutive periods for all schemes and adding facetime as an exception is the way to go?

robinst added a commit that referenced this issue Jun 27, 2022
Came up in a couple of places: #41, #29, #38, #28. Hopefully we can fix
all of these with these changes.

Not done yet, still want to have domain checking for URLs with certain
schemes (https) but allow everything for others.

If we do that, we may be able to unify the email and plain domain
parsing with the scheme one too.
@robinst
Copy link
Owner

robinst commented Jun 27, 2022

@mre and @federicofusco I pushed a PR overhauling domain parsing, see here: #43

It would be awesome if you could give it a try and check for regressions!

@federicofusco
Copy link
Author

Thanks for the PR! I'll check out the changes soon.

@robinst
Copy link
Owner

robinst commented Jul 11, 2022

The fix for this has been released as 0.9.0, see here: https://github.com/robinst/linkify/blob/main/CHANGELOG.md#090---2022-07-11

@robinst robinst closed this as completed Jul 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants