Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Github action package versions categorized as mail address #29

Closed
mre opened this issue Jan 7, 2022 · 7 comments
Closed

Github action package versions categorized as mail address #29

mre opened this issue Jan 7, 2022 · 7 comments

Comments

@mre
Copy link
Contributor

mre commented Jan 7, 2022

In the README.md for lychee-action, there is a YAML file with the following content: lycheeverse/lychee-action@v1.1.1. This gets detected as an email address.
It's technically correct, because there is no upper limit on the number of dots in an address and a TLD is also not strictly required. However I'd argue that it's at least a surprising edge case.

Python considers it valid

Python 3.9.9 (main, Nov 21 2021, 03:16:13)
[Clang 13.0.0 (clang-1300.0.29.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from email.utils import parseaddr
>>> parseaddr('lycheeverse/lychee-action@v1.1.1')
('', 'lycheeverse/lychee-action@v1.1.1')

while PHP doesn't:

php -r "var_dump(filter_var('lycheeverse/lychee-action@v1.1.1', FILTER_VALIDATE_EMAIL));"
bool(false)

Would you say it makes sense to add a heuristic for this case? Otherwise I'd add a check over on lychee itself.

@lebensterben
Copy link

I wish the following would help when grepping email addresses:


According to https://datatracker.ietf.org/doc/html/rfc3696#section-3, an email address is of the form local-part@domain.

  • In most common cases, the local-part contains ASCII printable characters except @ \ " , [ ]:
    • That is, it allows ASCII alpha-numerical characters and the following special characters ! # $ % & ' * + - / = ? ^ _ ` . { | } ~
    • . cannot be the first or last character of local-part
  • In rare cases, the above excluded characters, white space, or even control characters are allowed in local-part if it's quoted
    • A single characters is quoted by a /, e.g. foo\@@example.com
    • The entire local-part can be quoted by double quotes, e.g. "foo bar@example.com

https://datatracker.ietf.org/doc/html/rfc3696#section-2 gives the restrictions on domain part of an email address.

  • domain consists of at least one label separated by ., e.g. foo.bar.com.
    • A trailing . is used to signify the last label being the root (TLD), e.g. example.com..
  • Traditionally, label may contain only ASCII alphanumerical characters and -.
    • Additional special characters including control characters are theoretically allowed if quoted. But in the context of email address we don't need to consider them.
    • But https://datatracker.ietf.org/doc/html/rfc3696#section-5 also specifies that label may contain non-ASCII characters for i18n.
    • So I propose that we can interpret this part as labels accept ASCII alphanumerical character, -, and any other unicode characters except special characters, control characters, or punctuation characters.
  • In addition, the TLD is expected not to be all numeric.

Length wise, section 2 and 3 of RFC 3696 says:

  • local-part has a maximum length of 64 bytes.
  • domain has a maximum length of 255 bytes.
    • Each label in a domain has a maximum length of 63 bytes.

@lebensterben
Copy link

For example, the following "URI" is not a valid email address:

github.com/golangci/golangci-lint/cmd/golangci-lint@v1.43.0

When parsing it, we found it's not quoted by ". And we only see ASCII alphabetical characters, -, and / until encounter @.

It also happens to be that the substring before @ is 51 bytes long and so it's okay to assume it's the local-part of an email address.

Then it sees v1.43.0 which seems like the domain part with three labels. Everything seems okay except that the last label, or the root, should be a TLD which cannot be all-numeric!.

Thus v1.43.0 cannot be the domain part of an email address and the whole string is not a valid email address.

@robinst
Copy link
Owner

robinst commented Jan 30, 2022

Thus v1.43.0 cannot be the domain part of an email address and the whole string is not a valid email address.

Hmm yeah. It might be worth looking at adding some code trying to detect this. Note that in the README, we explicitly state that we don't try to support IP addresses and quoting, so it's ok for us to exclude something like @127.0.0.1.

So if we only support domain names, we can check if the last part of the domain (TLD) is numeric only, and if it is, reject it. Here's an extensive comment with references supporting that: https://stackoverflow.com/a/53875771/305973

Does anyone want to have a go at this?

@lebensterben
Copy link

More bugs:


A tried with the following cases one by one:

hi@example.org- -> mailto:hi@example.org and -. (Incorrect, since domain label cannot ends with - so this entire string is invalid)

hi@-example.org -> hi@-example.org (Correct)

hi@-example-.org -> hi@-example-.org (Correct)

hi.s@-example.org- -> https://robinst.github.io/linkify/hi.s@-example.org- (Incorrect, domain label cannot start or end with - so the entire string is invalid)


Then I tried with putting them together:

hi@example.org-
hi@-example.org-
hi@-example.org
hi.s@-example.org-

->

mailto:hi@example.org
hi@-example.org-
hi@-example.org
hi.s@-example.org-

Note that this time the last line gets different results as it's no longer parsed as URL.

@robinst
Copy link
Owner

robinst commented Feb 1, 2022

hi@example.org- is linkified as hi@example.org-, which you could argue is similar to e.g. how hi@example.org) is treated: The email is extracted and ends before the illegal character, and the extra character is treated as delimiter text. So I don't see a problem there. You could argue that we should reject it in the case of - but I'm not sure about that.

hi.s@-example.org- being linked is actually because the Allow URLs without scheme (e.g. example.com) option is turned on and it's recognized as a URL. If you turn that off, it's no longer recognized. That's definitely a bug. We should probably do stricter domain checking for both emails and scheme-less URLs.

robinst added a commit that referenced this issue Jun 27, 2022
Came up in a couple of places: #41, #29, #38, #28. Hopefully we can fix
all of these with these changes.

Not done yet, still want to have domain checking for URLs with certain
schemes (https) but allow everything for others.

If we do that, we may be able to unify the email and plain domain
parsing with the scheme one too.
@robinst
Copy link
Owner

robinst commented Jul 11, 2022

A fix for this has been released with 0.9.0, see here: https://github.com/robinst/linkify/blob/main/CHANGELOG.md#090---2022-07-11

@robinst robinst closed this as completed Jul 11, 2022
@lebensterben
Copy link

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants