Validate local part unicode correctness #192

Closed
skeggse opened this issue Oct 12, 2018 · 3 comments


skeggse commented Oct 12, 2018

In addition to checking whether the email could contain Unicode code points, we should also ensure that the string representation can be converted to valid UTF-8.

The challenge here is that we don't want to pull in additional dependencies, so we'll need to accomplish this with just the Node.js built-ins.

@skeggse skeggse self-assigned this Oct 12, 2018

skeggse commented Oct 21, 2018

Per RFC 6531, Section 3.3, we'll want to ensure that the local part can be converted into valid UTF-8.

This includes validating that the local part does not include unpaired or misordered surrogates. When the string is iterated by code point, a valid surrogate pair is returned as a single non-surrogate code point, so we can detect misuse by rejecting any rune in the range U+D800 to U+DFFF. An inefficient solution for identifying the incorrect use of surrogates would be /^[\ud800-\udfff]$/.test(rune).
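A minimal sketch of the check described above, using only language built-ins. The helper name `hasLoneSurrogate` is hypothetical and is not taken from the module; the sketch assumes the local part arrives as an ordinary JavaScript (UTF-16) string and compares code points numerically rather than with the regular expression.

```javascript
'use strict';

// Hypothetical helper (name is an assumption, not the module's API).
// Returns true when the string contains an unpaired or misordered surrogate.
function hasLoneSurrogate(localPart) {
  // for...of iterates by code point: a well-formed surrogate pair is yielded
  // as one rune above U+FFFF, while a lone surrogate is yielded by itself.
  for (const rune of localPart) {
    const codePoint = rune.codePointAt(0);
    if (codePoint >= 0xd800 && codePoint <= 0xdfff) {
      return true;
    }
  }
  return false;
}
```

With this sketch, `hasLoneSurrogate('a\ud83d\ude00b')` is false because the pair is well-formed, while `hasLoneSurrogate('a\ud800b')` and the misordered `hasLoneSurrogate('\ude00\ud83d')` are both true.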

Testing for the actual validity of UTF-8 encoded data is outside the scope of this module. We expect email addresses to be provided in their UTF-16 form, in keeping with the bulk of the ECMAScript language specification.

Note that IDNs must be NFC-normalized, whereas the local part need merely be valid UTF-8. A normalized form is encouraged for the local part, but we must be permissive in what we accept.
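One way to test the NFC requirement for the domain with built-ins alone is to compare the string against its own normalized form. This is only a sketch; `isNfcNormalized` is a hypothetical name, and whether the module should reject or silently normalize a non-NFC domain is not decided here.

```javascript
'use strict';

// Hypothetical helper (an assumption for illustration).
// Returns true when the domain is already in Normalization Form C.
function isNfcNormalized(domain) {
  return domain.normalize('NFC') === domain;
}
```

For example, a domain spelled with the precomposed U+00E9 is already NFC, while the same text spelled as "e" followed by the combining acute accent U+0301 is not.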

Also note that our normalization routine may want to prefer the A-label form of the domain unless the local part contains Unicode characters.
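As a rough sketch of that preference, Node's built-in `url.domainToASCII` can produce the A-label form. The function name `normalizeDomain` is hypothetical, and leaving the domain untouched when the local part is non-ASCII is an assumption made for illustration; the comment above does not say what should happen in that case.

```javascript
'use strict';

const { domainToASCII } = require('url');

// Hypothetical helper (an assumption, not the module's normalization routine).
function normalizeDomain(localPart, domain) {
  // True when every code unit of the local part is in the ASCII range.
  const localPartIsAscii = /^[\x00-\x7f]*$/.test(localPart);

  if (localPartIsAscii) {
    // Prefer the A-label (Punycode) form. Note that domainToASCII returns
    // an empty string when the domain is invalid.
    return domainToASCII(domain);
  }

  // Assumption: keep the domain as provided when the local part already
  // requires SMTPUTF8.
  return domain;
}
```

Under these assumptions, `normalizeDomain('user', 'español.com')` yields `'xn--espaol-zwa.com'`, while a non-ASCII local part leaves the domain as given.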


skeggse commented Oct 21, 2018

Additionally, it's not clear to me whether labels may contain noncharacters. The general expectation is likely to preserve such characters, and there's nothing in SMTPUTF8 that mentions special treatment of these characters. I propose we accept such characters unless the IETF publishes errata clarifying the validity of noncharacters.

I'm also not seeing anything that suggests that the Unicode characters between U+007B and U+00C0 are not valid in labels, so we should likely drop that restriction (cc @WesTyler).

skeggse added a commit that referenced this issue Oct 21, 2018

skeggse commented Oct 21, 2018

Fixed in #193.

@skeggse skeggse closed this as completed Oct 21, 2018