Support unicode email addresses #17

skeggse · 2016-01-06T00:47:34Z

See also buildmail:lib/buildmain.js line 890, and RFC 6530.

KyleAMathews · 2016-02-03T17:04:14Z

Just processed a few 100,000s of emails and ran into a number of unicode emails that were failing e.g. êjness@something.com

skeggse · 2016-02-03T18:08:50Z

Ah, good to know this applies to someone. I'll take a look. Do you think you could submit a PR that adds some useful test cases for Unicode?

navroopsingh · 2016-05-06T06:27:23Z

what's the unicode character range we are aiming to check for as valid?
charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit

skeggse · 2016-05-06T06:30:10Z

I haven't had a chance to read and digest the existing RFCs pertaining to unicode in email addresses. My intention is to follow the standard, which I hope will include the full unicode range. Sorry this hasn't moved forward, my school year finally ended, so I hope to have some time to work on this.

WesTyler · 2017-02-06T18:44:24Z

Re: RFC citations and full implementation -
Here are our findings that we believe should inform the work. Please let me know what's missing or not lining up with your expectations.

RFC 6530

Internationalized to utilize the full range of Unicode characters.
Unicode C0 and C1 control characters are prohibited by RFC 5321 and RFC 5198.
References v6 of the Unicode standard.
Permits the use of UTF-8 encoded strings in email addresses, both local parts and domain names.

skeggse · 2017-02-06T18:57:18Z

Thanks for summarizing your research, that's super helpful! I'm busy at this exact moment, I'll try to follow up with some questions in about an hour.

skeggse · 2017-02-06T20:02:42Z

Do you think the Unicode v6 stipulation is important for the implementation itself, given js/v8 support for unicode, broadly speaking? Do you think, in pursuit of supporting unicode email addresses, that this module should provide a normalization function which would punycode the domain as appropriate? In the latter case, would the normalization function need to apply any transformations to the local part (apart from stripping out comments and folding whitespace) to make it compatible with any SMTP implementations?

WesTyler · 2017-02-06T20:08:24Z

Do you think the Unicode v6 stipulation is important for the implementation itself, given js/v8 support for unicode, broadly speaking

I wouldn't think so, no. Just as a basic "support at minimum these characters" guide. I think you're right that the Node/v8 unicode support would cover the RFC.

From my understanding of the RFC (I most likely misunderstood at minimum one part :) ), punycoding should not be necessary. The RFC covers internationalization of not only the local/domain piece of email addresses, but "SMTP Extension for Internationalized Email Address" as well. I'm not 100% on that assertion though

WesTyler · 2017-02-06T20:10:43Z

This is the part of the RFC I'm looking at:
Especially "such transformations imply that the originating user or system must have ASCII-only addresses available for all senders and recipients"

Downgrading before and after SMTP Transactions
An important issue with these extensions is how to handle
interactions between systems that support non-ASCII addresses and
legacy systems that expect ASCII. There is, of course, no problem
with ASCII-only systems sending to those that can handle
internationalized forms because the ASCII forms are just a proper
subset. But, when systems that support these extensions send mail,
they MAY include non-ASCII addresses for senders, receivers, or both
and might also provide non-ASCII header information other than
addresses. If the extension is not supported by the first-hop system
(i.e., the SMTP server accessed by the submission server acting as an
SMTP client), message-originating systems SHOULD be prepared to
either send conventional envelopes and message headers or to return
the message to the originating user so the message may be manually
downgraded to the traditional form, possibly using encoded words
[RFC2047] in the message headers. Of course, such transformations
imply that the originating user or system must have ASCII-only
addresses available for all senders and recipients. Mechanisms by
which such addresses may be found or identified are outside the scope

skeggse · 2017-02-06T20:23:57Z

Thanks for that. I brought up punycode because at some point, in order to be effective for a delivery mechanism, the domain name must be looked up in the DNS to find the IP address to connect to. As such, it must follow the requirements of a valid domain name, which I recall as being a subset of ASCII characters. Whether the email address is valid for the envelope is a separate issue, and I'd like to find a more definitive specification than "possibly using encoded words."

WesTyler · 2017-02-06T20:25:32Z

Ah, ok, I'm with you. I'll keep digging through the RFC(s), but you may be right that punycoding the domain would be a useful feature.

skeggse · 2017-02-06T20:28:36Z

Yep. I have this dream (predicated on having free time) that I'll transform this library into a general-purpose email address validation, normalization and transformation library. There are parts of nodemailer that seem redundant wrt email address parsing and the subsequent paths of validation and delivery.

I also seem to recall that there was an RFC that proposed specific encodings for unicode compatibility for email addresses, which was maybe deprecated in favor of UTF-8? Not sure.

WesTyler · 2017-02-06T20:34:07Z

Oh, cool!

I've only been able to find UTF-8 support in the RFCs I've crawled through so far (6530 being the main reference).

skeggse · 2017-02-06T20:34:34Z

To that end, it might just make sense to pull in a parser generator (or maybe PEG if it's sufficiently powerful) which supports the right Unicode features. Then this entire library would basically be generating the appropriate errors. Not entirely sure that error handing/recovery in existing parser generators is sufficient, though - the existing implementation does its best to output usable error codes with the hope that they might let users know what they did wrong.

skeggse · 2017-02-06T20:35:44Z

Gotcha. Maybe that idea was set aside in favor of the better supported existing UTF-8 encoding.

WesTyler · 2017-02-06T20:43:55Z

So how would you feel about splitting this out into 2 bodies of work? One to allow for simple validation of emails containing non-ascii unicode characters, and one for building out the punycode normalizing function?

That way the basic email validation is useable sooner rather than later, and the long-term goal of the project doesn't get dropped off?

skeggse · 2017-02-06T20:51:05Z

Sounds good to me. Feel free to submit a pull request based on your findings (bonus points if it references specific sections of RFCs, though it sounds like some of the references are simply that there aren't many specific things to reference). I'd like it to pass #18, and not break existing tests. I'd also like you to run the PR against the code style checker for the hapijs org using lab (pretty sure that's just included in npm test).

One final thing to consider: I know UTF-8 has support for multibyte encoding of e.g. null characters. Pretty sure it's a non-issue for this, but just double-check that it won't have unintended impacts.

If you don't get around to this, I'll try to block out some time in the next day or so to take care of it.

WesTyler · 2017-02-06T20:52:50Z

Sounds great, thanks @skeggse. I'll see what we can pull out this afternoon/tomorrow AM. :)

skeggse · 2017-02-06T22:52:09Z

RE: comments on #18

@skeggse even with unicode character support, this fails on the DNS lookup because of iana.com vs iana.org.

Hm, gotcha.

This is a good problem to notice. This also indicates that unicode support in the domain names will break checkDNS, as the resolve family of functions are not unicode compatible or aware, which goes back to my question about punycode. It seems we'll need just a little more to avoid breaking existing functionality in unexpected ways.

I think it would be sufficient to add two tests:

one test that tests the three email addresses from this PR, but includes checkDNS: false in the options
one test (preferably in something like test/tmp.js) that uses rewire to verify that dns.resolveMx is called with the appropriately punycoded string

I'd like the latter test to be separate because I'd prefer to avoid mocking/mimicking separately defined functionality. It's not the worse, but I'd prefer to find a canonical example i18n domain we can rely on, and use that in the normal test process.

EDIT: maybe you could also pull in the other test example email addresses from #19

WesTyler · 2017-02-07T17:22:10Z

Relevant: nodejs/node#11218

WesTyler · 2017-02-07T18:02:02Z

So, here's the thing... Domain punycode is necessary, and the Node punycode API is being deprecated in favor of the community punycode.js library.

nodejs/node#11218
https://nodejs.org/api/punycode.html

As I move forward with this work I'm going to try out rewire and punycode.js.

WesTyler · 2017-02-07T18:21:41Z

@skeggse looks like using rewire for the dns call will require me to refactor const Dns = require('dns'); to let Dns = require('dns');.

I know that is technically against Hapi code standards since Dns is not reassigned inside of lib/index.js, but how do you feel about this?
Line 5: prefer-const - 'Dns' is never reassigned. Use 'const' instead.

jhnns/rewire#79

skeggse · 2017-02-07T18:25:20Z

Ah, interesting. I'm tempted to refactor the API, and upgrade isemail to a new major rev, without support for checkDNS - I feel that that's almost a separate concern from isemail itself, and this would require users to figure out what exactly they want to check for. For some users it might be sufficient if any record exists for the domain name. In other cases, they want to verify that the server will accept the given localpart as a mailbox. Maybe that's outside the scope of the work you want to do.

I think using let for the moment is fine, but I know hapijs prefers to keep module imports const, generally speaking.

WesTyler · 2017-02-07T18:27:17Z

Thanks! I'll use let for now.

Just as a side note, I would support a major rev that drops checkDNS. :D

skeggse · 2017-02-15T22:29:26Z

0f795c5

skeggse · 2017-02-15T22:35:34Z

@Marsup I know joi uses this package - do you think I should release this as 3.0.0 instead of 2.3.0 to avoid compatibility issues? I'm a little concerned that maybe some people depend on this rejecting unicode characters. Thoughts?

hueniverse · 2017-02-16T00:55:22Z

I am not worried about that "breaking change". I am worried about the new external dependency.

skeggse · 2017-02-16T02:51:01Z

Hm, granted. Alternative idea: I want to stop supporting the dns resolution check. As a result, we wouldn't need punycode.js, and would be able to simplify the API.

On that note, @WesTyler: I just realized that while we're checking the number of octets it takes to represent the domain as octets, it doesn't account for the extra four bytes, or the actual encoding scheme used by punycode. The simplest solution there is to just use punycode.js, but maybe we can get it to work without punycode (and then could get rid of the external dependency altogether).

Marsup · 2017-02-16T08:26:54Z

FWIW, this module is the one advised by node core itself, so I'd assume this is a safe dependency.
What do you consider breaking ? The fact that it now supports unicode characters ?

hueniverse · 2017-02-16T08:45:19Z

It is still an external dep that I need to review every time they change something. If there is a way not to require it, that would be preferred.

WesTyler · 2017-02-16T15:03:10Z

but maybe we can get it to work without punycode (and then could get rid of the external dependency altogether)

@skeggse so are you proposing removing the normalization, or rolling out an internal normalization implementation...?

skeggse · 2017-02-16T16:51:21Z

No, the normalization is here to stay, and it doesn't handle punycode anyway.

At the moment, it seems that dns checking is superfluous to the goals of this module, and getting rid of it would also mean we wouldn't need the output of punycode. We'd need to figure out how to calculate what a punycoded domain's length would be, though.

Down the line, though, I'd like to have this module provide parsing and normalization (to avoid duplicate processing of email addresses), which would necessitate punycode. So not entirely sure that's the right way to go, but it would solve the problem temporarily.

WesTyler · 2017-02-16T17:09:32Z

Oh, my bad, sorry. I forgot that the punycoding and normalization were 2 different steps for 2 different problems XD.

I agree with the judgement that dns checking (and therefore punycoding) is likely superfluous for this module right now. Is it worth opening up a new issue to isolate that effort from the Unicode work in this ticket and the merged PR? If so I can open it up and pull the info from above into it. I can probably carve out some time this afternoon or tomorrow morning to work on it if nobody else gets to it first.

skeggse · 2017-02-16T21:25:09Z

Yeah that sgtm

hueniverse added the enhancement label Jan 6, 2016

KyleAMathews mentioned this issue Feb 3, 2016

Test unicode emails #18

Closed

navroopsingh mentioned this issue May 7, 2016

my attempt to resolve #17 #19

Closed

skeggse mentioned this issue Jun 20, 2016

International email support #23

Closed

antony mentioned this issue Jun 20, 2016

Email validation doesn't allow emails which are compliant with RFC 6530 hapijs/joi#931

Closed

andris9 mentioned this issue Sep 13, 2016

SMTPConnection.prototype._parseAddressCommand is not compliant to RFC 5322 zone-eu/zone-mta#5

Closed

WesTyler mentioned this issue Feb 7, 2017

Support Unicode international characters #151

Merged

skeggse closed this as completed Feb 15, 2017

WesTyler mentioned this issue Feb 17, 2017

Remove dns lookup functionality and remove punycode dependency #152

Closed

skeggse assigned WesTyler Mar 23, 2018

skeggse unassigned WesTyler Apr 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support unicode email addresses #17

Support unicode email addresses #17

skeggse commented Jan 6, 2016

KyleAMathews commented Feb 3, 2016

skeggse commented Feb 3, 2016

navroopsingh commented May 6, 2016

skeggse commented May 6, 2016

WesTyler commented Feb 6, 2017 •

edited

skeggse commented Feb 6, 2017

skeggse commented Feb 6, 2017

WesTyler commented Feb 6, 2017 •

edited

WesTyler commented Feb 6, 2017 •

edited

skeggse commented Feb 6, 2017

WesTyler commented Feb 6, 2017

skeggse commented Feb 6, 2017

WesTyler commented Feb 6, 2017

skeggse commented Feb 6, 2017

skeggse commented Feb 6, 2017

WesTyler commented Feb 6, 2017

skeggse commented Feb 6, 2017

WesTyler commented Feb 6, 2017

skeggse commented Feb 6, 2017 •

edited

WesTyler commented Feb 7, 2017

WesTyler commented Feb 7, 2017

WesTyler commented Feb 7, 2017 •

edited

skeggse commented Feb 7, 2017

WesTyler commented Feb 7, 2017

skeggse commented Feb 15, 2017

skeggse commented Feb 15, 2017

hueniverse commented Feb 16, 2017

skeggse commented Feb 16, 2017

Marsup commented Feb 16, 2017

hueniverse commented Feb 16, 2017

WesTyler commented Feb 16, 2017 •

edited

skeggse commented Feb 16, 2017

WesTyler commented Feb 16, 2017

skeggse commented Feb 16, 2017

Support unicode email addresses #17

Support unicode email addresses #17

Comments

skeggse commented Jan 6, 2016

KyleAMathews commented Feb 3, 2016

skeggse commented Feb 3, 2016

navroopsingh commented May 6, 2016

skeggse commented May 6, 2016

WesTyler commented Feb 6, 2017 • edited

RFC 6530

skeggse commented Feb 6, 2017

skeggse commented Feb 6, 2017

WesTyler commented Feb 6, 2017 • edited

WesTyler commented Feb 6, 2017 • edited

skeggse commented Feb 6, 2017

WesTyler commented Feb 6, 2017

skeggse commented Feb 6, 2017

WesTyler commented Feb 6, 2017

skeggse commented Feb 6, 2017

skeggse commented Feb 6, 2017

WesTyler commented Feb 6, 2017

skeggse commented Feb 6, 2017

WesTyler commented Feb 6, 2017

skeggse commented Feb 6, 2017 • edited

WesTyler commented Feb 7, 2017

WesTyler commented Feb 7, 2017

WesTyler commented Feb 7, 2017 • edited

skeggse commented Feb 7, 2017

WesTyler commented Feb 7, 2017

skeggse commented Feb 15, 2017

skeggse commented Feb 15, 2017

hueniverse commented Feb 16, 2017

skeggse commented Feb 16, 2017

Marsup commented Feb 16, 2017

hueniverse commented Feb 16, 2017

WesTyler commented Feb 16, 2017 • edited

skeggse commented Feb 16, 2017

WesTyler commented Feb 16, 2017

skeggse commented Feb 16, 2017

WesTyler commented Feb 6, 2017 •

edited

WesTyler commented Feb 6, 2017 •

edited

WesTyler commented Feb 6, 2017 •

edited

skeggse commented Feb 6, 2017 •

edited

WesTyler commented Feb 7, 2017 •

edited

WesTyler commented Feb 16, 2017 •

edited