Support encoding emojis without encoding as surrogate pairs #368

Closed
tech4him1 opened this issue Sep 5, 2017 · 10 comments · Fixed by #369

Comments

@tech4him1
Contributor

tech4him1 commented Sep 5, 2017

Currently, if a YAML string contains an emoji and is run through safeDump, the emoji is written as an escaped surrogate pair (e.g. thiskey: "😀" is dumped as thiskey: "\uD83D\uDE00"). This is caused by lib/js-yaml/dumper.js#L468: JS charCodeAt returns UTF-16 code units, so an astral character comes back as the two halves of a surrogate pair (see https://mathiasbynens.be/notes/javascript-unicode). Would you be willing to at least make this configurable, so that we could choose whether to convert to surrogate pairs or write the emojis directly? If you want a PR, I can help with that as well; I just wanted to see what your thoughts were.
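
To make the charCodeAt behaviour concrete, here is a minimal sketch using only built-in JavaScript string methods. It does not reproduce the js-yaml dumper code; it only shows why iterating by code unit yields two values for one emoji, while codePointAt yields one.

```javascript
// U+1F600 (grinning face) lies outside the Basic Multilingual Plane.
const s = '\u{1F600}';

// String length and charCodeAt count UTF-16 code units, so the emoji is two units.
const units = [];
for (let i = 0; i < s.length; i++) {
  units.push(s.charCodeAt(i).toString(16).toUpperCase());
}
console.log(units); // [ 'D83D', 'DE00' ]

// Array.from iterates by code point, and codePointAt reads the whole code point.
const points = Array.from(s, (ch) => ch.codePointAt(0).toString(16).toUpperCase());
console.log(points); // [ '1F600' ]
```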

@puzrin
Member

puzrin commented Sep 5, 2017

What does the YAML spec say about encoding astrals?

@puzrin
Member

puzrin commented Sep 5, 2017

http://www.yaml.org/spec/1.2/spec.html#Characters - it seems the spec does not allow what you requested.

@tech4him1
Contributor Author

@puzrin I'm sorry, I don't understand which part of the spec that you linked means that astrals must be encoded as surrogate pairs. Could you explain to me why YAML astrals have to be surrogate pairs?

@puzrin
Member

puzrin commented Sep 5, 2017

There is a list of allowed codes. Astrals are not there.

To ensure readability, YAML streams use only the printable subset of the Unicode character set. The allowed character range explicitly excludes the C0 control block #x0-#x1F (except for TAB #x9, LF #xA, and CR #xD which are allowed), DEL #x7F, the C1 control block #x80-#x9F (except for NEL #x85 which is allowed), the surrogate block #xD800-#xDFFF, #xFFFE, and #xFFFF.

@tech4him1
Contributor Author

tech4him1 commented Sep 5, 2017

The section that you quoted says that this is the excluded range:

The allowed character range explicitly excludes...

I thought that means that those (control block and surrogate block) are the non-printable characters which must be escaped, and the rest could be printed (including astrals)? Is there something that I am misunderstanding here?
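
As a rough illustration of this reading, the quoted paragraph can be turned into a code-point check. This is only a sketch of the exclusions listed in that one paragraph, not of the full YAML grammar; the function name is hypothetical, and the upper bound of 0x10FFFF is the Unicode limit rather than something the quote states.

```javascript
// Hypothetical helper: true if the code point falls outside every range
// that the quoted spec paragraph excludes.
function isAllowedByQuotedRule(cp) {
  if (cp === 0x09 || cp === 0x0A || cp === 0x0D) return true; // TAB, LF, CR are allowed
  if (cp <= 0x1F) return false;                   // rest of the C0 control block
  if (cp === 0x7F) return false;                  // DEL
  if (cp === 0x85) return true;                   // NEL is allowed
  if (cp >= 0x80 && cp <= 0x9F) return false;     // rest of the C1 control block
  if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate block
  if (cp === 0xFFFE || cp === 0xFFFF) return false;
  return cp <= 0x10FFFF;                          // assumption: Unicode upper bound
}

console.log(isAllowedByQuotedRule(0xD83D));  // false: a lone surrogate half
console.log(isAllowedByQuotedRule(0x1F600)); // true: the astral code point itself
```

Under this reading, the two halves of the surrogate pair are excluded individually, but the astral code point they represent is not on the exclusion list.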

@puzrin
Member

puzrin commented Sep 5, 2017

That's the reason why surrogates are hex-encoded. And there is no guarantee that astrals will be printable, because that depends on Unicode support in the OS.

In short: I can understand your request and have no objections in principle if it follows the spec. But someone should contact the spec authors to clarify the details.

@tech4him1
Contributor Author

tech4him1 commented Sep 5, 2017

@puzrin Thank you. Would you be willing to have an option whether to encode as a 32-bit escape instead of surrogate pairs ("\U0001F600" instead of "\uD83D\uDE00"), or do you think that would be affected by the same problem (cross-platform Unicode support)?
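
The difference between the two escape forms mentioned here can be sketched as follows. The helper names are hypothetical and are not part of the js-yaml API; the sketch only shows how each form is derived from the same character.

```javascript
// Hypothetical helpers for illustration only, not js-yaml functions.
function hex(n, width) {
  return n.toString(16).toUpperCase().padStart(width, '0');
}

// Escape each UTF-16 code unit separately, using the 4-digit \u form.
function escapeAsSurrogatePair(ch) {
  let out = '';
  for (let i = 0; i < ch.length; i++) {
    out += '\\u' + hex(ch.charCodeAt(i), 4);
  }
  return out;
}

// Escape the whole code point once, using the 8-digit \U form.
function escapeAsCodePoint(ch) {
  return '\\U' + hex(ch.codePointAt(0), 8);
}

console.log(escapeAsSurrogatePair('\u{1F600}')); // \uD83D\uDE00
console.log(escapeAsCodePoint('\u{1F600}'));     // \U0001F600
```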

@puzrin
Member

puzrin commented Sep 5, 2017

I don't care at all :). If someone gives a ready recipe for how to do it without breaking the spec, it should not be difficult to implement. But I can't participate in this investigation - I have other projects to work on.

@tech4him1
Contributor Author

tech4him1 commented Sep 5, 2017

@puzrin Thank you for your time, really appreciate it! I will get a PR ready.

tech4him1 added a commit to decaporg/decap-cms that referenced this issue Sep 11, 2017
The main fix we are wanting is outputting astral characters (emojis) as
a single escape instead of surrogate pairs: nodeca/js-yaml#368.
Benaiah pushed a commit to decaporg/decap-cms that referenced this issue Sep 16, 2017
* Upgrade `js-yaml` to 3.10.0.

The main fix we are wanting is outputting astral characters (emojis) as
a single escape instead of surrogate pairs: nodeca/js-yaml#368.

* Upgrade `preliminaries` front-matter parser (and dependencies).
rlidwka added a commit that referenced this issue Dec 1, 2020
@puzrin
Member

puzrin commented Dec 1, 2020

UPD: as of 4.0, surrogate pairs are no longer escaped (emojis pass through as is).
