Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine surrogate pairs into one escape sequence when dumping. #369

Merged
merged 1 commit into from Sep 11, 2017

Conversation

tech4him1
Copy link
Contributor

@tech4him1 tech4him1 commented Sep 5, 2017

When an astral character is encountered while dumping to a double-quoted string, save it as an 32-bit escaped Unicode sequence instead of as a surrogate pair (\U0001F600 instead of \uD83D\uDE00).

Closes #368.

@tech4him1 tech4him1 force-pushed the combine-dumping-surrogate-pairs branch from e1b3f9a to 1393e25 Compare September 5, 2017 18:26
@tech4him1 tech4him1 changed the title Combine surrogate pairs into one escape sequence when encoding. WIP: Add option to combine surrogate pairs into one escape sequence when dumping. Sep 5, 2017
@puzrin
Copy link
Member

puzrin commented Sep 5, 2017

Please don't forget to provide proofs that such encoding is valid.

@tech4him1
Copy link
Contributor Author

@puzrin I'm sorry, are you wanting proof that it is valid according to the spec, or that this code produces valid output?

@puzrin
Copy link
Member

puzrin commented Sep 6, 2017

that it is valid according to the yaml spec

@puzrin
Copy link
Member

puzrin commented Sep 6, 2017

Alternative could be to leave surrogates unencoded. Then after string will be saved in utf-8 (in file, for example), characters become "native".

But this also requires investigation that such thing is allowed in yaml.

@tech4him1
Copy link
Contributor Author

tech4him1 commented Sep 6, 2017

Encoding astral characters as 32-bit escape sequences (like \U0001F600):

  • This method is provided in the YAML spec here: http://www.yaml.org/spec/1.2/spec.html#ns-esc-32-bit

  • Also, this is the method that PyYAML uses:

    >>> import yaml
    >>> print(yaml.dump({'testkey': "🐣 🌼 🌷 🌱 🌳 🍃 🌈"}))
    testkey: "\U0001F423 \U0001F33C \U0001F337 \U0001F331 \U0001F333 \U0001F343 \U0001F308"
  • It is also the method go-yaml and the reference parser uses.

  • Here is a reply I received from Leon Timmermans on the YAML core mailing list (https://sourceforge.net/p/yaml/mailman/message/36026528/):

    Javascript (and a few other languages with UTF-16 implementation details
    leaking out) has a tendency to treat such characters as two surrogates
    ("\uD83D\uDCA9"), instead of as a single character ("\U0001F4A9"). Quite
    frankly I think this is unhelpful and wrong, but JSON actually made it a
    standard -_-.

    The YAML spec explicitly bans literal surrogate pairs, but is silent on escaped surrogates. Nothing in it suggests they are supported, except the suggestion of JSON compatibility. \U on the other hand is required to be supported. I don't think putting a literal astral printable character is erroneous, but quoted is probably safer whenever possible.

Is that what you were wanting as proof, or something else?

@puzrin
Copy link
Member

puzrin commented Sep 6, 2017

Exactly. References are solid. Thank you.

One more question. Could this be a problem with JSON in JS? (i think - no)

@tech4him1
Copy link
Contributor Author

Partially, yes. There are two related things here that are causing this in JS and JSON.

The first is that JavaScript internally uses either UCS-2 or UTF-16 (https://mathiasbynens.be/notes/javascript-encoding). So whenever you use string.charCodeAt(i) it will return the surrogate pairs, as that is how they are actually stored in the string. ES6 introduces string.codePointAt(i) which takes surrogate pairs into account, but I didn't think you wanted to use ES6 here, so as to keep js-yaml backwards compatible.

The second issue is that in JSON, astral characters are supposed to be encoded as surrogate pairs (see JSON spec section "9 String"), as JSON does not have long escape sequences. This does make it confusing to convert between JSON and YAML.

@tech4him1
Copy link
Contributor Author

@puzrin Also, do you think this should be an option, or the default? If an option, do you have an idea what it should be named (maybe astralsAsSurrogates)?

@puzrin
Copy link
Member

puzrin commented Sep 6, 2017

I think - default will be good enougth.

@tech4him1 tech4him1 changed the title WIP: Add option to combine surrogate pairs into one escape sequence when dumping. Combine surrogate pairs into one escape sequence when dumping. Sep 6, 2017
@tech4him1
Copy link
Contributor Author

tech4him1 commented Sep 6, 2017

@puzrin Also, the loader should already be able to parse these 32-bit sequences, correct? https://github.com/tech4him1/js-yaml/blob/1393e2539556776899d274cce2cc972b5b9ae69f/lib/js-yaml/loader.js#L114-L119

It works for me, I just wanted to check with you as well.

@tech4him1 tech4him1 force-pushed the combine-dumping-surrogate-pairs branch from 1393e25 to 640f868 Compare September 6, 2017 16:41
@tech4him1
Copy link
Contributor Author

This PR should be ready to merge, unless you have any changes that you want made.

@puzrin
Copy link
Member

puzrin commented Sep 11, 2017

published

@tech4him1 tech4him1 deleted the combine-dumping-surrogate-pairs branch September 11, 2017 13:24
@tech4him1
Copy link
Contributor Author

Thank you!

minj added a commit to minj/js-yaml that referenced this pull request Apr 28, 2018
* master: (58 commits)
  Check for leading newlines when determining if block indentation indicator is needed (nodeca#404)
  Add property based tests to assess load reverses dump (nodeca#398)
  3.11.0 released
  Browser files rebuild
  Dumper: fix negative integers in bin/octal/hex formats, close nodeca#399
  support es6 arrow functions, fixes nodeca#389 (nodeca#393)
  Fix typo in README.md (nodeca#373)
  3.10.0 released
  Browser files rebuild
  Add test for astrals dump
  Combine surrogate pairs into one escape sequence when encoding. (nodeca#369)
  Fix condenseFlow for objects (nodeca#371)
  correct spelling mistake (nodeca#367)
  More meaningful error for loader (nodeca#361)
  Fix typo and format code. (nodeca#365)
  3.9.1 released
  Browser files rebuild
  Ensure stack is present for custom errors (fixes nodeca#351) (nodeca#360)
  3.9.0 released
  Browser files rebuild
  ...
@rlidwka
Copy link
Member

rlidwka commented Dec 1, 2020

This code was changed in 0b2e702, astral characters are no longer escaped, see discussion in #587.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support encoding emojis without encoding as surrogate pairs
3 participants