pyyaml does not support literals in unicode over codepoint 0xffff #25

Closed
kitterma opened this issue Jan 16, 2016 · 20 comments

Comments

@kitterma

See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=806826

The YAML spec says:

“The allowed character range explicitly excludes the surrogate
block #xD800-#xDFFF, DEL #x7F, the C0 control block #x0-#x1F
(except for #x9, #xA, and #xD), the C1 control block #x80-#x9F,
#xFFFE, and #xFFFF.”

However, pyyaml negates that check and applies it only to plane 0
(the Basic Multilingual Plane). As a result, any YAML document that
contains literal Unicode characters from higher planes fails to parse,
and on output such characters are written in the rather unfriendly
\Uxxxxxxxx escape format.

The attached patch fixes this in a minimally intrusive way, by
extending the checks to cover the additional codepoints where
appropriate. A better fix would be to use the check as the spec
specifies it, but that would be a bigger change.
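
As a rough illustration of what checking "as the spec specifies it" could look like, the sketch below lists the printable ranges positively instead of negating a plane-0 character class. The names `PRINTABLE_RANGES` and `is_printable` are hypothetical and are not part of pyyaml; the ranges are taken from the spec's printable-character production, and the code assumes Python 3.

```python
# Hypothetical helper, not pyyaml code: the printable ranges from the YAML
# spec, including the supplementary planes U+10000..U+10FFFF.
PRINTABLE_RANGES = (
    (0x09, 0x09), (0x0A, 0x0A), (0x0D, 0x0D),  # tab, line feed, carriage return
    (0x20, 0x7E),                              # printable ASCII
    (0x85, 0x85),                              # next line (NEL)
    (0xA0, 0xD7FF),                            # plane 0 below the surrogates
    (0xE000, 0xFFFD),                          # plane 0 above the surrogates
    (0x10000, 0x10FFFF),                       # planes 1 through 16
)


def is_printable(ch):
    """Return True if the single character ch lies in an allowed range."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in PRINTABLE_RANGES)
```

With this shape, a character such as U+1F44D is accepted because it falls in the last range, while DEL, the C1 controls, the surrogates, U+FFFE and U+FFFF are rejected simply because no range covers them.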

Index: pyyaml-3.11/lib/yaml/emitter.py

--- pyyaml-3.11.orig/lib/yaml/emitter.py
+++ pyyaml-3.11/lib/yaml/emitter.py
@@ -8,9 +8,13 @@

 __all__ = ['Emitter', 'EmitterError']

+import sys
+
 from error import YAMLError
 from events import *

+has_ucs4 = sys.maxunicode > 0xffff
+
 class EmitterError(YAMLError):
     pass

@@ -701,7 +705,8 @@ class Emitter(object):
                 line_breaks = True
             if not (ch == u'\n' or u'\x20' <= ch <= u'\x7E'):
                 if (ch == u'\x85' or u'\xA0' <= ch <= u'\uD7FF'
-                        or u'\uE000' <= ch <= u'\uFFFD') and ch != u'\uFEFF':
+                        or u'\uE000' <= ch <= u'\uFFFD'
+                        or ((not has_ucs4) or (u'\U00010000' <= ch < u'\U0010ffff'))) and ch != u'\uFEFF':
                     unicode_characters = True
                     if not self.allow_unicode:
                         special_characters = True

Index: pyyaml-3.11/lib/yaml/reader.py

--- pyyaml-3.11.orig/lib/yaml/reader.py
+++ pyyaml-3.11/lib/yaml/reader.py
@@ -19,7 +19,9 @@ __all__ = ['Reader', 'ReaderError']

 from error import YAMLError, Mark

-import codecs, re
+import codecs, re, sys
+
+has_ucs4 = sys.maxunicode > 0xffff

 class ReaderError(YAMLError):

@@ -134,7 +136,10 @@ class Reader(object):
             self.encoding = 'utf-8'
             self.update(1)

-    NON_PRINTABLE = re.compile(u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')
+    if has_ucs4:
+        NON_PRINTABLE = re.compile(u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010ffff]')
+    else:
+        NON_PRINTABLE = re.compile(u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')
     def check_printable(self, data):
         match = self.NON_PRINTABLE.search(data)
         if match:

Index: pyyaml-3.11/lib3/yaml/emitter.py

--- pyyaml-3.11.orig/lib3/yaml/emitter.py
+++ pyyaml-3.11/lib3/yaml/emitter.py
@@ -698,7 +698,8 @@ class Emitter:
                 line_breaks = True
             if not (ch == '\n' or '\x20' <= ch <= '\x7E'):
                 if (ch == '\x85' or '\xA0' <= ch <= '\uD7FF'
-                        or '\uE000' <= ch <= '\uFFFD') and ch != '\uFEFF':
+                        or '\uE000' <= ch <= '\uFFFD'
+                        or '\U00010000' <= ch < '\U0010ffff') and ch != '\uFEFF':
                     unicode_characters = True
                     if not self.allow_unicode:
                         special_characters = True

Index: pyyaml-3.11/lib3/yaml/reader.py

--- pyyaml-3.11.orig/lib3/yaml/reader.py
+++ pyyaml-3.11/lib3/yaml/reader.py
@@ -134,7 +134,7 @@ class Reader(object):
             self.encoding = 'utf-8'
             self.update(1)

-    NON_PRINTABLE = re.compile('[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')
+    NON_PRINTABLE = re.compile('[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010ffff]')
     def check_printable(self, data):
         match = self.NON_PRINTABLE.search(data)
         if match:
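
To make the effect of the reader change concrete, here is a small standalone sketch (Python 3, standard library only) that runs the old and the patched character classes from the lib3 `reader.py` hunk against a string containing an emoji. The variable names `OLD_NON_PRINTABLE`, `NEW_NON_PRINTABLE` and `text` are made up for this illustration; only the two patterns come from the patch.

```python
import re

# The two character classes from the lib3 reader.py change in this patch.
OLD_NON_PRINTABLE = re.compile(
    '[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')
NEW_NON_PRINTABLE = re.compile(
    '[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010ffff]')

text = 'status: \U0001F44D'  # U+1F44D, a code point above U+FFFF

# The old class treats the emoji as non-printable, so a search finds it.
old_match = OLD_NON_PRINTABLE.search(text)

# The patched class allows U+10000..U+10FFFF, so nothing is flagged.
new_match = NEW_NON_PRINTABLE.search(text)

print(old_match is not None)
print(new_match is None)
```

Both patterns still flag characters the spec excludes, such as DEL or the C1 controls, since the patch only adds the supplementary-plane range.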

@mortoray

Any updates on this? It's a significant limitation for loading Unicode YAML files, since most emoji are above U+FFFF and are commonly used now.

@adamchainz

+1, had a crash on encountering an emoji someone wrote in a YAML config file

BrianMitchL added a commit to BrianMitchL/weatherBot that referenced this issue Apr 9, 2017
@samdmarshall

Going to add my voice to this as wanting to see this addressed as well.

@peterkmurphy
Contributor

peterkmurphy commented May 6, 2017 via email

@samdmarshall

@peterkmurphy is the patch in the original post (albeit poorly formatted for display on GitHub) not sufficient for submitting a PR?

@peterkmurphy
Contributor

peterkmurphy commented May 6, 2017 via email

peterkmurphy added a commit to peterkmurphy/pyyaml that referenced this issue May 8, 2017
@peterkmurphy
Contributor

peterkmurphy commented May 8, 2017 via email

@adamchainz

@peterkmurphy yes more tests!

@adamchainz

yaml.load('👍')

@peterkmurphy
Contributor

peterkmurphy commented May 9, 2017 via email

@adamchainz

Can you open a PR so it's easier to review?

@peterkmurphy
Contributor

peterkmurphy commented May 9, 2017 via email

@adamchainz

With a PR, we can comment on the code directly. Btw, I'm in favour of removing the tests; I think you're fixing the behaviour.

@peterkmurphy
Contributor

peterkmurphy commented May 9, 2017 via email

@peterkmurphy
Contributor

peterkmurphy commented May 10, 2017 via email

adamchainz pushed a commit to adamchainz/pyyaml that referenced this issue May 16, 2017
adamchainz pushed a commit to adamchainz/pyyaml that referenced this issue May 16, 2017
@matt-ward

So, is this fix getting pulled in?

@jlevy

jlevy commented Nov 24, 2017

Just thought I'd share, in case others are in the same boat: after patching and working around this for a while, we realized that https://pypi.python.org/pypi/ruamel.yaml handles higher Unicode code points just fine, and it has worked out well for us.

@adamchainz

this was merged in #63 but still hasn't been released :(

@TZanke

TZanke commented Apr 24, 2018

It took me hours to track down the source of our server-to-server communication issue.

Why does pyyaml dump a Unicode control character (Unicode: \x88, UTF-8: \xc2\x88) into a YAML string, but refuse to load the same character?

This is just ridiculous: a converter that sometimes works in one direction only.

Edit: What's the reason I see Unicode in the YAML instead of UTF-8 (encoding='utf-8')? This also doesn't look correct to me.

jpadilla/django-rest-framework-yaml#7
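
One plausible reading of the load failure described above, offered as an assumption rather than something confirmed in this thread, is that U+0088 sits in the C1 control block that the spec excludes, so the reader's character class flags it when it appears as a raw character. The sketch below uses only the standard library and the patched pattern from this issue; `raw`, `escaped` and the other names are made up for illustration.

```python
import re

# Patched character class from the reader.py change discussed in this issue.
NON_PRINTABLE = re.compile(
    '[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010ffff]')

# A raw U+0088 is in the excluded C1 block (U+0080..U+009F), so it is flagged.
raw = 'value: \x88'
raw_match = NON_PRINTABLE.search(raw)
raw_rejected = raw_match is not None and raw_match.group() == '\x88'

# The escaped spelling is a backslash followed by 'x', '8', '8': four
# printable ASCII characters, so the same check lets it through.
escaped = 'value: "\\x88"'
escaped_accepted = NON_PRINTABLE.search(escaped) is None

print(raw_rejected, escaped_accepted)
```

Under that assumption, the asymmetry would come from the control character being written unescaped, not from the encoding of the stream.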

ingydotnet pushed a commit that referenced this issue Mar 14, 2019
@perlpunk
Member

Fixed by #63
