Fix reader for Unicode code points over 0xFFFF #351

Merged
merged 1 commit into from Dec 8, 2019

Commits on Nov 20, 2019

  1. Fix reader for Unicode code points over 0xFFFF

    This patch fixes the handling of inputs with Unicode code points over
    0xFFFF when running on a Python 2 that does not have UCS-4 support
    (which certain distributions still ship, e.g. macOS).
    
    When Python is compiled without UCS-4 support, it uses UCS-2. In this
    situation, non-BMP Unicode characters, which have code points over
    0xFFFF, are represented as surrogate pairs. For example, if we take
    u'\U0001f3d4', it will be represented as the surrogate pair
    u'\ud83c\udfd4'. This can be seen by running, for example:
    
        [i for i in u'\U0001f3d4']
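The mapping from a non-BMP code point to its surrogate pair is plain UTF-16 arithmetic. A minimal sketch follows; the helper name `to_surrogate_pair` is hypothetical and not part of PyYAML:

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only; not part of PyYAML.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F3D4)
pair_hex = (hex(high), hex(low))       # ('0xd83c', '0xdfd4')
```

This reproduces the pair u'\ud83c\udfd4' quoted above for u'\U0001f3d4'.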
    
    In PyYAML, the reader uses a function `check_printable` to validate
    inputs, making sure that they only contain printable characters. Prior
    to this patch, on UCS-2 builds, it incorrectly identified surrogate
    pairs as non-printable.
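The old behavior can be approximated with a sketch. It assumes the pre-patch pattern was simply the complement of the printable character class quoted later in this message; the name `OLD_NON_PRINTABLE` is made up for illustration. The surrogate pair is spelled out as two code units so that the example behaves the same on any build:

```python
import re

# Assumption: the old check is the negation of the printable class quoted
# later in this message, so every surrogate code unit counts as non-printable.
OLD_NON_PRINTABLE = re.compile(
    u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]'
)

# Explicit surrogate pair for U+1F3D4, as a UCS-2 build would store it.
pair = u'\ud83c\udfd4'

match = OLD_NON_PRINTABLE.search(u'ok ' + pair)
flagged_position = match.start()   # index of the surrogate high
flagged_character = match.group()  # the surrogate high on its own
```

Under that assumption, the first half of a perfectly valid pair is reported as the offending character.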
    
    It would be fairly natural to write a regular expression that matches
    strings containing only *printable* characters, rather than one that
    searches for *non-printable* characters. Using the old code's notion of
    printable (so not yet allowing surrogate pairs), this would be:
    
        PRINTABLE = re.compile(u'^[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]*$')
    
    Adding support for surrogate pairs to this would be straightforward:
    we add an alternative that matches a surrogate high followed by a
    surrogate low (`[\uD800-\uDBFF][\uDC00-\uDFFF]`):
    
        PRINTABLE = re.compile(u'^(?:[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$')
    
    Then, this regex could be used as follows:
    
        def check_printable(self, data):
            if not self.PRINTABLE.match(data):
                raise ReaderError(...)
    
    However, matching printable strings, rather than searching for
    non-printable characters as the code currently does, has a
    disadvantage: it does not identify the culprit character. A failed
    match yields neither the position nor the offending non-printable
    character.
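The drawback can be seen in a small sketch that uses the surrogate-aware PRINTABLE pattern from above. The function name `is_printable` and the sample strings are illustrative only, and surrogates are written as explicit code units to emulate a UCS-2 build:

```python
import re

# The surrogate-aware whole-string pattern quoted above.
PRINTABLE = re.compile(
    u'^(?:[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]'
    u'|[\uD800-\uDBFF][\uDC00-\uDFFF])*$'
)


def is_printable(data):
    """Return True if the whole string consists of printable characters."""
    return PRINTABLE.match(data) is not None


good = u'mountain \ud83c\udfd4'   # valid surrogate pair
bad = u'mountain \ud83c'          # surrogate high with no surrogate low

good_ok = is_printable(good)      # True
bad_match = PRINTABLE.match(bad)  # None: no position, no offending character
```

A failed match is just `None`, so there is nothing from which to build an error message that points at the bad character.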
    
    Instead, we can modify the NON_PRINTABLE regex to allow legal surrogate
    pairs. We do this by removing the surrogate code points from the
    existing non-printable character set and adding the following
    alternatives for illegal uses of surrogate code points:
    
    - Surrogate low that doesn't follow a surrogate high (either a surrogate
      low at the start of a string, or a surrogate low that follows a
      character that's not a surrogate high):
    
        (?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]
    
    - Surrogate high that isn't followed by a surrogate low (either a
      surrogate high at the end of a string, or a surrogate high that is
      followed by a character that's not a surrogate low):
    
        [\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)
    
    The behavior of this modified regex should match the one that is used
    when Python is built with UCS-4 support.
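Putting the pieces together, the search-based check might look like the sketch below. The base character class is an assumption derived from the PRINTABLE pattern above, with the surrogate range no longer excluded; `find_non_printable` is a hypothetical helper, not PyYAML's API. Surrogates are written as explicit code units to emulate a UCS-2 build. On a build where a non-BMP character is a single code point, this particular pattern would flag that character through its first alternative.

```python
import re

# Assumed combination of the base class and the two alternatives above.
NON_PRINTABLE = re.compile(
    # Anything outside the printable ranges; surrogates are handled below.
    u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uFFFD]'
    # Surrogate low that does not follow a surrogate high.
    u'|(?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]'
    # Surrogate high that is not followed by a surrogate low.
    u'|[\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)'
)


def find_non_printable(data):
    """Return (position, matched text) for the first problem, or None."""
    match = NON_PRINTABLE.search(data)
    if match is None:
        return None
    return match.start(), match.group()


valid = find_non_printable(u'a\ud83c\udfd4b')   # None: legal pair
lone_high = find_non_printable(u'ab\ud83c')     # (2, u'\ud83c')
lone_low = find_non_printable(u'\udfd4a')       # (0, u'\udfd4')
control = find_non_printable(u'a\x00b')         # (1, u'\x00')
```

Because the two surrogate alternatives consume a neighboring character when one is present, the matched text can be two code units long (for example, u'a\udfd4' yields a match starting at the 'a'). Error reporting built on this would need to account for that.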
    anishathalye committed Nov 20, 2019
    937f207