Fix handling of surrogate pseudocharacters under Python 3. #284
Conversation
This is a situation where we have a Python unicode string which doesn't consist entirely of genuine Unicode characters -- some of the codepoints in the string are surrogate codepoints, which occur in a UTF-16 encoding of a string and were also repurposed in PEP 383 for losslessly encoding arbitrary mostly-UTF-8 bytestrings (like Unix filenames) in Python strings. Currently, on Python 3, we cause a UnicodeEncodeError if we try to encode such a string as JSON.

It's not 100% obvious what the right thing to do here is -- this situation seems like it must reflect a bug somewhere else in the program or its environment. But:

* one way we can get such a string is by loading a JSON document (perhaps an invalid JSON document? anyway, we load it without error):

      >>> ujson.dumps(ujson.loads('"\\udcff"'))
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 0: surrogates not allowed

* we already pass these strings through without complaint on Python 2;
* as the included test shows, passing these through matches the behavior of the stdlib's `json` module.

So it seems best to pass them through.

Fixes ultrajson#156.
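For context, a minimal sketch of how such a string can arise in practice and how the stdlib treats it (an illustration, assuming CPython 3 defaults; not part of this PR's diff):

```python
import json

# PEP 383's "surrogateescape" handler maps undecodable bytes to lone
# surrogates -- one common way these strings enter a program.
s = b"\xff".decode("utf-8", "surrogateescape")
assert s == "\udcff"                    # a lone surrogate, not a real character

assert json.dumps(s) == '"\\udcff"'     # the stdlib escapes it on output...
assert json.loads(json.dumps(s)) == s   # ...and the value round-trips
```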
See my PR upstream: ultrajson/ultrajson#284. Fixes #6332.
I'm not sure if passing through is the best approach: stdlib `json` does not pass through but escapes (avoiding invalid characters in the output), see:
In [11]: list(sys.version_info)
Out[11]: [3, 6, 10, 'final', 0]
In [12]: json.dumps('\udcff')
Out[12]: '"\\udcff"'
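To make the distinction concrete (an illustration on my part, not from the PR): the escaped form stays ASCII-safe, while `ensure_ascii=False` reproduces the lone surrogate in the output string, which then fails at UTF-8 encode time:

```python
import json

escaped = json.dumps('\udcff')                  # '"\\udcff"' -- pure ASCII
escaped.encode('utf-8')                         # encodes fine

raw = json.dumps('\udcff', ensure_ascii=False)  # '"\udcff"' -- surrogate kept
# raw.encode('utf-8') would raise UnicodeEncodeError: surrogates not allowed
```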
    #define PyUnicode_AsUTF8String(o) \
      (PyUnicode_AsEncodedString((o), "utf-8", "surrogatepass"))
This code seems unused? If you're aiming for `surrogatepass` as a generic solution, it's a recipe for producing invalid UTF-8:
In [6]: '\udcff'.encode('utf-8', 'surrogatepass').decode('utf-8')
[..]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
Are you aware?
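For reference, a small demonstration (mine, not from the diff) of what `surrogatepass` actually emits and why only the same handler can read it back:

```python
# 'surrogatepass' writes the surrogate's raw three-byte sequence, which
# is not well-formed UTF-8, so a strict decoder rejects it.
raw = '\udcff'.encode('utf-8', 'surrogatepass')
assert raw == b'\xed\xb3\xbf'
assert raw.decode('utf-8', 'surrogatepass') == '\udcff'  # only this handler round-trips
```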
This allows surrogates anywhere in the input, compatible with the json module from the standard library.

This also refactors two interfaces:

- The `PyUnicode` to `char*` conversion is moved into its own function, separated from the `JSONTypeContext` handling, so it can be reused for other things in the future (e.g. indentation and separators) which don't have a type context.
- Converting the `char*` output to a Python string with surrogates intact requires the string length for `PyUnicode_Decode` & Co. While `strlen` could be used, the length is already known inside the encoder, so the encoder function now also takes an extra `size_t` pointer argument to return that and no longer NUL-terminates the string. This also permits output that contains NUL bytes (even though that would be invalid JSON), e.g. if an object's `__json__` method return value were to contain them.

Fixes ultrajson#156
Fixes ultrajson#447
Fixes ultrajson#537
Supersedes ultrajson#284
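A quick behavioral sketch of what this implies for callers (my reading of the description above, not a test from the PR):

```python
import json
import ujson

# Assumption: with this change, ujson escapes lone surrogates the same
# way the stdlib json module does, instead of raising UnicodeEncodeError.
assert ujson.dumps('\udcff') == json.dumps('\udcff') == '"\\udcff"'
assert ujson.loads(ujson.dumps('\udcff')) == '\udcff'
```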
Superseded by #530. Thanks!