Fix handling of surrogates on decoding #550

JustAnotherArchivist · 2022-06-09T18:14:33Z

This implements surrogate handling on decoding the same way the standard library's json module does it. Lone escaped surrogates and any raw surrogates in the input result in surrogates in the output, and escaped surrogate pairs get decoded into non-BMP characters. Note that raw surrogate pairs get treated differently on platforms/compilers with a 16-bit wchar_t, e.g. Microsoft Windows.

Before this, well-formed JSON using surrogates only in encoded surrogate pairs was decoded correctly, but anything else containing surrogates would produce unexpected results or errors. This change makes ujson's decoding compatible with the standard library's.
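For illustration, here is a minimal sketch of the target behaviour, written against the standard library's json module only (it does not call ujson). The helper name decode_examples is hypothetical and exists just for this example; with this change, ujson.loads is intended to return the same values on platforms with a 32-bit wchar_t.

```python
import json


def decode_examples():
    """Return the stdlib decoding results for three surrogate cases."""
    return {
        # Escaped surrogate pair: combined into one non-BMP character.
        "escaped_pair": json.loads('"\\ud83d\\udca9"'),
        # Lone escaped surrogate: kept as a lone surrogate code point.
        "escaped_lone": json.loads('"\\ud83d"'),
        # Raw surrogates already present in the str input: passed through.
        "raw_pair": json.loads('"\ud83d\udca9"'),
    }


results = decode_examples()
print({name: [hex(ord(ch)) for ch in value] for name, value in results.items()})
```

The raw pair stays as two separate surrogate code points, while the escaped pair becomes the single code point U+1F4A9.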

Unfortunately, platforms with a 16-bit wchar_t cannot handle raw surrogate pairs correctly because PyUnicode_FromWideChar decodes those into a single non-BMP character. That issue was always present (but untested) and is left unfixed here. I'm not sure it is possible to handle it correctly without completely changing the decoding approach, e.g. producing UTF-8 and using PyUnicode_FromStringAndSize (which, by the way, is what orjson does, though it rejects lone surrogates).
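To show why a 16-bit buffer loses this distinction, here is a rough Python analogy. It is not the C code in ujson, and utf16_units is a hypothetical helper. Two raw surrogates and one genuine non-BMP character produce identical 16-bit code units, so whatever converts the buffer back into a string cannot tell them apart.

```python
import struct


def utf16_units(text):
    """Return the 16-bit code units of text, letting surrogates through."""
    data = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


raw_pair = "\ud83d\udca9"   # two separate surrogate code points
astral = "\U0001F4A9"       # one non-BMP code point

print([hex(u) for u in utf16_units(raw_pair)])
print([hex(u) for u in utf16_units(astral)])

# Decoding the 16-bit units again merges the raw pair into one character,
# which is analogous to what happens with a 16-bit wchar_t buffer.
merged = raw_pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
print(len(raw_pair), len(merged))
```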


JustAnotherArchivist · 2022-06-09T18:46:42Z

Some quick benchmarks (main = 4ac30c9 vs e0e5db9):

Encoded surrogates: ujson.loads('"' + '\\uD83D\\uDCA9' * 1000 + '"') – slightly faster (16.8 μs with main, 16.4 μs with this)
Raw surrogates: ujson.loads(('"' + '\uD83D\uDCA9' * 1000 + '"').encode('utf-8', 'surrogatepass')) – slightly faster (6.55 μs with main, 6.30 μs with this)
ASCII + encoded surrogates: ujson.loads('"' + 'a\\uD83D\\uDCA9' * 1000 + '"') – slightly faster (20.0 μs with main, 19.7 μs with this)
ASCII: ujson.loads('"' + 'a' * 10000 + '"') – same (20.7 μs)
UTF-8: ujson.loads('"' + '\U0001F4A9' * 1000 + '"') – same (6.4 μs)

The raw surrogate timing is a bit surprising to me since nothing in that code path changed, but it is reproducibly faster.
Lone surrogates can't be compared, since they were broken in various ways on main.
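For anyone who wants to rerun these, below is a rough timeit sketch using the same inputs. The harness itself (CASES, time_case, run_benchmarks, and the number/repeat defaults) is an assumption for illustration, not the script used for the numbers above. It falls back to the standard library's json if ujson is not installed, so absolute timings will differ.

```python
import timeit

try:
    import ujson as decoder  # the library under test
except ImportError:
    import json as decoder  # fallback so the sketch still runs

# Inputs taken from the benchmark list above.
CASES = {
    "encoded surrogates": '"' + "\\uD83D\\uDCA9" * 1000 + '"',
    "raw surrogates": ('"' + "\uD83D\uDCA9" * 1000 + '"').encode("utf-8", "surrogatepass"),
    "ASCII + encoded surrogates": '"' + "a\\uD83D\\uDCA9" * 1000 + '"',
    "ASCII": '"' + "a" * 10000 + '"',
    "UTF-8": '"' + "\U0001F4A9" * 1000 + '"',
}


def time_case(payload, number=1000, repeat=5):
    """Best per-call time in microseconds for decoding payload."""
    timings = timeit.repeat(lambda: decoder.loads(payload), number=number, repeat=repeat)
    return min(timings) / number * 1e6


def run_benchmarks(number=1000, repeat=5):
    """Time every case and return a mapping of case name to microseconds."""
    return {name: time_case(payload, number, repeat) for name, payload in CASES.items()}


if __name__ == "__main__":
    for name, micros in run_benchmarks().items():
        print(f"{name}: {micros:.2f} us")
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the differences are a few percent.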


bwoodsend approved these changes Jun 15, 2022


hugovk added the "changelog: Fixed" label (for any bug fixes) Jun 16, 2022

hugovk merged commit b47c3a7 into ultrajson:main Jun 16, 2022

hugovk mentioned this pull request Jun 16, 2022

Surrogates fix fails tests with PyPy on Windows #552

Closed

JustAnotherArchivist mentioned this pull request Jun 18, 2022

Replace wchar_t string decoding implementation with a uint32_t-based one #555

Merged

sync-by-unito bot mentioned this pull request Jul 11, 2022

Bump ujson from 4.3.0 to 5.4.0 in /sample-projects/streaming-audio/FastAPI/live-transcription-fastapi deepgram/deepgram-python-sdk#28

Closed

adrianeboyd added a commit to adrianeboyd/srsly that referenced this pull request Jul 20, 2022

Backport "Fix handling of surrogates on decoding"

5880cbe

Backport ultrajson/ultrajson#550

adrianeboyd added a commit to adrianeboyd/srsly that referenced this pull request Jul 20, 2022

Backport "Fix handling of surrogates on decoding"

953eafc

Backport ultrajson/ultrajson#550

adrianeboyd added a commit to adrianeboyd/srsly that referenced this pull request Jul 20, 2022

Backport "Fix handling of surrogates on decoding"

9ac2789

Backport ultrajson/ultrajson#550

adrianeboyd mentioned this pull request Jul 20, 2022

Backport "Fix handling of surrogates on decoding" explosion/srsly#66

Merged

adrianeboyd added a commit to explosion/srsly that referenced this pull request Jul 20, 2022

Backport "Fix handling of surrogates on decoding" (#66)

9910607

Backport ultrajson/ultrajson#550

adrianeboyd added a commit to adrianeboyd/srsly that referenced this pull request Jul 18, 2023

Backport "Fix handling of surrogates on decoding" (explosion#66)

2f99b03

Backport ultrajson/ultrajson#550

adrianeboyd added a commit to adrianeboyd/srsly that referenced this pull request Jul 18, 2023

Backport "Fix handling of surrogates on decoding" (explosion#66)

eeb0010

Backport ultrajson/ultrajson#550

adrianeboyd added a commit to adrianeboyd/srsly that referenced this pull request Jul 18, 2023

Backport "Fix handling of surrogates on decoding" (explosion#66)

3fe6d47

Backport ultrajson/ultrajson#550
