Replace `wchar_t` string decoding implementation with a `uint32_t`-based one #555

JustAnotherArchivist · 2022-06-18T23:53:27Z

This fixes character handling on platforms with 16-bit wchar_t (notably, Windows), which was broken (in different ways) on both CPython and PyPy.

Fixes #552
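
For context, here is a minimal sketch of why a 16-bit code unit cannot hold every character on its own. It uses only the Python standard library and is not ujson's decoder; the helper name `combine_surrogates` is hypothetical and exists purely for illustration.

```python
import struct


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


ch = "\U0001F600"  # a character outside the Basic Multilingual Plane

# As UTF-16, the character needs two 16-bit code units (a surrogate pair).
utf16_units = struct.unpack("<2H", ch.encode("utf-16-le"))

# As UTF-32, the same character fits in a single 32-bit unit.
utf32_units = struct.unpack("<1I", ch.encode("utf-32-le"))

print([hex(u) for u in utf16_units])
print([hex(u) for u in utf32_units])
print(hex(combine_surrogates(*utf16_units)))
```

With 32-bit units, each decoded character occupies exactly one slot, so no pairing step is needed when the buffer is handed over to the Python string constructor.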

Remarks:

- For the disgusting `Py_UCS4 == JSUINT32` check magic, see the comments on #552 (Surrogates fix fails tests with PyPy on Windows).
- The changelog might need a little touch-up here, as this is essentially a continuation of #550 (Fix handling of surrogates on decoding).
- I have not run any performance comparisons yet. In general, I would expect it to perform at least as well as the previous implementation, since `PyUnicode_FromWideChar` does some extra work compared to `PyUnicode_FromKindAndData` (mostly due to surrogate handling). On 16-bit `wchar_t` platforms, the larger buffer size might have some impact, but I don't think I'll be able to run comparisons for that.
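
To put the buffer-size remark into numbers, here is a rough sketch that queries the platform's `wchar_t` width through `ctypes` and compares it with a 32-bit unit. It assumes one buffer slot per decoded code unit, which the description above does not spell out; `buffer_bytes` is a hypothetical helper, not part of ujson.

```python
import ctypes

# Width of the C wchar_t type on the current platform, in bytes
# (typically 2 on Windows and 4 on Linux and macOS).
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# Width of a fixed 32-bit unsigned integer, in bytes.
uint32_size = ctypes.sizeof(ctypes.c_uint32)


def buffer_bytes(n_units: int, unit_size: int) -> int:
    """Bytes needed for a buffer of n_units slots of unit_size bytes each."""
    return n_units * unit_size


n = 1000  # assumed number of slots, for illustration only
print("wchar_t buffer: ", buffer_bytes(n, wchar_size), "bytes")
print("uint32_t buffer:", buffer_bytes(n, uint32_size), "bytes")
```

On a platform where `wchar_t` is already 32 bits wide, the two figures are identical; only 16-bit `wchar_t` platforms would see the buffer double.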

codecov-commenter · 2022-06-18T23:54:48Z

Codecov Report

Merging #555 (bc7bdff) into main (cc70119) will increase coverage by 0.03%.
The diff coverage is 90.90%.

@@            Coverage Diff             @@
##             main     #555      +/-   ##
==========================================
+ Coverage   91.81%   91.84%   +0.03%     
==========================================
  Files           6        6              
  Lines        1856     1852       -4     
==========================================
- Hits         1704     1701       -3     
+ Misses        152      151       -1

| Impacted Files | Coverage Δ |
|---|---|
| tests/test_ujson.py | `99.45% <ø> (+0.17%)` ⬆️ |
| lib/ultrajsondec.c | `92.56% <90.00%> (ø)` |
| python/JSONtoObj.c | `88.04% <100.00%> (ø)` |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

bwoodsend

Nice. I was expecting a replacement of all strings to be a much bigger, scarier looking change set.

JustAnotherArchivist · 2022-06-20T18:24:11Z

Yeah, much of the code essentially assumed 32-bit ints already for proper operation, so not many changes were needed at all.

Also, just realised I forgot about the benchmarks. Some quick tests right now indicate that it's marginally faster than the previous code, by a couple of per cent or so.

bwoodsend reviewed Jun 19, 2022

Replace wchar_t string decoding implementation with a uint32_t-based one

bc7bdff

This fixes character handling on platforms with 16-bit wchar_t (notably, Windows), which was broken (in different ways) on both CPython and PyPy. Fixes ultrajson#552

JustAnotherArchivist force-pushed the fix-decode-surrogates-2 branch from eb9c5c1 to bc7bdff Compare June 19, 2022 23:11

hugovk added the "changelog: Fixed" label (for any bug fixes) Jun 20, 2022

bwoodsend approved these changes Jun 20, 2022

hugovk merged commit 67ec071 into ultrajson:main Jun 20, 2022

sync-by-unito bot mentioned this pull request Jul 11, 2022

Bump ujson from 4.3.0 to 5.4.0 in /sample-projects/streaming-audio/FastAPI/live-transcription-fastapi deepgram/deepgram-python-sdk#28

Closed

jhe921 mentioned this pull request Jul 14, 2022

Update UltraJSON explosion/srsly#65

Closed

adrianeboyd added a commit to adrianeboyd/srsly that referenced this pull request Jul 20, 2022

Backport "Replace wchar_t string decoding implementation with a `uint32_t`-based one"

c2b4432

Backport ultrajson/ultrajson#555

adrianeboyd mentioned this pull request Jul 20, 2022

Backport "Replace wchar_t string decoding implementation with a uint32_t-based one" explosion/srsly#67

Merged

adrianeboyd added a commit to explosion/srsly that referenced this pull request Jul 20, 2022

Backport "Replace wchar_t string decoding implementation with a `uint32_t`-based one" (#67)

febb6f2

Backport ultrajson/ultrajson#555

JustAnotherArchivist mentioned this pull request Dec 19, 2022

Supporting builds with Python's Limited API #574

Closed

adrianeboyd added a commit to adrianeboyd/srsly that referenced this pull request Jul 18, 2023

Backport "Replace wchar_t string decoding implementation with a `uint32_t`-based one" (explosion#67)

96196ec

Backport ultrajson/ultrajson#555

adrianeboyd added a commit to adrianeboyd/srsly that referenced this pull request Jul 18, 2023

Backport "Replace wchar_t string decoding implementation with a `uint32_t`-based one" (explosion#67)

b278709

Backport ultrajson/ultrajson#555

adrianeboyd added a commit to adrianeboyd/srsly that referenced this pull request Jul 18, 2023

Backport "Replace wchar_t string decoding implementation with a `uint32_t`-based one" (explosion#67)

54c69db

Backport ultrajson/ultrajson#555
