New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix handling of surrogates on decoding #550
Merged
hugovk
merged 1 commit into
ultrajson:main
from
JustAnotherArchivist:fix-decode-surrogates
Jun 16, 2022
Merged
Fix handling of surrogates on decoding #550
hugovk
merged 1 commit into
ultrajson:main
from
JustAnotherArchivist:fix-decode-surrogates
Jun 16, 2022
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This implements surrogate handling on decoding as it is in the standard library. Lone escaped surrogates and any raw surrogates in the input result in surrogates in the output, and escaped surrogate pairs get decoded into non-BMP characters. Note that raw surrogate pairs get treated differently on platforms/compilers with 16-bit `wchar_t`, e.g. Microsoft Windows.
Some quick benchmarks (main = 4ac30c9 vs e0e5db9):
The raw surrogate timing is a bit surprising to me since nothing in that code path changed. But it is reproducibly faster. |
bwoodsend
approved these changes
Jun 15, 2022
adrianeboyd
added a commit
to adrianeboyd/srsly
that referenced
this pull request
Jul 20, 2022
adrianeboyd
added a commit
to adrianeboyd/srsly
that referenced
this pull request
Jul 20, 2022
adrianeboyd
added a commit
to adrianeboyd/srsly
that referenced
this pull request
Jul 20, 2022
adrianeboyd
added a commit
to explosion/srsly
that referenced
this pull request
Jul 20, 2022
adrianeboyd
added a commit
to adrianeboyd/srsly
that referenced
this pull request
Jul 18, 2023
adrianeboyd
added a commit
to adrianeboyd/srsly
that referenced
this pull request
Jul 18, 2023
adrianeboyd
added a commit
to adrianeboyd/srsly
that referenced
this pull request
Jul 18, 2023
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This implements surrogate handling on decoding as it is in the standard library. Lone escaped surrogates and any raw surrogates in the input result in surrogates in the output, and escaped surrogate pairs get decoded into non-BMP characters. Note that raw surrogate pairs get treated differently on platforms/compilers with 16-bit
wchar_t
, e.g. Microsoft Windows.Before this, well-formed JSON using surrogates only in encoded surrogate pairs was decoded correctly, but anything else containing surrogates would produce unexpected results or errors. This change makes ujson's decoding compatible with the standard library's.
Unfortunately, platforms with a 16-bit
wchar_t
cannot handle raw surrogate pairs correctly becausePyUnicode_FromWideChar
decodes those. That issue was always present (but untested) and is left unfixed here. I'm not sure it is possible to handle it correctly without completely changing the decoding approach, e.g. producing UTF-8 and usingPyUnicode_FromStringAndSize
(which, by the way, is what orjson does, though it rejects lone surrogates).