Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surrogates fix fails tests with PyPy on Windows #552

Closed
hugovk opened this issue Jun 16, 2022 · 9 comments · Fixed by #555
Closed

Surrogates fix fails tests with PyPy on Windows #552

hugovk opened this issue Jun 16, 2022 · 9 comments · Fixed by #555
Labels
bug Something isn't working

Comments

@hugovk
Copy link
Member

hugovk commented Jun 16, 2022

What did you do?

Built wheels on CI using https://github.com/ultrajson/ultrajson/blob/main/.github/workflows/deploy.yml.

What did you expect to happen?

All wheels built.

What actually happened?

Failed to build PyPy wheels on Windows due to test_decode_surrogate_characters failures:

Before PR #550 was merged, wheels were successfully built.

What versions are you using?

PR #547 is partly to blame for removing PyPy + Windows from the CI.


Temporarily enabling a full Python x OS matrix, all pass before #550:

https://github.com/hugovk/ultrajson/actions/runs/2511636309

After the merge, all pass except pypy-3.7 and pypy3.8 on windows-latest:

https://github.com/hugovk/ultrajson/actions/runs/2511595253

@hugovk hugovk added the bug Something isn't working label Jun 16, 2022
@JustAnotherArchivist
Copy link
Collaborator

Hmm, interesting. And it's not just the wchar_t merging behaviour I mentioned in the PR, but all lone surrogates seem to get replaced with U+FFFD. That's odd.

@JustAnotherArchivist
Copy link
Collaborator

This smells like a bug in PyPy. I have verified that decode_string is performing exactly as expected. So for example in the test case with input "\\uD83D\\uD83D\\uDCA9", the result is a three-element wchar_t with values D83D, D83D, and DCA9 on 16-bit wchar_t platforms. When this is fed to CPython's PyUnicode_FromWideChar, it results in the expected output, a lone surrogate and U+1F4A9. PyPy instead produces U+FFFD and U+1F4A9. However, based on a glance at the relevant source code, I couldn't tell where the replacement occurs.

The fix is likely the same as for that broken wchar_t merging of raw surrogates anyway: moving away from wchar_t entirely. Py_UCS4 and PyUnicode_FromKindAndData seem like the obvious route; that interface did not exist yet when the wchar_t code was written. I'll give this a try soon. Will be interesting to see its performance as well.

@JustAnotherArchivist
Copy link
Collaborator

I have a working implementation with Py_UCS4 which passes all tests (including the ones that were skipped in #550) on windows-latest and ubuntu-latest with CPython 3.10 and PyPy 3.8. The code is very messy currently though; in particular, it imports Python.h in the lib to get the Py_UCS4 typedef. It's an unsigned int of at least 32 bit, and while the obvious type for that is uint32_t (and that is indeed what CPython uses), there's no guarantee of that. Any ideas how to best approach that?

@bwoodsend
Copy link
Collaborator

I believe that abandoning wchar_t in favour of uint32_t is the trendy, albeit reluctant, choice to make in order to make Windows behave with unicode. I personally don't mind if we just hardcode the assumption that Py_UCS4 is uint32_t and either use uint32_t directly in lib or have our own typedef JSUCHAR uint32_t in lib which we use instead. I can't think of any good reason why Python would want to set Py_USC4 to anything else.

bwoodsend added a commit to bwoodsend/ultrajson that referenced this issue Jun 17, 2022
PyPy's Windows unicode handling transpired to be significantly different
enough that a PyPy+Windows specific test failure slipped in (ultrajson#552).
bwoodsend added a commit to bwoodsend/ultrajson that referenced this issue Jun 17, 2022
@JustAnotherArchivist
Copy link
Collaborator

JustAnotherArchivist commented Jun 17, 2022

Well, PyPy has a typedef unsigned int Py_UCS4 for some reason (and either assumes that int is at least 32 bit or simply doesn't support other platforms, I guess), so there's that...

Maybe we can add a clever check somewhere to ensure that the types match during compilation. Something like this, functionally (as this doesn't actually work, I think):

#if sizeof(Py_UCS4) != sizeof(uint32_t)
#error "Py_UCS4 does not match uint32_t
#endif

bwoodsend added a commit to bwoodsend/ultrajson that referenced this issue Jun 17, 2022
@bwoodsend
Copy link
Collaborator

Well, PyPy has a typedef unsigned int Py_UCS4 for some reason (and either assumes that int is at least 32 bit or simply doesn't support other platforms, I guess), so there's that...

What? Why? [splutter splutter splutter]


Yeah, that check would be useful. It don't think the test failures would be that self explainatory if Py_UCS4 became something else.

@JustAnotherArchivist
Copy link
Collaborator

So I haven't done assertions at compile-time before. The clean way to do that is _Static_assert, which was added only in C11. CPython 3.11 requires C11 anyway, but previous versions don't, so we probably shouldn't rely on that just yet. There's a wide range of hacks to emulate it in older C, some of which no longer work. I think I'll go with the one that's also used in Linux (specifically in vbox_vmmdev_types.h):

typedef char assert_py_ucs4_is_jsuint32[1 - 2*!!(sizeof(Py_UCS4) == sizeof(JSUINT32))];

Slightly cursed, but hey, at least it works. :-)

@bwoodsend
Copy link
Collaborator

That's so ugly that it's comical. We've got to have it! (Honestly, I don't even want to know how that works.)

bwoodsend added a commit to bwoodsend/ultrajson that referenced this issue Jun 18, 2022
@JustAnotherArchivist
Copy link
Collaborator

JustAnotherArchivist commented Jun 18, 2022

The funny thing is that it's actually one of the cleaner ways to do this. I suppose (condition) ? 1 : -1 would make the meaning more obvious, but that's no fun. !! negates the result of the comparison twice to get a 0 (if the comparison evaluates to false) or 1 (if it's true). Then this tries to define an array type with either size 1 or -1. The latter is obviously an error, and the compiler gets angry. And if it passes and the type is never referenced anywhere, it should get optimised out and never end up in the binary. The abyss stared back...

There's an older, uglier version:

#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))

The actual error is the exact same reasoning, and the (void)sizeof(...) thing is there for ... reasons. This no longer compiles with current GCC though (fortunately).

Aanyway, enough about this. I'll prepare a PR.

Edit: Actually had the logic messed up there; a true comparison is equal to 1, a false one 0. The condition is what must be true for compilation to fail in this ... thing. A single exclamation mark should invert the meaning.

JustAnotherArchivist added a commit to JustAnotherArchivist/ultrajson that referenced this issue Jun 18, 2022
This fixes character handling on platforms with 16-bit wchar_t (notably, Windows), which was broken (in different ways) on both CPython and PyPy.

Fixes ultrajson#552
JustAnotherArchivist added a commit to JustAnotherArchivist/ultrajson that referenced this issue Jun 18, 2022
This fixes character handling on platforms with 16-bit wchar_t (notably, Windows), which was broken (in different ways) on both CPython and PyPy.

Fixes ultrajson#552
JustAnotherArchivist added a commit to JustAnotherArchivist/ultrajson that referenced this issue Jun 19, 2022
This fixes character handling on platforms with 16-bit wchar_t (notably, Windows), which was broken (in different ways) on both CPython and PyPy.

Fixes ultrajson#552
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants