Surrogates fix fails tests with PyPy on Windows #552

Hmm, interesting. And it's not just the wchar_t merging behaviour I mentioned in the PR, but all lone surrogates seem to get replaced with U+FFFD. That's odd.

JustAnotherArchivist · 2022-06-17T03:43:36Z

This smells like a bug in PyPy. I have verified that decode_string is performing exactly as expected. So for example in the test case with input "\\uD83D\\uD83D\\uDCA9", the result is a three-element wchar_t with values D83D, D83D, and DCA9 on 16-bit wchar_t platforms. When this is fed to CPython's PyUnicode_FromWideChar, it results in the expected output, a lone surrogate and U+1F4A9. PyPy instead produces U+FFFD and U+1F4A9. However, based on a glance at the relevant source code, I couldn't tell where the replacement occurs.

The fix is likely the same as for that broken wchar_t merging of raw surrogates anyway: moving away from wchar_t entirely. Py_UCS4 and PyUnicode_FromKindAndData seem like the obvious route; that interface did not exist yet when the wchar_t code was written. I'll give this a try soon. Will be interesting to see its performance as well.

JustAnotherArchivist · 2022-06-17T15:40:44Z

I have a working implementation with Py_UCS4 which passes all tests (including the ones that were skipped in #550) on windows-latest and ubuntu-latest with CPython 3.10 and PyPy 3.8. The code is very messy currently though; in particular, it imports Python.h in the lib to get the Py_UCS4 typedef. It's an unsigned int of at least 32 bit, and while the obvious type for that is uint32_t (and that is indeed what CPython uses), there's no guarantee of that. Any ideas how to best approach that?

bwoodsend · 2022-06-17T21:33:36Z

I believe that abandoning wchar_t in favour of uint32_t is the trendy, albeit reluctant, choice to make in order to make Windows behave with unicode. I personally don't mind if we just hardcode the assumption that Py_UCS4 is uint32_t and either use uint32_t directly in lib or have our own typedef JSUCHAR uint32_t in lib which we use instead. I can't think of any good reason why Python would want to set Py_USC4 to anything else.

PyPy's Windows unicode handling transpired to be significantly different enough that a PyPy+Windows specific test failure slipped in (ultrajson#552).

This feature needs fixing. See ultrajson#552.

JustAnotherArchivist · 2022-06-17T21:45:55Z

Well, PyPy has a typedef unsigned int Py_UCS4 for some reason (and either assumes that int is at least 32 bit or simply doesn't support other platforms, I guess), so there's that...

Maybe we can add a clever check somewhere to ensure that the types match during compilation. Something like this, functionally (as this doesn't actually work, I think):

#if sizeof(Py_UCS4) != sizeof(uint32_t)
#error "Py_UCS4 does not match uint32_t
#endif

This feature needs fixing. See ultrajson#552.

bwoodsend · 2022-06-17T21:59:08Z

Well, PyPy has a typedef unsigned int Py_UCS4 for some reason (and either assumes that int is at least 32 bit or simply doesn't support other platforms, I guess), so there's that...

What? Why? [splutter splutter splutter]

Yeah, that check would be useful. It don't think the test failures would be that self explainatory if Py_UCS4 became something else.

JustAnotherArchivist · 2022-06-17T22:46:21Z

So I haven't done assertions at compile-time before. The clean way to do that is _Static_assert, which was added only in C11. CPython 3.11 requires C11 anyway, but previous versions don't, so we probably shouldn't rely on that just yet. There's a wide range of hacks to emulate it in older C, some of which no longer work. I think I'll go with the one that's also used in Linux (specifically in vbox_vmmdev_types.h):

typedef char assert_py_ucs4_is_jsuint32[1 - 2*!!(sizeof(Py_UCS4) == sizeof(JSUINT32))];

Slightly cursed, but hey, at least it works. :-)

bwoodsend · 2022-06-17T23:47:38Z

That's so ugly that it's comical. We've got to have it! (Honestly, I don't even want to know how that works.)

This feature needs fixing. See ultrajson#552.

JustAnotherArchivist · 2022-06-18T21:07:17Z

The funny thing is that it's actually one of the cleaner ways to do this. I suppose (condition) ? 1 : -1 would make the meaning more obvious, but that's no fun. !! negates the result of the comparison twice to get a 0 (if the comparison evaluates to false) or 1 (if it's true). Then this tries to define an array type with either size 1 or -1. The latter is obviously an error, and the compiler gets angry. And if it passes and the type is never referenced anywhere, it should get optimised out and never end up in the binary. The abyss stared back...

There's an older, uglier version:

#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))

The actual error is the exact same reasoning, and the (void)sizeof(...) thing is there for ... reasons. This no longer compiles with current GCC though (fortunately).

Aanyway, enough about this. I'll prepare a PR.

Edit: Actually had the logic messed up there; a true comparison is equal to 1, a false one 0. The condition is what must be true for compilation to fail in this ... thing. A single exclamation mark should invert the meaning.

This fixes character handling on platforms with 16-bit wchar_t (notably, Windows), which was broken (in different ways) on both CPython and PyPy. Fixes ultrajson#552

hugovk added the bug Something isn't working label Jun 16, 2022

hugovk mentioned this issue Jun 17, 2022

Add support for arbitrary size integers #548

Merged

bwoodsend added a commit to bwoodsend/ultrajson that referenced this issue Jun 17, 2022

xfail test_decode_surrogate_characters() on Windows PyPy.

da9ac0b

This feature needs fixing. See ultrajson#552.

bwoodsend mentioned this issue Jun 17, 2022

Re-enable testing on PyPy for all platforms. #553

Merged

bwoodsend added a commit to bwoodsend/ultrajson that referenced this issue Jun 17, 2022

xfail test_decode_surrogate_characters() on Windows PyPy.

53c0864

This feature needs fixing. See ultrajson#552.

bwoodsend added a commit to bwoodsend/ultrajson that referenced this issue Jun 18, 2022

xfail test_decode_surrogate_characters() on Windows PyPy.

404de1a

This feature needs fixing. See ultrajson#552.

JustAnotherArchivist mentioned this issue Jun 18, 2022

Replace wchar_t string decoding implementation with a uint32_t-based one #555

Merged

hugovk closed this as completed in #555 Jun 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Surrogates fix fails tests with PyPy on Windows #552

Surrogates fix fails tests with PyPy on Windows #552

hugovk commented Jun 16, 2022

JustAnotherArchivist commented Jun 16, 2022

JustAnotherArchivist commented Jun 17, 2022

JustAnotherArchivist commented Jun 17, 2022

bwoodsend commented Jun 17, 2022

JustAnotherArchivist commented Jun 17, 2022 •

edited

bwoodsend commented Jun 17, 2022

JustAnotherArchivist commented Jun 17, 2022

bwoodsend commented Jun 17, 2022

JustAnotherArchivist commented Jun 18, 2022 •

edited

Surrogates fix fails tests with PyPy on Windows #552

Surrogates fix fails tests with PyPy on Windows #552

Comments

hugovk commented Jun 16, 2022

What did you do?

What did you expect to happen?

What actually happened?

What versions are you using?

JustAnotherArchivist commented Jun 16, 2022

JustAnotherArchivist commented Jun 17, 2022

JustAnotherArchivist commented Jun 17, 2022

bwoodsend commented Jun 17, 2022

JustAnotherArchivist commented Jun 17, 2022 • edited

bwoodsend commented Jun 17, 2022

JustAnotherArchivist commented Jun 17, 2022

bwoodsend commented Jun 17, 2022

JustAnotherArchivist commented Jun 18, 2022 • edited

JustAnotherArchivist commented Jun 17, 2022 •

edited

JustAnotherArchivist commented Jun 18, 2022 •

edited