Add support for arbitrary size integers #548
Conversation
static JSOBJ Object_newIntegerFromString(void *prv, char *value, size_t length)
{
  // PyLong_FromString requires a NUL-terminated string in CPython, contrary to the documentation: https://github.com/python/cpython/issues/59200
  char *buf = PyObject_Malloc(length + 1);
  memcpy(buf, value, length);
  buf[length] = '\0';
  PyObject *ret = PyLong_FromString(buf, NULL, 10);
  PyObject_Free(buf);
  return ret;
}
The `memcpy` here is certainly not ideal. It would probably be faster to buffer the next byte, replace it with a NUL, feed the char array through `PyLong_FromString`, and then restore it. I think that should always be possible/safe without buffer overruns, but I haven't given it a great deal of thought yet. Do you think it's worth exploring that sort of hack?
Ooh, now that's a disgusting but tantalising thing to do. Two ways I can imagine that causing damage:

1. The integer we're parsing is the last part of the JSON, so the next byte would be out of bounds.
2. Someone is reading the JSON string in multiple threads and the second thread happens to look at that byte while it's temporarily nullified.

Point 1 is moot if the string we're parsing is NULL-terminated (which I'm fairly certain it already is). Point 2 feels like a very nasty way to derail something.
The relevant part of the decoder boils down to:

- If the argument is a string, call `PyUnicode_AsUTF8String` to convert to `bytes`.
- Call `PyBytes_AsString`, which returns a NULL-terminated `char*` that is the internal buffer of the `bytes` object.
- Iterate over that buffer.

So no out-of-bounds issue. As for multithreading, yes, that could be an issue (if the user passes `bytes`). But on the other hand, ujson already isn't thread-safe as far as I can tell. For example, modifying a dict from another thread while encoding it should crash the encoder.
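For illustration, here is a minimal sketch of those three steps. The function name and the trivial "iterate" body are made up (this is not the actual ujson entry point); the point is that the buffer obtained this way is NUL-terminated, so looking one byte past the last digit stays in bounds.

```c
#include <Python.h>

/* Sketch only: the str -> bytes -> char* sequence described above.
   Not the actual ujson decoder; the "iteration" is reduced to a byte count
   so the example stays self-contained. */
static PyObject *buffer_from_str(PyObject *arg)
{
  /* Step 1: convert the str argument to bytes via UTF-8. */
  PyObject *bytes = PyUnicode_AsUTF8String(arg);
  if (bytes == NULL)
    return NULL;

  /* Step 2: PyBytes_AsString exposes the internal, NUL-terminated buffer. */
  char *buffer = PyBytes_AsString(bytes);
  if (buffer == NULL) {
    Py_DECREF(bytes);
    return NULL;
  }

  /* Step 3: iterate over that buffer. Reading up to (and including) the
     terminating NUL never leaves the allocation, which is why the
     "replace the next byte" hack can't overrun here. */
  Py_ssize_t n = 0;
  while (buffer[n] != '\0')
    n++;

  Py_DECREF(bytes);
  return PyLong_FromSsize_t(n);
}
```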
I ran some tests on this, parsing an array equivalent to `[2**64 + i for i in range(1000)]`, i.e. 1000 ints all using this code path. On my machine, parsing such a string averages 128 μs with `memcpy` and 108 μs with the suggested hack.
Compared to the other code paths for smaller ints (note that the number of digits obviously has a direct impact on throughput as well):

| input | time |
|---|---|
| `2**64 + i`, `memcpy` | 128 μs |
| `2**64 + i`, buffer hack | 108 μs |
| `2**62 + i` (`newUnsignedLong`) | 60 μs |
| `-2**63 + i` (`newLong`) | 64 μs |
| `2**30 + i` (`newInt`) | 45 μs |
| `-2**31 + i` (`newInt`) | 45 μs |
Command, for reference: `python3 -m timeit -s 'import ujson; s = "[" + ",".join(str(2**64 + i) for i in range(1000)) + "]"' 'ujson.loads(s)'`
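For orientation, here is a rough sketch of how the code paths named in the table might map onto value ranges. The enum, the function, and the exact boundaries are invented for illustration; the real decoder makes this choice while parsing the digits, so treat this as a reading aid rather than ujson's actual logic.

```c
#include <limits.h>

/* Hypothetical dispatch between the paths compared above. "overflowed" stands
   for the parser noticing that the digits no longer fit any C integer type. */
typedef enum {
  PATH_NEW_INT,            /* newInt: fits in a C int                        */
  PATH_NEW_LONG,           /* newLong: negative, fits in a long long         */
  PATH_NEW_UNSIGNED_LONG,  /* newUnsignedLong: positive, fits unsigned l. l. */
  PATH_FROM_STRING         /* newIntegerFromString: arbitrary-size fallback  */
} IntPath;

static IntPath choose_int_path(int negative, unsigned long long magnitude, int overflowed)
{
  if (overflowed)                                        /* e.g. 2**64 + i  */
    return PATH_FROM_STRING;
  if (negative) {
    if (magnitude <= (unsigned long long)INT_MAX + 1ULL)
      return PATH_NEW_INT;                               /* e.g. -2**31 + i */
    return PATH_NEW_LONG;                                /* e.g. -2**63 + i */
  }
  if (magnitude <= (unsigned long long)INT_MAX)
    return PATH_NEW_INT;                                 /* e.g. 2**30 + i  */
  return PATH_NEW_UNSIGNED_LONG;                         /* e.g. 2**62 + i  */
}
```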
Implementation of the hack:
static JSOBJ Object_newIntegerFromString(void *prv, char *value, size_t length)
{
char buf = value[length];
value[length] = '\0';
PyObject *out = PyLong_FromString(value, NULL, 10);
value[length] = buf;
return out;
}
That's a pretty small speedup really. On my machine it's even less (115 μs vs 110 μs). Definitely not worth the hack in my mind.
Yeah, agree. It's basically only relevant if all you ever do is parse huge integers. I don't think that's common enough to warrant the ugly hack.
@bwoodsend Small comment on the previous discussion in #440 just to avoid confusion: I realised that an extra type isn't needed.
Just to confirm, I did more performance checks on both encoding and decoding. As expected, throughput for ints within the previously supported range is not noticeably affected.
Codecov Report

@@            Coverage Diff             @@
##             main     #548      +/-   ##
==========================================
- Coverage   91.83%   91.80%   -0.03%
==========================================
  Files           6        6
  Lines        1824     1855      +31
==========================================
+ Hits         1675     1703      +28
- Misses        149      152       +3
Apparently PyPy's

Edit: Issue filed: https://foss.heptapod.net/pypy/pypy/-/issues/3765
Before merging, let's also test on the CI with PyPy on Windows. Obviously #552 is a blocker, so we can either wait for that to be fixed, or we could temporarily skip its failing test on PyPy/Windows to enable testing here first?
xfailing the offending test on Windows+PyPy, re-enabling CI for PyPy on either Windows or all platforms, and then rebasing this PR on top of that would be my plan.
Either way is fine with me.

Sounds good: #553

#553 is merged, we should be good to merge this too. Thanks both!
This implements the idea I mentioned in #440: when an int is too large to be handled with C integers, do the conversion with Python's C functions instead and treat it as a char array internally.
The code already works, but it's obviously not ready for merging. I'm submitting it now for feedback.
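As a rough sketch of the encoder-side half of that idea, under the assumption that the fallback amounts to "try the C integer conversions first, otherwise let CPython produce the decimal digits": the helper name and the use of `PyObject_Str` below are illustrative, not the PR's actual code.

```c
#include <Python.h>

/* Hypothetical helper, not the PR's implementation: returns either the int
   itself (small enough for the existing C-integer paths) or a str holding
   its decimal digits, which the encoder could then copy out as a char array.
   Assumes obj is an int. */
static PyObject *int_or_digit_string(PyObject *obj)
{
  int overflow = 0;
  long long v = PyLong_AsLongLongAndOverflow(obj, &overflow);
  (void)v;
  if (overflow == 0) {
    /* Fits in a C long long: the existing fast paths can handle it. */
    Py_INCREF(obj);
    return obj;
  }
  if (overflow == 1) {
    /* Positive overflow: it may still fit in an unsigned long long. */
    unsigned long long u = PyLong_AsUnsignedLongLong(obj);
    if (!(u == (unsigned long long)-1 && PyErr_Occurred())) {
      Py_INCREF(obj);
      return obj;
    }
    PyErr_Clear();
  }
  /* Too large (or too negative) for any C integer: let CPython render the
     decimal digits, e.g. 2**64 -> "18446744073709551616". */
  return PyObject_Str(obj);
}
```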
I only looked at performance briefly so far. As the code is now, it shouldn't noticeably impact throughput when no large ints are in the input (just one additional jump on encoding and a bigger jump on decoding, I think). Given how complex they are compared to ujson's implementation, I was surprised to find that the CPython functions actually perform very well. On encoding, I ran some simple `python3 -m timeit -s 'import ujson' 'ujson.dumps(i)'` tests for the following values of `i`:

- `0`
- `9`
- `10`
- `256`
- `257` (so CPython's caching of small `int` objects is not relevant)
- `65536`
- `9223372036854775807` (`2**63-1`, the largest `long long`)
- `9223372036854775808`
- `18446744073709551615` (`2**64-1`, the largest `unsigned long long`)
- `18446744073709551616`
I also briefly looked into getting rid of the `PyLong_AsLongLong` and `PyLong_AsUnsignedLongLong` calls entirely and just using this method for everything (though I didn't clean everything up, just commented out that code in `Object_beginTypeContext`). For `]-2**63, 2**63[`, this is slower than the existing code. (The range `[-9, 9]` is marginally faster because CPython has optimised code for that.) Outside of that range, it performs better. Specifically, `ujson.dumps(9223372036854775808)` takes about 395 ns on my machine then, 20 % less than the existing code. I haven't thoroughly tested removing only the unsigned call, but it seems to not be of any value. In other words, if someone were to encode lots of large ints, this might be a considerable performance increase for them, and if most ints were single-digit, it would be about equal. For others, it would be significantly worse. I'm not sure what to do with this insight.