Add support for arbitrary size integers #548
Conversation
static JSOBJ Object_newIntegerFromString(void *prv, char *value, size_t length)
{
  // PyLong_FromString requires a NUL-terminated string in CPython, contrary to the documentation: https://github.com/python/cpython/issues/59200
  char *buf = PyObject_Malloc(length + 1);
  memcpy(buf, value, length);
  buf[length] = '\0';
  PyObject *ret = PyLong_FromString(buf, NULL, 10);
  PyObject_Free(buf);
  return ret;
}
The `memcpy` here is certainly not ideal. It would probably be faster to buffer the next byte, replace it with a NUL, feed the char array through `PyLong_FromString`, and then restore it. I think that should always be possible/safe without buffer overruns, but I haven't given it a great deal of thought yet. Do you think it's worth exploring that sort of hack?
Ooh, now that's a disgusting but tantalising thing to do. Two ways I can imagine that causing damage:

1. The integer we're parsing is the last part of the JSON, so the next byte would be out of bounds.
2. Someone is reading the JSON string in multiple threads and the second thread happens to look at that byte while it's temporarily nullified.

Point 1 is moot if the string we're parsing is NULL-terminated (which I'm fairly certain it already is). Point 2 feels like a very nasty way to derail something.
The relevant part of the decoder boils down to:

- If the argument is a string, call `PyUnicode_AsUTF8String` to convert to `bytes`.
- Call `PyBytes_AsString`, which returns a NULL-terminated `char*` that is the internal buffer of the `bytes` object.
- Iterate over that buffer.

So no out-of-bounds issue. As for multithreading, yes, that could be an issue (if the user passes `bytes`). But on the other hand, ujson already isn't thread-safe as far as I can tell. For example, modifying a dict from another thread while encoding it should crash the encoder.
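For illustration, here is a minimal sketch of those three steps. The function name and the trivial "iterate" body are made up (this is not the actual ujson entry point); the point is that the buffer obtained this way is NUL-terminated, so looking one byte past the last digit stays in bounds.

```c
#include <Python.h>

/* Sketch only: the str -> bytes -> char* sequence described above.
   Not the actual ujson decoder; the "iteration" is reduced to a byte count
   so the example stays self-contained. */
static PyObject *buffer_from_str(PyObject *arg)
{
  /* Step 1: convert the str argument to bytes via UTF-8. */
  PyObject *bytes = PyUnicode_AsUTF8String(arg);
  if (bytes == NULL)
    return NULL;

  /* Step 2: PyBytes_AsString exposes the internal, NUL-terminated buffer. */
  char *buffer = PyBytes_AsString(bytes);
  if (buffer == NULL) {
    Py_DECREF(bytes);
    return NULL;
  }

  /* Step 3: iterate over that buffer. Reading up to (and including) the
     terminating NUL never leaves the allocation, which is why the
     "replace the next byte" hack can't overrun here. */
  Py_ssize_t n = 0;
  while (buffer[n] != '\0')
    n++;

  Py_DECREF(bytes);
  return PyLong_FromSsize_t(n);
}
```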
I ran some tests on this, parsing an array equivalent to `[2**64 + i for i in range(1000)]`, i.e. 1000 ints all using this code path. On my machine, parsing such a string averages 128 μs with `memcpy` and 108 μs with the suggested hack.
Compared to the other code paths for smaller ints (note that the number of digits obviously has a direct impact on throughput as well):

| input | time |
|---|---|
| `2**64 + i`, `memcpy` | 128 μs |
| `2**64 + i`, buffer hack | 108 μs |
| `2**62 + i` (`newUnsignedLong`) | 60 μs |
| `-2**63 + i` (`newLong`) | 64 μs |
| `2**30 + i` (`newInt`) | 45 μs |
| `-2**31 + i` (`newInt`) | 45 μs |
Command, for reference: `python3 -m timeit -s 'import ujson; s = "[" + ",".join(str(2**64 + i) for i in range(1000)) + "]"' 'ujson.loads(s)'`
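For orientation, here is a rough sketch of how the code paths named in the table might map onto value ranges. The enum, the function, and the exact boundaries are invented for illustration; the real decoder makes this choice while parsing the digits, so treat this as a reading aid rather than ujson's actual logic.

```c
#include <limits.h>

/* Hypothetical dispatch between the paths compared above. "overflowed" stands
   for the parser noticing that the digits no longer fit any C integer type. */
typedef enum {
  PATH_NEW_INT,            /* newInt: fits in a C int                        */
  PATH_NEW_LONG,           /* newLong: negative, fits in a long long         */
  PATH_NEW_UNSIGNED_LONG,  /* newUnsignedLong: positive, fits unsigned l. l. */
  PATH_FROM_STRING         /* newIntegerFromString: arbitrary-size fallback  */
} IntPath;

static IntPath choose_int_path(int negative, unsigned long long magnitude, int overflowed)
{
  if (overflowed)                                        /* e.g. 2**64 + i  */
    return PATH_FROM_STRING;
  if (negative) {
    if (magnitude <= (unsigned long long)INT_MAX + 1ULL)
      return PATH_NEW_INT;                               /* e.g. -2**31 + i */
    return PATH_NEW_LONG;                                /* e.g. -2**63 + i */
  }
  if (magnitude <= (unsigned long long)INT_MAX)
    return PATH_NEW_INT;                                 /* e.g. 2**30 + i  */
  return PATH_NEW_UNSIGNED_LONG;                         /* e.g. 2**62 + i  */
}
```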
Implementation of the hack:
static JSOBJ Object_newIntegerFromString(void *prv, char *value, size_t length)
{
char buf = value[length];
value[length] = '\0';
PyObject *out = PyLong_FromString(value, NULL, 10);
value[length] = buf;
return out;
}
That's a pretty small speedup really. On my machine it's even less (115 μs vs 110 μs). Definitely not worth the hack in my mind.
Yeah, agree. It's basically only relevant if all you ever do is parse huge integers. I don't think that's common enough to warrant the ugly hack.
@bwoodsend Small comment on the previous discussion in #440 just to avoid confusion: I realised that an extra type isn't needed.
Just to confirm, I did more performance checks on both encoding and decoding. As expected, throughput for ints within the previously supported range is not noticeably affected.
Codecov Report

@@            Coverage Diff             @@
##             main     #548      +/-   ##
==========================================
- Coverage   91.83%   91.80%   -0.03%
==========================================
  Files           6        6
  Lines        1824     1855      +31
==========================================
+ Hits         1675     1703      +28
- Misses        149      152       +3
Apparently PyPy's

Edit: Issue filed: https://foss.heptapod.net/pypy/pypy/-/issues/3765
Before merging, let's also test on the CI with PyPy on Windows. Obviously #552 is a blocker, so we can either wait for that to be fixed, or we could temporarily skip its failing test on PyPy/Windows to enable testing here first?
xfailing the offending test on Windows+PyPy, re-enabling CI for PyPy on either Windows or all platforms, and then rebasing this PR on top of that would be my plan.
Either way is fine with me.

Sounds good: #553

#553 is merged, we should be good to merge this too. Thanks both!
This implements the idea I mentioned in #440: when an int is too large to be handled with C integers, do the conversion with Python's C functions instead and treat it as a char array internally.
The code already works, but it's obviously not ready for merging. I'm submitting it now for feedback.
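As a rough sketch of the encoder-side half of that idea, under the assumption that the fallback amounts to "try the C integer conversions first, otherwise let CPython produce the decimal digits": the helper name and the use of `PyObject_Str` below are illustrative, not the PR's actual code.

```c
#include <Python.h>

/* Hypothetical helper, not the PR's implementation: returns either the int
   itself (small enough for the existing C-integer paths) or a str holding
   its decimal digits, which the encoder could then copy out as a char array.
   Assumes obj is an int. */
static PyObject *int_or_digit_string(PyObject *obj)
{
  int overflow = 0;
  long long v = PyLong_AsLongLongAndOverflow(obj, &overflow);
  (void)v;
  if (overflow == 0) {
    /* Fits in a C long long: the existing fast paths can handle it. */
    Py_INCREF(obj);
    return obj;
  }
  if (overflow == 1) {
    /* Positive overflow: it may still fit in an unsigned long long. */
    unsigned long long u = PyLong_AsUnsignedLongLong(obj);
    if (!(u == (unsigned long long)-1 && PyErr_Occurred())) {
      Py_INCREF(obj);
      return obj;
    }
    PyErr_Clear();
  }
  /* Too large (or too negative) for any C integer: let CPython render the
     decimal digits, e.g. 2**64 -> "18446744073709551616". */
  return PyObject_Str(obj);
}
```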
I only looked at performance briefly so far. As the code is now, it shouldn't noticeably impact throughput when no large ints are in the input (just one additional jump on encoding and a bigger jump on decoding, I think). Given how complex they are compared to ujson's implementation, I was surprised to find that the CPython functions actually perform very well. On encoding, I ran some simple `python3 -m timeit -s 'import ujson' 'ujson.dumps(i)'` tests for the following values of `i`:

- `0`
- `9`
- `10`
- `256`
- `257` (so CPython's caching of small `int` objects is not relevant)
- `65536`
- `9223372036854775807` (`2**63-1`, the largest `long long`)
- `9223372036854775808`
- `18446744073709551615` (`2**64-1`, the largest `unsigned long long`)
- `18446744073709551616`
I also briefly looked into getting rid of the `PyLong_AsLongLong` and `PyLong_AsUnsignedLongLong` calls entirely and just using this method for everything (though I didn't clean everything up, just commented out that code in `Object_beginTypeContext`). For `]-2**63, 2**63[`, this is slower than the existing code. (The range `[-9, 9]` is marginally faster because CPython has optimised code for that.) Outside of that range, it performs better. Specifically, `ujson.dumps(9223372036854775808)` takes about 395 ns on my machine then, 20 % less than the existing code. I haven't thoroughly tested removing only the unsigned call, but it seems to not be of any value. In other words, if someone were to encode lots of large ints, this might be a considerable performance increase for them, and if most ints were single-digit, it would be about equal. For others, it would be significantly worse. I'm not sure what to do with this insight.