Add support for arbitrary size integers #548
Merged

6 commits
- aa068e3 Add support for arbitrary size integers (JustAnotherArchivist)
- a8b35ef Fix tests for large ints (JustAnotherArchivist)
- 655e146 Skip big integer encoding tests on PyPy for now (JustAnotherArchivist)
- e2097ab Refactor test_encode_decode_big_int per Brénainn's suggestion (JustAnotherArchivist)
- 773c04a Add big int test with dict sorting (JustAnotherArchivist)
- 99709df More descriptive variable names (JustAnotherArchivist)
---

The `memcpy` here is certainly not ideal. It would probably be faster to buffer the next byte, replace it with a NUL, feed the char array through `PyLong_FromString`, and then restore it. I think that should always be possible/safe without buffer overruns, but I haven't given it a great deal of thought yet. Do you think it's worth exploring that sort of hack?
---
Ooh, now that's a disgusting but tantalising thing to do. Two ways I imagine that causing damage are:

1. Writing the NUL out of bounds if the number sits at the very end of the input buffer.
2. Another thread reading the buffer while the byte is temporarily swapped out.

1 is moot if the string we're parsing is NULL terminated (which I'm fairly certain it already is). 2 feels like a very nasty way to derail something.
---
The relevant part of the decoder boils down to:

- `PyUnicode_AsUTF8String` to convert to `bytes`.
- `PyBytes_AsString`, which returns a NULL-terminated `char*` that is the internal buffer of the `bytes` object.

So no out-of-bounds issue. As for multithreading, yes, that could be an issue (if the user passes `bytes`). But on the other hand, ujson already isn't thread-safe as far as I can tell. For example, modifying a dict from another thread while encoding it should crash the encoder.
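For illustration, a minimal sketch of that conversion path using the CPython C API (not ujson's actual code; the helper name is made up):

```c
#include <Python.h>

/* Sketch: get a NUL-terminated char* for a str input by encoding it to a
   new bytes object and borrowing that object's internal buffer. */
static const char *as_terminated_buffer(PyObject *input, PyObject **owner)
{
    *owner = PyUnicode_AsUTF8String(input);   /* str -> bytes (UTF-8) */
    if (*owner == NULL)
        return NULL;                          /* encoding failed */
    /* The internal buffer of a bytes object is always NUL-terminated;
       no copy is made, and it stays valid while *owner is alive. */
    return PyBytes_AsString(*owner);
}
```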
---
I ran some tests on this, parsing an array equivalent to `[2**64 + i for i in range(1000)]`, i.e. 1000 ints all using this code path. On my machine, parsing such a string averages 128 μs with `memcpy` and 108 μs with the suggested hack.

Compared to the other code paths for smaller ints (note that the number of digits obviously has a direct impact on throughput as well):

- `2**64 + i`, `memcpy`: 128 μs
- `2**64 + i`, buffer hack: 108 μs
- `2**62 + i` (new `UnsignedLong`)
- `-2**63 + i` (new `Long`)
- `2**30 + i` (new `Int`)
- `-2**31 + i` (new `Int`)

Command, for reference: `python3 -m timeit -s 'import ujson; s = "[" + ",".join(str(2**64 + i) for i in range(1000)) + "]"' 'ujson.loads(s)'`

Implementation of the hack:
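A sketch along those lines, assuming `start` points at the first digit and `end` one past the last digit of the number (names illustrative, not ujson's actual code):

```c
#include <Python.h>

/* Parse the digit run [start, end) without copying: buffer the byte after
   the digits, overwrite it with a NUL so PyLong_FromString stops there,
   then restore it. Requires a writable buffer (e.g. the internals of a
   bytes object we own). */
static PyObject *parse_big_int_in_place(char *start, char *end)
{
    char saved = *end;   /* buffer the next byte */
    *end = '\0';         /* temporarily NUL-terminate the digits */
    PyObject *value = PyLong_FromString(start, NULL, 10);
    *end = saved;        /* restore the original byte */
    return value;        /* NULL with an exception set on failure */
}
```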
---
That's a pretty small speedup really. On my machine it's even less (115 μs vs 110 μs). Definitely not worth the hack in my mind.
---
Yeah, agree. It's basically only relevant if all you ever do is parse huge integers. I don't think that's common enough to warrant the ugly hack.