Fix len integer overflow issue #567
Conversation
Codecov Report
@@ Coverage Diff @@
## main #567 +/- ##
==========================================
- Coverage 91.58% 91.49% -0.10%
==========================================
Files 6 6
Lines 1902 1905 +3
==========================================
+ Hits 1742 1743 +1
- Misses 160 162 +2
Uhm, what exactly is this trying to fix? Can you give an example of where the original goes wrong but this doesn't? My guess would be something like:

    ujson.loads('[...something just under INT_MAX bytes long..., 0.36543654365436543]')

but surely, just truncating the length would mean that you only parse the first few digits of the decimal? If I'm wrong here, can you add some test cases which demonstrate what this change is doing?
Yup, it's something similar to what you describe. We encountered it in the wild with some large json files that failed to load and went down the rabbit hole. Good point though, I'll add a test case.
I have added a unittest that exercises the overflow. It's quite slow because the string to decode needs to have ~2**32 bytes. Please let me know if you want me to modify it in any way. Thanks!
    =================================== FAILURES ===================================
    _____________________ test_decode_decimal_no_int_overflow ______________________

        def test_decode_decimal_no_int_overflow():
            # Takes a while because the string is large; feel free to comment out or remove
    >       ujson.decode(r'[0.123456789,"{}"]'.format("a" * (2**32 - 5)))
    E       ujson.JSONDecodeError: Could not reserve memory block

    tests/test_ujson.py:1129: JSONDecodeError

Ahh, in hindsight, the Windows and Linux machines on CI only get 4GB of RAM each, so a 4GB string is not going to happen there. I can't even run it on my 8GB home laptop. Hmm, not sure what would be best here... Shouldn't the string be the other way around to make the overflow happen whilst parsing the decimal? i.e. with the decimal at the end?
Ahh gotcha about the RAM. I'm running it on a dev server with plenty of RAM (384GB) and can reproduce it there. Not sure what the right course of action is either. :-/ You want the string to be that way. The issue is with casting the `ds->end - ds->start` pointer difference to an `int`: when the decimal comes first, almost the whole 4GB buffer is still ahead of the parser, so the difference overflows.
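To make the cast problem concrete, here is a small Python illustration (not the ujson C source): a buffer length just under 2**32 no longer fits in a signed 32-bit `int`, so the cast wraps it to a small negative number, and any length check using it afterwards misbehaves.

```python
import ctypes

# Bytes left in the buffer while the leading decimal is being parsed:
# just under 4 GiB, matching the test string above.
remaining = 2**32 - 5

# ctypes.c_int models the C cast to a signed 32-bit int; ctypes does no
# overflow checking, so the value simply wraps around.
print(ctypes.c_int(remaining).value)  # -5
```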
Ahh, I get it now. For the test I suppose we could do something like:

    def test_decode_decimal_no_int_overflow():
        try:
            bytes(1 << 34)
        except MemoryError:
            pytest.skip()
        # rest of test here

Feels a bit yucky though. Any thoughts @hugovk?
That's one option, it means they can at least be run locally on machines with lots of memory. Another idea would be to add a skip decorator that checks there's at least a certain amount of memory available. For Pillow, we have some memory/DoS tests which take a long time, and named them `check_*` so they're only run manually rather than on every CI run.
I'm guessing that you don't know of a nicer way of doing the "how much space do we have left?" query itself than trying to allocate a huge buffer? i.e. the decorator would have to be:

    import pytest

    def needs_lots_of_memory(x):
        def wrapper(f):
            def wrapped(*args, **kwargs):
                try:
                    bytes(x)
                except MemoryError:
                    pytest.skip()
                return f(*args, **kwargs)
            return wrapped
        return wrapper
Oh look at that! Perhaps adding a dependency to psutil and using https://psutil.readthedocs.io/en/latest/#psutil.virtual_memory I'm thinking we don't need to add a test if it's too complicated, and maybe just add a `check_`-style script instead.
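For reference, a skip helper built on psutil could look roughly like this; `psutil.virtual_memory().available` is the documented psutil API, while the decorator name and the 8 GiB threshold are made up for illustration:

```python
import psutil
import pytest

def needs_available_memory(required_bytes):
    # Hypothetical helper: skip the test unless the OS reports enough
    # available RAM, instead of probing by allocating a huge buffer.
    available = psutil.virtual_memory().available
    return pytest.mark.skipif(
        available < required_bytes,
        reason=f"needs {required_bytes:,} bytes of RAM, "
               f"but only {available:,} available",
    )

@needs_available_memory(8 * 2**30)  # ~8 GiB for the 4 GiB string plus overhead
def test_decode_decimal_no_int_overflow():
    ...
```

Unlike the `bytes(x)` probe, this never touches the allocator, so it cannot itself disturb other tests or trigger the OOM killer.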
I'd go with the `check_` script option. Perhaps I can just add a comment to my change and reference the corresponding check script for future maintenance?
Let's go with the check script.
Strangely, this seems to be the only place affected so far, but I imagine that there will be more to join it soon. Most of this codebase was written >10 years ago when RAM sizes were generally smaller. Just ran:

    import ujson

    for size in [2**31, 2**32, 3 * 2**30, 5 * 2**30]:
        print("{:3,} => {:3,}".format(size, len(ujson.loads(ujson.dumps([0] * size)))))
I added some comments and changed the overflow test to a check script. Following our previous discussion, I reserved a space inside the test suite for these memory-hungry checks.
Thank you!
Bug:

If `ds->end - ds->start` causes an `int` overflow here, we may end up truncating a double while parsing. This will then cause a decoder error, as the next token will be unexpected.

Changes proposed in this pull request:

- Avoid integer overflow by setting `len = min(INT_MAX, ds->end - ds->start)` in `decodeDouble`.
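The effect of the proposed clamp can be sketched outside the C code. In this Python illustration, `INT_MAX` and the 32-bit cast mirror the C types, while the function name is just for demonstration:

```python
import ctypes

INT_MAX = 2**31 - 1  # largest value a 32-bit signed int can hold

def clamped_len(remaining):
    # Mirrors the proposed `len = min(INT_MAX, ds->end - ds->start)`.
    return min(INT_MAX, remaining)

remaining = 2**32 - 5  # ~4 GiB of buffer left to scan

# Without the clamp, the later cast to a C int wraps negative; with the
# clamp, it is capped at INT_MAX instead.
print(ctypes.c_int(remaining).value)               # -5
print(ctypes.c_int(clamped_len(remaining)).value)  # 2147483647
```

Capping at INT_MAX is presumably safe here because a textual double literal is only ever a few dozen characters long: the parser needs enough length to scan one token, not the full remaining buffer.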