Fix len integer overflow issue #567

Merged
6 commits merged into ultrajson:main from marioga_prevent_int_overflow on Oct 14, 2022

Conversation

@marioga (Contributor) commented Oct 4, 2022

Bug:

If ds->end - ds->start overflows an int here, we may end up truncating a double while parsing it; the leftover digits then surface as an unexpected token and trigger a decoder error.

Changes proposed in this pull request:

Avoid integer overflow by setting len = min(INT_MAX, ds->end - ds->start) in decodeDouble.
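
[Editor's note: the following is a sketch of the arithmetic in Python, not the C diff itself; the variable names are illustrative. It shows what the unpatched cast of ds->end - ds->start to int does on a 64-bit build, versus the proposed clamp.]

# ds->end - ds->start when ~4 GiB of input is left to parse:
remaining = 2**32 + 10

# Unpatched: the C cast (int)(ds->end - ds->start) keeps only the low 32 bits.
truncated = remaining & 0xFFFFFFFF
if truncated >= 2**31:          # reproduce the two's-complement sign wrap
    truncated -= 2**32
print(truncated)                # 10: far too small a scan budget

# Patched: clamp instead of truncating, mirroring min(INT_MAX, ...).
INT_MAX = 2**31 - 1
print(min(INT_MAX, remaining))  # 2147483647: a safe upper bound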

@codecov-commenter commented Oct 4, 2022

Codecov Report

Merging #567 (ec095e4) into main (7db453b) will decrease coverage by 0.09%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##             main     #567      +/-   ##
==========================================
- Coverage   91.58%   91.49%   -0.10%     
==========================================
  Files           6        6              
  Lines        1902     1905       +3     
==========================================
+ Hits         1742     1743       +1     
- Misses        160      162       +2     
Impacted Files        Coverage Δ
lib/ultrajsondec.c    92.79% <ø> (-0.02%) ⬇️
tests/test_ujson.py   99.10% <50.00%> (-0.36%) ⬇️


@hugovk added the "changelog: Fixed" (for any bug fixes) label on Oct 6, 2022
@bwoodsend (Collaborator) commented:

Uhm, what exactly is this trying to fix? Can you give an example where the original goes wrong but this doesn't? My guess would be something like:

ujson.loads('[...something just under INT_MAX bytes long..., 0.36543654365436543]')

but surely, just truncating the length would mean that you only parse the first few digits of the decimal?

If I'm wrong here, can you add some test cases which demonstrate what this change is doing?

@marioga (Contributor, Author) commented Oct 11, 2022

> Uhm, what exactly is this trying to fix? Can you give an example where the original goes wrong but this doesn't? My guess would be something like:
>
> ujson.loads('[...something just under INT_MAX bytes long..., 0.36543654365436543]')
>
> but surely, just truncating the length would mean that you only parse the first few digits of the decimal?
>
> If I'm wrong here, can you add some test cases which demonstrate what this change is doing?

Yup, it's something similar to what you describe. We encountered it in the wild with some large JSON files that failed to load, and we went down the rabbit hole.

Good point though, I'll add a test case.

@marioga (Contributor, Author) commented Oct 11, 2022

I have added a unit test that raises ujson.JSONDecodeError: Unexpected character found when decoding array value (2) on the main branch but decodes correctly with the current change.

It's quite slow because the string to decode needs to be ~2**32 bytes long. Please let me know if you want me to modify it in any way. Thanks!

@bwoodsend (Collaborator) commented:

=================================== FAILURES ===================================
_____________________ test_decode_decimal_no_int_overflow ______________________

    def test_decode_decimal_no_int_overflow():
        # Takes a while because the string is large; feel free to comment out or remove
>       ujson.decode(r'[0.123456789,"{}"]'.format("a" * (2**32 - 5)))
E       ujson.JSONDecodeError: Could not reserve memory block

tests/test_ujson.py:1129: JSONDecodeError

Ahh, in hindsight, the Windows and Linux machines on CI only get 4GB of RAM each, so a 4GB string is not going to happen there. I can't even run it on my 8GB home laptop. Hmm, not sure what would be best here...

Shouldn't the string be the other way around to make the overflow happen whilst parsing the decimal? i.e. r'["{}",0.123456789]'

@marioga (Contributor, Author) commented Oct 11, 2022

> =================================== FAILURES ===================================
> _____________________ test_decode_decimal_no_int_overflow ______________________
>
>     def test_decode_decimal_no_int_overflow():
>         # Takes a while because the string is large; feel free to comment out or remove
> >       ujson.decode(r'[0.123456789,"{}"]'.format("a" * (2**32 - 5)))
> E       ujson.JSONDecodeError: Could not reserve memory block
>
> tests/test_ujson.py:1129: JSONDecodeError
>
> Ahh, in hindsight, the Windows and Linux machines on CI only get 4GB of RAM each, so a 4GB string is not going to happen there. I can't even run it on my 8GB home laptop. Hmm, not sure what would be best here...
>
> Shouldn't the string be the other way around to make the overflow happen whilst parsing the decimal? i.e. r'["{}",0.123456789]'

Ahh gotcha about the RAM. I'm running it on a dev server with plenty of RAM (384GB) and can reproduce it there. Not sure what the right course of action is either. :-/

You actually want the string in the order the test has it. The issue is the cast of ds->end - ds->start to an int: you want the distance to the end of the input to be massive while decoding the decimal. In this case, (int) (ds->end - ds->start) == 10 when ds->start points to the beginning of 0.123456789, which has 11 characters. We thus end up with a partially decoded double (the trailing 9 is left dangling), and the exception follows.
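
[Editor's note: a quick back-of-the-envelope check of those numbers, following the test string above. This is editorial arithmetic, not code from the PR.]

decimal = len("0.123456789")                   # 11 characters
# Remaining input measured from the start of the decimal: the decimal,
# a comma, the quoted string of 2**32 - 5 letters (plus its two quotes),
# and the closing bracket.
remaining = decimal + 1 + (1 + (2**32 - 5) + 1) + 1
print(remaining)           # 4294967306 == 2**32 + 10
print(remaining % 2**32)   # 10: the value the 32-bit cast leaves behind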

@bwoodsend (Collaborator) commented:

Ahh, I get it now. len isn't the length of the encoded decimal, as I had assumed; it's how much unparsed JSON data we have left, which gives dconv_s2d() an upper bound on how many bytes it's allowed to scan for the end of the decimal. In which case, truncating to INT_MAX is fine, because no one is going to write a decimal containing INT_MAX digits.
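
[Editor's note: a rough Python analogy of that semantics; scan_double below is hypothetical and is not the real dconv_s2d() API. The point is that len acts as a scan budget, not a number length.]

FLOAT_CHARS = set("0123456789+-.eE")

def scan_double(buf, start, max_len):
    """Parse a float starting at buf[start], looking at most max_len bytes ahead."""
    end = start
    while end - start < max_len and end < len(buf) and buf[end] in FLOAT_CHARS:
        end += 1
    return float(buf[start:end]), end

print(scan_double('[0.123456789,"a"]', 1, 16))  # (0.123456789, 12): budget is ample
print(scan_double('[0.123456789,"a"]', 1, 10))  # (0.12345678, 11): the 9 is left dangling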

For the test I suppose we could do something like:

import pytest

def test_decode_decimal_no_int_overflow():
    try:
        # Probe: can this machine even allocate a 16 GiB buffer?
        bytes(1 << 34)
    except MemoryError:
        pytest.skip("not enough memory for a ~4 GiB test string")
    # rest of test here

Feels a bit yucky though. Any thoughts @hugovk?

@hugovk (Member) commented Oct 12, 2022

That's one option; it means they can at least be run locally on machines with lots of memory.

Another idea would be to add a skip decorator that checks there's at least a certain amount of memory available.

For Pillow, we have some memory/DoS tests which take a long time, and we named them check_x instead of test_x so they're not part of the main test suite. I doubt we run them often, but they're there: https://github.com/python-pillow/Pillow/tree/main/Tests

@bwoodsend (Collaborator) commented:

> Another idea would be to add a skip decorator that checks there's at least a certain amount of memory available.

I'm guessing that you don't know of a nicer way of doing the "how much memory do we have left?" query itself than trying to allocate a huge buffer? i.e. the decorator would have to be:

import functools
import pytest

def needs_lots_of_memory(x):
    """Skip the decorated test unless x bytes can actually be allocated."""
    def wrapper(f):
        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            try:
                bytes(x)  # probe allocation; freed immediately
            except MemoryError:
                pytest.skip(f"needs {x:,} bytes of memory")
            return f(*args, **kwargs)
        return wrapped
    return wrapper
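
[Editor's note: hypothetical usage on the overflow test from earlier in the thread; the 1 << 34 threshold is illustrative.]

import ujson

@needs_lots_of_memory(1 << 34)
def test_decode_decimal_no_int_overflow():
    ujson.decode(r'[0.123456789,"{}"]'.format("a" * (2**32 - 5)))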

@hugovk (Member) commented Oct 12, 2022

Oh look at that!

Perhaps adding a dependency on psutil and using psutil.virtual_memory()?

https://psutil.readthedocs.io/en/latest/#psutil.virtual_memory

I'm thinking we don't need to add a test if it's too complicated, and maybe just add a check_x script if anything.
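
[Editor's note: a minimal sketch of that approach, assuming psutil were added as a test dependency. The decorator name needs_available_memory is made up; psutil.virtual_memory().available is the documented call.]

import psutil
import pytest

def needs_available_memory(required_bytes):
    # Checked once, at collection time; skipif carries the reason into the report.
    available = psutil.virtual_memory().available
    return pytest.mark.skipif(
        available < required_bytes,
        reason=f"needs {required_bytes:,} bytes free, only {available:,} available",
    )

A test would then opt in with @needs_available_memory(1 << 34).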

@marioga (Contributor, Author) commented Oct 13, 2022

> Oh look at that!
>
> Perhaps adding a dependency on psutil and using psutil.virtual_memory()?
>
> https://psutil.readthedocs.io/en/latest/#psutil.virtual_memory
>
> I'm thinking we don't need to add a test if it's too complicated, and maybe just add a check_x script if anything.

I'd go with the check_x renaming suggestion myself. My reasoning is that even if we can run that test on a given machine, it will be way slower. Not sure it's worth significantly slowing down the test suite for what seems like a one-and-done bug.

Perhaps I can just add a comment to my change and reference the corresponding check script for future maintenance?

@bwoodsend (Collaborator) commented:

Let's go with check_x then.

> Not sure it's worth significantly slowing down the test suite for what seems like a one-and-done bug.

Strangely, I imagine that there will be more to join it soon. Most of this codebase was written over 10 years ago, when RAM sizes were generally smaller. Just git grep-ing for other usages of int which aren't either booleans or flags lands me on this line, which sure enough is another overflow bug:

import ujson

for size in [2 ** 31, 2 ** 32, 3 * 2 ** 30, 5 * 2 ** 30]:
    print("{:3,} => {:3,}".format(size, len(ujson.loads(ujson.dumps([0] * size)))))

which prints:

2,147,483,648 => 2,147,483,648
4,294,967,296 => 2,147,483,648
3,221,225,472 => 2,147,483,648
5,368,709,120 => 3,221,225,471

@marioga (Contributor, Author) commented Oct 14, 2022

I added some comments and changed the overflow test to a check. Following our previous discussion, I reserved a space inside tests/test_ujson.py for manual checks. Hope it looks better now. :-)

@bwoodsend requested a review from hugovk on October 14, 2022
@hugovk (Member) commented Oct 14, 2022

Thank you!

@hugovk merged commit 36ced86 into ultrajson:main on Oct 14, 2022
@marioga deleted the marioga_prevent_int_overflow branch on October 14, 2022