Fix len integer overflow issue #567

Merged
6 commits merged into ultrajson:main from marioga_prevent_int_overflow on Oct 14, 2022

Conversation

@marioga (Contributor) commented Oct 4, 2022

Bug:

If ds->end - ds->start overflows an int here, we may end up truncating a double while parsing it; the leftover digits then surface as an unexpected token and trigger a decoder error.

Changes proposed in this pull request:

Avoid integer overflow by setting len = min(INT_MAX, ds->end - ds->start) in decodeDouble.
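
[Editor's note: the following is a sketch of the arithmetic in Python, not the C diff itself; the variable names are illustrative. It shows what the unpatched cast of ds->end - ds->start to int does on a 64-bit build, versus the proposed clamp.]

# ds->end - ds->start when ~4 GiB of input is left to parse:
remaining = 2**32 + 10

# Unpatched: the C cast (int)(ds->end - ds->start) keeps only the low 32 bits.
truncated = remaining & 0xFFFFFFFF
if truncated >= 2**31:          # reproduce the two's-complement sign wrap
    truncated -= 2**32
print(truncated)                # 10: far too small a scan budget

# Patched: clamp instead of truncating, mirroring min(INT_MAX, ...).
INT_MAX = 2**31 - 1
print(min(INT_MAX, remaining))  # 2147483647: a safe upper bound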

@codecov-commenter commented Oct 4, 2022

Codecov Report

Merging #567 (ec095e4) into main (7db453b) will decrease coverage by 0.09%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##             main     #567      +/-   ##
==========================================
- Coverage   91.58%   91.49%   -0.10%     
==========================================
  Files           6        6              
  Lines        1902     1905       +3     
==========================================
+ Hits         1742     1743       +1     
- Misses        160      162       +2     
Impacted Files        Coverage Δ
lib/ultrajsondec.c    92.79% <ø> (-0.02%) ⬇️
tests/test_ujson.py   99.10% <50.00%> (-0.36%) ⬇️


@hugovk added the "changelog: Fixed" (for any bug fixes) label on Oct 6, 2022
@bwoodsend (Collaborator) commented:

Uhm, what exactly is this trying to fix? Can you give an example where the original goes wrong but this doesn't? My guess would be something like:

ujson.loads('[...something just under INT_MAX bytes long..., 0.36543654365436543]')

but surely, just truncating the length would mean that you only parse the first few digits of the decimal?

If I'm wrong here, can you add some test cases which demonstrate what this change is doing?

@marioga (Contributor, Author) commented Oct 11, 2022

> Uhm, what exactly is this trying to fix? Can you give an example where the original goes wrong but this doesn't? My guess would be something like:
>
> ujson.loads('[...something just under INT_MAX bytes long..., 0.36543654365436543]')
>
> but surely, just truncating the length would mean that you only parse the first few digits of the decimal?
>
> If I'm wrong here, can you add some test cases which demonstrate what this change is doing?

Yup, it's something similar to what you describe. We encountered it in the wild with some large JSON files that failed to load, and we went down the rabbit hole.

Good point though, I'll add a test case.

@marioga (Contributor, Author) commented Oct 11, 2022

I have added a unit test that raises ujson.JSONDecodeError: Unexpected character found when decoding array value (2) on the main branch but decodes correctly with the current change.

It's quite slow because the string to decode needs to be ~2**32 bytes long. Please let me know if you want me to modify it in any way. Thanks!

@bwoodsend (Collaborator) commented:

=================================== FAILURES ===================================
_____________________ test_decode_decimal_no_int_overflow ______________________

    def test_decode_decimal_no_int_overflow():
        # Takes a while because the string is large; feel free to comment out or remove
>       ujson.decode(r'[0.123456789,"{}"]'.format("a" * (2**32 - 5)))
E       ujson.JSONDecodeError: Could not reserve memory block

tests/test_ujson.py:1129: JSONDecodeError

Ahh, in hindsight, the Windows and Linux machines on CI only get 4GB of RAM each, so a 4GB string is not going to happen there. I can't even run it on my 8GB home laptop. Hmm, not sure what would be best here...

Shouldn't the string be the other way around to make the overflow happen whilst parsing the decimal? i.e. r'["{}",0.123456789]'

@marioga (Contributor, Author) commented Oct 11, 2022

> =================================== FAILURES ===================================
> _____________________ test_decode_decimal_no_int_overflow ______________________
>
>     def test_decode_decimal_no_int_overflow():
>         # Takes a while because the string is large; feel free to comment out or remove
> >       ujson.decode(r'[0.123456789,"{}"]'.format("a" * (2**32 - 5)))
> E       ujson.JSONDecodeError: Could not reserve memory block
>
> tests/test_ujson.py:1129: JSONDecodeError
>
> Ahh, in hindsight, the Windows and Linux machines on CI only get 4GB of RAM each, so a 4GB string is not going to happen there. I can't even run it on my 8GB home laptop. Hmm, not sure what would be best here...
>
> Shouldn't the string be the other way around to make the overflow happen whilst parsing the decimal? i.e. r'["{}",0.123456789]'

Ahh gotcha about the RAM. I'm running it on a dev server with plenty of RAM (384GB) and can reproduce it there. Not sure what the right course of action is either. :-/

You actually want the string in the order the test has it. The issue is the cast of ds->end - ds->start to an int: you want the distance to the end of the input to be massive while decoding the decimal. In this case, (int) (ds->end - ds->start) == 10 when ds->start points to the beginning of 0.123456789, which has 11 characters. We thus end up with a partially decoded double (the trailing 9 is left dangling), and the exception follows.
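
[Editor's note: a quick back-of-the-envelope check of those numbers, following the test string above. This is editorial arithmetic, not code from the PR.]

decimal = len("0.123456789")                   # 11 characters
# Remaining input measured from the start of the decimal: the decimal,
# a comma, the quoted string of 2**32 - 5 letters (plus its two quotes),
# and the closing bracket.
remaining = decimal + 1 + (1 + (2**32 - 5) + 1) + 1
print(remaining)           # 4294967306 == 2**32 + 10
print(remaining % 2**32)   # 10: the value the 32-bit cast leaves behind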

@bwoodsend (Collaborator) commented:

Ahh, I get it now. len isn't the length of the encoded decimal, as I had assumed; it's how much unparsed JSON data we have left, which gives dconv_s2d() an upper bound on how many bytes it's allowed to scan for the end of the decimal. In which case, truncating to INT_MAX is fine, because no one is going to write a decimal containing INT_MAX digits.
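
[Editor's note: a rough Python analogy of that semantics; scan_double below is hypothetical and is not the real dconv_s2d() API. The point is that len acts as a scan budget, not a number length.]

FLOAT_CHARS = set("0123456789+-.eE")

def scan_double(buf, start, max_len):
    """Parse a float starting at buf[start], looking at most max_len bytes ahead."""
    end = start
    while end - start < max_len and end < len(buf) and buf[end] in FLOAT_CHARS:
        end += 1
    return float(buf[start:end]), end

print(scan_double('[0.123456789,"a"]', 1, 16))  # (0.123456789, 12): budget is ample
print(scan_double('[0.123456789,"a"]', 1, 10))  # (0.12345678, 11): the 9 is left dangling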

For the test I suppose we could do something like:

import pytest

def test_decode_decimal_no_int_overflow():
    try:
        # Probe: can this machine even allocate a 16 GiB buffer?
        bytes(1 << 34)
    except MemoryError:
        pytest.skip("not enough memory for a ~4 GiB test string")
    # rest of test here

Feels a bit yucky though. Any thoughts @hugovk?

@hugovk (Member) commented Oct 12, 2022

That's one option; it means they can at least be run locally on machines with lots of memory.

Another idea would be to add a skip decorator that checks there's at least a certain amount of memory available.

For Pillow, we have some memory/DoS tests which take a long time, and we named them check_x instead of test_x so they're not part of the main test suite. I doubt we run them often, but they're there: https://github.com/python-pillow/Pillow/tree/main/Tests

@bwoodsend (Collaborator) commented:

> Another idea would be to add a skip decorator that checks there's at least a certain amount of memory available.

I'm guessing that you don't know of a nicer way of doing the "how much memory do we have left?" query itself than trying to allocate a huge buffer? i.e. the decorator would have to be:

import functools
import pytest

def needs_lots_of_memory(x):
    """Skip the decorated test unless x bytes can actually be allocated."""
    def wrapper(f):
        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            try:
                bytes(x)  # probe allocation; freed immediately
            except MemoryError:
                pytest.skip(f"needs {x:,} bytes of memory")
            return f(*args, **kwargs)
        return wrapped
    return wrapper
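
[Editor's note: hypothetical usage on the overflow test from earlier in the thread; the 1 << 34 threshold is illustrative.]

import ujson

@needs_lots_of_memory(1 << 34)
def test_decode_decimal_no_int_overflow():
    ujson.decode(r'[0.123456789,"{}"]'.format("a" * (2**32 - 5)))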

@hugovk (Member) commented Oct 12, 2022

Oh look at that!

Perhaps adding a dependency on psutil and using psutil.virtual_memory()?

https://psutil.readthedocs.io/en/latest/#psutil.virtual_memory

I'm thinking we don't need to add a test if it's too complicated, and maybe just add a check_x script if anything.
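
[Editor's note: a minimal sketch of that approach, assuming psutil were added as a test dependency. The decorator name needs_available_memory is made up; psutil.virtual_memory().available is the documented call.]

import psutil
import pytest

def needs_available_memory(required_bytes):
    # Checked once, at collection time; skipif carries the reason into the report.
    available = psutil.virtual_memory().available
    return pytest.mark.skipif(
        available < required_bytes,
        reason=f"needs {required_bytes:,} bytes free, only {available:,} available",
    )

A test would then opt in with @needs_available_memory(1 << 34).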

@marioga (Contributor, Author) commented Oct 13, 2022

> Oh look at that!
>
> Perhaps adding a dependency on psutil and using psutil.virtual_memory()?
>
> https://psutil.readthedocs.io/en/latest/#psutil.virtual_memory
>
> I'm thinking we don't need to add a test if it's too complicated, and maybe just add a check_x script if anything.

I'd go with the check_x renaming suggestion myself. My reasoning is that even if we can run that test on a given machine, it will be way slower. Not sure it's worth significantly slowing down the test suite for what seems like a one-and-done bug.

Perhaps I can just add a comment to my change and reference the corresponding check script for future maintenance?

@bwoodsend (Collaborator) commented:

Let's go with check_x then.

> Not sure it's worth significantly slowing down the test suite for what seems like a one-and-done bug.

Strangely, I imagine that there will be more to join it soon. Most of this codebase was written over 10 years ago, when RAM sizes were generally smaller. Just git grep-ing for other usages of int which aren't either booleans or flags lands me on this line, which sure enough is another overflow bug:

import ujson

for size in [2 ** 31, 2 ** 32, 3 * 2 ** 30, 5 * 2 ** 30]:
    print("{:3,} => {:3,}".format(size, len(ujson.loads(ujson.dumps([0] * size)))))

which prints:

2,147,483,648 => 2,147,483,648
4,294,967,296 => 2,147,483,648
3,221,225,472 => 2,147,483,648
5,368,709,120 => 3,221,225,471

@marioga (Contributor, Author) commented Oct 14, 2022

I added some comments and changed the overflow test to a check. Following our previous discussion, I reserved a space inside tests/test_ujson.py for manual checks. Hope it looks better now. :-)

@bwoodsend requested a review from hugovk on October 14, 2022
@hugovk (Member) commented Oct 14, 2022

Thank you!

@hugovk merged commit 36ced86 into ultrajson:main on Oct 14, 2022
@marioga deleted the marioga_prevent_int_overflow branch on October 14, 2022