Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unchecked buffer overflows (CVE-2021-45958). #504

Closed
wants to merge 5 commits into from

Conversation

bwoodsend
Copy link
Collaborator

@bwoodsend bwoodsend commented Feb 9, 2022

Fixes #334,
fixes #501,
fixes #502,
fixes #503.

#402 also passes ok with this change but it worked before it too. I could only reproduce the issue by git checkout 2.0.2 so I vote that we close that one too.

Changes proposed in this pull request:

  • Add some extra memory resizes where needed.
  • Add some tests that used to but no longer cause memory issues.
  • Add a DEBUG compile mode to make memory issues easier to find and fix.
  • Include said DEBUG mode in CI/CD testing.

@bwoodsend
Copy link
Collaborator Author

I really hope that -D is a valid compiler flag for MSVC...

@codecov-commenter
Copy link

codecov-commenter commented Feb 9, 2022

Codecov Report

Merging #504 (6e7eeab) into main (ca1fd23) will increase coverage by 0.06%.
The diff coverage is 96.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #504      +/-   ##
==========================================
+ Coverage   88.96%   89.03%   +0.06%     
==========================================
  Files           6        6              
  Lines        1685     1723      +38     
==========================================
+ Hits         1499     1534      +35     
- Misses        186      189       +3     
Impacted Files Coverage Δ
lib/ultrajsonenc.c 82.29% <92.30%> (+0.84%) ⬆️
tests/test_ujson.py 99.56% <100.00%> (+0.01%) ⬆️
python/ujson.c 72.72% <0.00%> (-5.66%) ⬇️
python/JSONtoObj.c 86.41% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ca1fd23...6e7eeab. Read the comment docs.

@bwoodsend
Copy link
Collaborator Author

Ahh phew. I've been spared having to dig around MS docs looking for the MSVC equivalent.

@bwoodsend bwoodsend marked this pull request as ready for review February 9, 2022 20:51
@bwoodsend bwoodsend requested a review from hugovk February 9, 2022 20:52
@bwoodsend
Copy link
Collaborator Author

@JustAnotherArchivist Now's your chance to see if you can dig out any more segfaults.

@JustAnotherArchivist
Copy link
Collaborator

Wonderful, thanks for this!

So I first looked at why #503 no longer happens. And to be honest, the reason for that seems ... not very great. Here's why. For every escaped string, there are now two buffer reservation calls. One is made before the quote is written to the buffer, the other after, but the length calculation accounts for the starting quote both times. This means that the reservation call inside Buffer_EscapeStringValidated effectively ensures that one extra byte after the string's ending quote is allocated. Therefore, the comma or closing bracket for the array or object cannot overrun the buffer. Relying on this seems insecure and prone to break with future changes to me – it does work though...

... unless you disable JSON_NO_EXTRA_WHITESPACE, because then a space is written after the comma following every item, and that space can overrun the buffer. For example, this segfaults when JSON_NO_EXTRA_WHITESPACE is not defined: python3 -c 'import ujson; ujson.dumps(["aa", "\x00" * 10921, "b"])'

Indentation with lists can still overrun the buffer even in the normal config though: python3 -c 'import ujson; ujson.dumps(["aaaaa", "\x00" * 10920, "b"*10], indent=1)'
The comma after the NUL string uses up the buffer, but then the newline for the indentation (not the indentation itself) is added without a further buffer reservation, so it overruns the buffer. In my quick tests, the element after it had to be at least a string of length 10 to trigger a SIGSEGV, although the buffer is definitely overrun regardless.

The code mentions a worst-case representation of doubles in a few places, and I think that could also be a potential way to trigger a segfault via lists/dicts. However, I'm not sure how to actually get such a double to play around with it. I tried a few values that I thought should have very long string representations but couldn't get anywhere near 256 bytes. To further my confusion, the buf variable in Buffer_AppendDoubleDconv is actually only 128 bytes, not 256...? I quickly glanced at the dconv code, but that's quite complex, so I didn't look into it further. I'd appreciate some insight into that to make sure that part's robust as well.

Another problematic scenario that comes to mind is that the indentation level as well as the indentation depth are defined as int, which is implementation-defined and may be as small as a signed 16-bit integer. While that's likely a non-issue for the actual values (who'd have 2^15 or more spaces per level or levels of indentation‽), the product of the two in the calculation of the buffer reservation could overflow in realistic scenarios, which would then let the indentation writes overrun the buffer. (This may be a minor concern because it'd require very deeply nested structures, large indentation, and an odd/old machine/compiler that doesn't have a 32-bit int as has of course been common for many years now.)

lib/ultrajsonenc.c Outdated Show resolved Hide resolved
Comment on lines +869 to +874
def test_dump_huge_indent():
ujson.encode({"a": True}, indent=65539)


def test_dump_long_string():
ujson.dumps(["aaaa", "\x00" * 10921])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably a good idea to test broader than this, like light fuzz tests with slightly varying indentation depths and string lengths. That way, future changes that shift these things in minor ways (e.g. spaces after key and item separators by default) wouldn't render these tests immediately irrelevant.

@bwoodsend bwoodsend marked this pull request as draft February 10, 2022 08:39
@bwoodsend
Copy link
Collaborator Author

The code mentions a worst-case representation of doubles in a few places, and I think that could also be a potential way to trigger a segfault via lists/dicts. However, I'm not sure how to actually get such a double to play around with it. I tried a few values that I thought should have very long string representations but couldn't get anywhere near 256 bytes.

Likewise, the longest double I can produce is 23 bytes. If I try using infinite precision types like decimal.Decimal() it gets converted to float and truncated to something shorter.

>>> ujson.dumps(-1.000000000000001e100)
'-1.000000000000001e+100'

unless you disable JSON_NO_EXTRA_WHITESPACE, because then a space is written after the comma following every item

@hugovk How sentimental are you feeling about that JSON_NO_EXTRA_WHITESPACE mode? It would be easier just to remove it than have another permutation to have to test.

Indentation with lists can still overrun the buffer even in the normal config though: python3 -c 'import ujson; ujson.dumps(["aaaaa", "\x00" * 10920, "b"*10], indent=1)'
The comma after the NUL string uses up the buffer, but then the newline for the indentation (not the indentation itself) is added without a further buffer reservation, so it overruns the buffer. In my quick tests, the element after it had to be at least a string of length 10 to trigger a SIGSEGV, although the buffer is definitely overrun regardless.

Grr, back to the drawing board then...

@hugovk
Copy link
Member

hugovk commented Feb 10, 2022

@hugovk How sentimental are you feeling about that JSON_NO_EXTRA_WHITESPACE mode? It would be easier just to remove it than have another permutation to have to test.

Not sentimental at all. Are you proposing removing the flag and keeping the current default? For example:

-#ifdef JSON_NO_EXTRA_WHITESPACE
     if (enc->indent)
     {
       Buffer_AppendCharUnchecked (enc, ' ');
     }
-#else
-    Buffer_AppendCharUnchecked (enc, ' ');
-#endif

@hugovk
Copy link
Member

hugovk commented Feb 10, 2022

In general, things to consider:

  • Parity with stdlib json: well no compile flags there
  • Don't want to break lots of people's code:
  • Security can be more important than features

When removing things, it's always good to raise deprecation warnings in a release (or several) first, then remove with a major bump. Ideally deprecations should be around for some time, but for a security issue we can do it quickly.

@JustAnotherArchivist
Copy link
Collaborator

I agree with removing JSON_NO_EXTRA_WHITESPACE entirely as well. It's not even a simple compile flag as you can't set CFLAGS or similar but have to unset it in the header file. It's also undocumented as far as I can tell, so apart from a handful of people that dug around in the code, it's probably not even used much. And personally, I don't even bother with deprecation warnings for undocumented features on my own projects; if it's undocumented, it's subject to unannounced change at any time in my eyes, like any internal code. Also, I discovered these segfaults while looking into implementing #283, and the flag would be all but obsolete and an annoying complication with that.

@bwoodsend
Copy link
Collaborator Author

Eww not black! I hate black.

@bwoodsend bwoodsend marked this pull request as ready for review February 12, 2022 10:28
@bwoodsend
Copy link
Collaborator Author

I've introduced a fuzz test but I haven't made it part of the test suite or CI. Any preference as to what I do with it? It crunches all permutations of about 150 random objects a second on my PC so you can get quite good coverage by giving it say a minute on CI? When it has failed it normally fails within the first few objects anyway.

@JustAnotherArchivist
Copy link
Collaborator

I can't think of a way to overrun the buffer now – apart from the potential indent * level overflow on odd platforms and the potential unknown worst-case double representation mentioned above.

I still don't like that the code relies on the Buffer_Reserve call from Buffer_EscapeStringValidated or Buffer_EscapeStringUnvalidated reserving more buffer than the string encoding requires though. This can easily lead to similar issues in the future. Personally, I'd rather not have reserve calls in those functions at all (encode already reserves buffer before calling them) and then add an explicit reservation for the item separator and indentation newline. Such an explicit reservation would be needed for the separators implementation anyway since the separator will have a variable length.

Add a few extra memory reserve calls to account for the extra space that
indentation needs.

These kinds of memory issues are hard to spot because the buffer is resized in
powers of 2 meaning that a miscalculation would only show any symptoms if the
required buffer size is estimated to be just below a 2 power but is actually
just above. Add a debug mode which replaces the 2 power scheme with reserving
only the memory explicitly requested and adds some overflow checks.
@bwoodsend
Copy link
Collaborator Author

still don't like that the code relies on the Buffer_Reserve call from Buffer_EscapeStringValidated or Buffer_EscapeStringUnvalidated reserving more buffer than the string encoding requires though.

I take it that you're referring to lines like this where we deliberately overestimate the size of the string to include the comma (or possibly the newline) that will come after it? Yeah, it's not pretty. I notice now that a couple of them are redundant but honestly I think that strengthening the test suite will keep the ugliness from becoming regressions.

@JustAnotherArchivist
Copy link
Collaborator

Yes, that's what I'm referring to. And yeah, also some of the other reservations are bigger than needed (like the + 4 here where + 2 would be enough).
It's safe to overestimate, sure, but it can also lead to confusion in the future and less maintainable code, which can then turn into buffer overflows or other security issues again. And there's not really a good reason to overestimate, is there? If the logic is solid and it's obvious from the code why the reservations are always sufficient, that's exactly as safe and much clearer for anyone not deeply familiar with the codebase.

@bwoodsend
Copy link
Collaborator Author

If I reduce that +4 to +3 then tests/test_ujson.py::test_issue_334[4] overflows. I guess that probably implies that there should be another reserve call elsewhere instead of overestimating that one.

Copy link
Member

@hugovk hugovk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've introduced a fuzz test but I haven't made it part of the test suite or CI. Any preference as to what I do with it? It crunches all permutations of about 150 random objects a second on my PC so you can get quite good coverage by giving it say a minute on CI? When it has failed it normally fails within the first few objects anyway.

Sounds good, let's put it on the CI in its own workflow. A single ubuntu-latest and 3.10 should be enough?

tests/fuzz.py Outdated Show resolved Hide resolved
tests/fuzz.py Outdated Show resolved Hide resolved
Unsetting it can lead to seg-faults. I don't think it's worth having to fix and
then test this undocumented permutation.
@JustAnotherArchivist
Copy link
Collaborator

If I reduce that +4 to +3 then tests/test_ujson.py::test_issue_334[4] overflows. I guess that probably implies that there should be another reserve call elsewhere instead of overestimating that one.

Oh, I think I see why now. That +4 also covers a comma + newline on an outer array/object, like the one on strings did previously. So if you have input like [[..., "a"], "b"] with indentation, the +4 covers the newline after "a", the closing bracket for the inner array, the comma in the outer array, and the newline after it. I still think the only reasonable way of properly fixing this is an explicit Buffer_Reserve(enc, 2) reservation before appending the comma and newline (and just +2 at the end of arrays and objects).

By the way, the removal of the +1 on the inner string reservations again causes a buffer overrun on ujson.dumps(["aaaaa", "\x00" * 10920, ""], indent=1). (As before, I have to use "b"*10 instead of an empty string to get an actual segfault.) This would be fixed automatically with the explicit comma/newline reservation above, as the reservations from the string functions wouldn't be needed at all then.

default=(0, 1),
action=ListOption,
help="Sets the ensure_ascii option to ujson.dumps(). "
"May be 0 or 1 or 0,1 to testboth.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"May be 0 or 1 or 0,1 to testboth.",
"May be 0 or 1 or 0,1 to test both.",

@bwoodsend
Copy link
Collaborator Author

By the way, the removal of the +1 on the inner string reservations again causes a buffer overrun on ujson.dumps(["aaaaa", "\x00" * 10920, ""], indent=1). (As before, I have to use "b"*10 instead of an empty string to get an actual segfault.) This would be fixed automatically with the explicit comma/newline reservation above, as the reservations from the string functions wouldn't be needed at all then.

Ughh, I hate C memory

python -m pip install -U pip
python -m pip install .
env:
CFLAGS: '-DDEBUG'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing we might want to do in testing this is to change the initial size of the buffer in objToJSON to something much, much smaller. This will slow it down due to the repeated resizing of the buffer, but means we'll spend much more time near the limit of the buffer.

import ujson


class FuzzGenerator:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to implement this all ourselves rather than using hypothesis?

@JustAnotherArchivist
Copy link
Collaborator

Yeah, this PR doesn't fix the overflows yet, cf. my last comment.

@bwoodsend
Copy link
Collaborator Author

@JustAnotherArchivist Would you be willing to take over this one? I lack both the time and the knowledge of C to do this effectively.

@JustAnotherArchivist
Copy link
Collaborator

@bwoodsend Yeah, sure. Plenty of the changes here are fine, e.g. the testing and the extra whitespace removal. Shall I just start from your branch, make my changes in separate commits, and create a PR with everything in the end?

@bwoodsend
Copy link
Collaborator Author

Thanks. Just do whatever's easiest (which probably is to branch off my branch as you say). If my commit authorship gets clobbered in the process then that's fine.

@CarolynWebster
Copy link

Really appreciate all the work being put in here! All of our builds are currently blocked due to the security warning, and would love to avoid replacing ujson if we don't have to. Is there a timeline on this fix?

@JustAnotherArchivist
Copy link
Collaborator

I can't provide an ETA, though I hope to finish it soon. I have made the necessary changes locally, but verification that it is safe is still ongoing.

@JustAnotherArchivist
Copy link
Collaborator

Replacement PR is now up: #519

It includes all commits from this PR (although not exactly due to rebasing onto main).

@bwoodsend bwoodsend closed this Apr 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog: Fixed For any bug fixes
Projects
None yet
7 participants