Fix unchecked buffer overflows (CVE-2021-45958). #504

bwoodsend · 2022-02-09T20:45:14Z

Fixes #334,
fixes #501,
fixes #502,
fixes #503.

#402 also passes ok with this change but it worked before it too. I could only reproduce the issue by git checkout 2.0.2 so I vote that we close that one too.

Changes proposed in this pull request:

Add some extra memory resizes where needed.
Add some tests that used to but no longer cause memory issues.
Add a DEBUG compile mode to make memory issues easier to find and fix.
Include said DEBUG mode in CI/CD testing.

bwoodsend · 2022-02-09T20:45:41Z

I really hope that -D is a valid compiler flag for MSVC...

codecov-commenter · 2022-02-09T20:46:27Z

Codecov Report

Merging #504 (6e7eeab) into main (ca1fd23) will increase coverage by 0.06%.
The diff coverage is 96.55%.

@@            Coverage Diff             @@
##             main     #504      +/-   ##
==========================================
+ Coverage   88.96%   89.03%   +0.06%     
==========================================
  Files           6        6              
  Lines        1685     1723      +38     
==========================================
+ Hits         1499     1534      +35     
- Misses        186      189       +3

Impacted Files	Coverage Δ
lib/ultrajsonenc.c	`82.29% <92.30%> (+0.84%)`	⬆️
tests/test_ujson.py	`99.56% <100.00%> (+0.01%)`	⬆️
python/ujson.c	`72.72% <0.00%> (-5.66%)`	⬇️
python/JSONtoObj.c	`86.41% <0.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ca1fd23...6e7eeab. Read the comment docs.

bwoodsend · 2022-02-09T20:51:20Z

Ahh phew. I've been spared having to dig around MS docs looking for the MSVC equivalent.

bwoodsend · 2022-02-09T20:54:07Z

@JustAnotherArchivist Now's your chance to see if you can dig out any more segfaults.

JustAnotherArchivist · 2022-02-10T01:48:48Z

Wonderful, thanks for this!

So I first looked at why #503 no longer happens. And to be honest, the reason for that seems ... not very great. Here's why. For every escaped string, there are now two buffer reservation calls. One is made before the quote is written to the buffer, the other after, but the length calculation accounts for the starting quote both times. This means that the reservation call inside Buffer_EscapeStringValidated effectively ensures that one extra byte after the string's ending quote is allocated. Therefore, the comma or closing bracket for the array or object cannot overrun the buffer. Relying on this seems insecure and prone to break with future changes to me – it does work though...

... unless you disable JSON_NO_EXTRA_WHITESPACE, because then a space is written after the comma following every item, and that space can overrun the buffer. For example, this segfaults when JSON_NO_EXTRA_WHITESPACE is not defined: python3 -c 'import ujson; ujson.dumps(["aa", "\x00" * 10921, "b"])'

Indentation with lists can still overrun the buffer even in the normal config though: python3 -c 'import ujson; ujson.dumps(["aaaaa", "\x00" * 10920, "b"*10], indent=1)'
The comma after the NUL string uses up the buffer, but then the newline for the indentation (not the indentation itself) is added without a further buffer reservation, so it overruns the buffer. In my quick tests, the element after it had to be at least a string of length 10 to trigger a SIGSEGV, although the buffer is definitely overrun regardless.

The code mentions a worst-case representation of doubles in a few places, and I think that could also be a potential way to trigger a segfault via lists/dicts. However, I'm not sure how to actually get such a double to play around with it. I tried a few values that I thought should have very long string representations but couldn't get anywhere near 256 bytes. To further my confusion, the buf variable in Buffer_AppendDoubleDconv is actually only 128 bytes, not 256...? I quickly glanced at the dconv code, but that's quite complex, so I didn't look into it further. I'd appreciate some insight into that to make sure that part's robust as well.

Another problematic scenario that comes to mind is that the indentation level as well as the indentation depth are defined as int, which is implementation-defined and may be as small as a signed 16-bit integer. While that's likely a non-issue for the actual values (who'd have 2^15 or more spaces per level or levels of indentation‽), the product of the two in the calculation of the buffer reservation could overflow in realistic scenarios, which would then let the indentation writes overrun the buffer. (This may be a minor concern because it'd require very deeply nested structures, large indentation, and an odd/old machine/compiler that doesn't have a 32-bit int as has of course been common for many years now.)

lib/ultrajsonenc.c

JustAnotherArchivist · 2022-02-10T01:49:42Z

tests/test_ujson.py

+def test_dump_huge_indent():
+    ujson.encode({"a": True}, indent=65539)
+
+
+def test_dump_long_string():
+    ujson.dumps(["aaaa", "\x00" * 10921])


It's probably a good idea to test broader than this, like light fuzz tests with slightly varying indentation depths and string lengths. That way, future changes that shift these things in minor ways (e.g. spaces after key and item separators by default) wouldn't render these tests immediately irrelevant.

bwoodsend · 2022-02-10T08:51:57Z

The code mentions a worst-case representation of doubles in a few places, and I think that could also be a potential way to trigger a segfault via lists/dicts. However, I'm not sure how to actually get such a double to play around with it. I tried a few values that I thought should have very long string representations but couldn't get anywhere near 256 bytes.

Likewise, the longest double I can produce is 23 bytes. If I try using infinite precision types like decimal.Decimal() it gets converted to float and truncated to something shorter.

>>> ujson.dumps(-1.000000000000001e100)
'-1.000000000000001e+100'

unless you disable JSON_NO_EXTRA_WHITESPACE, because then a space is written after the comma following every item

@hugovk How sentimental are you feeling about that JSON_NO_EXTRA_WHITESPACE mode? It would be easier just to remove it than have another permutation to have to test.

Indentation with lists can still overrun the buffer even in the normal config though: python3 -c 'import ujson; ujson.dumps(["aaaaa", "\x00" * 10920, "b"*10], indent=1)'
The comma after the NUL string uses up the buffer, but then the newline for the indentation (not the indentation itself) is added without a further buffer reservation, so it overruns the buffer. In my quick tests, the element after it had to be at least a string of length 10 to trigger a SIGSEGV, although the buffer is definitely overrun regardless.

Grr, back to the drawing board then...

hugovk · 2022-02-10T09:14:23Z

@hugovk How sentimental are you feeling about that JSON_NO_EXTRA_WHITESPACE mode? It would be easier just to remove it than have another permutation to have to test.

Not sentimental at all. Are you proposing removing the flag and keeping the current default? For example:

-#ifdef JSON_NO_EXTRA_WHITESPACE
     if (enc->indent)
     {
       Buffer_AppendCharUnchecked (enc, ' ');
     }
-#else
-    Buffer_AppendCharUnchecked (enc, ' ');
-#endif

hugovk · 2022-02-10T09:21:44Z

In general, things to consider:

Parity with stdlib json: well no compile flags there
Don't want to break lots of people's code:
- hard to judge usage
- searching this repo, beyond recent security things only JSON_NO_EXTRA_WHITESPACE error #245
- searching GitHub doesn't really bring much up
Security can be more important than features

When removing things, it's always good to raise deprecation warnings in a release (or several) first, then remove with a major bump. Ideally deprecations should be around for some time, but for a security issue we can do it quickly.

JustAnotherArchivist · 2022-02-10T12:39:56Z

I agree with removing JSON_NO_EXTRA_WHITESPACE entirely as well. It's not even a simple compile flag as you can't set CFLAGS or similar but have to unset it in the header file. It's also undocumented as far as I can tell, so apart from a handful of people that dug around in the code, it's probably not even used much. And personally, I don't even bother with deprecation warnings for undocumented features on my own projects; if it's undocumented, it's subject to unannounced change at any time in my eyes, like any internal code. Also, I discovered these segfaults while looking into implementing #283, and the flag would be all but obsolete and an annoying complication with that.

bwoodsend · 2022-02-12T09:57:43Z

Eww not black! I hate black.

bwoodsend · 2022-02-12T10:35:34Z

I've introduced a fuzz test but I haven't made it part of the test suite or CI. Any preference as to what I do with it? It crunches all permutations of about 150 random objects a second on my PC so you can get quite good coverage by giving it say a minute on CI? When it has failed it normally fails within the first few objects anyway.

JustAnotherArchivist · 2022-02-12T17:42:00Z

I can't think of a way to overrun the buffer now – apart from the potential indent * level overflow on odd platforms and the potential unknown worst-case double representation mentioned above.

I still don't like that the code relies on the Buffer_Reserve call from Buffer_EscapeStringValidated or Buffer_EscapeStringUnvalidated reserving more buffer than the string encoding requires though. This can easily lead to similar issues in the future. Personally, I'd rather not have reserve calls in those functions at all (encode already reserves buffer before calling them) and then add an explicit reservation for the item separator and indentation newline. Such an explicit reservation would be needed for the separators implementation anyway since the separator will have a variable length.

Add a few extra memory reserve calls to account for the extra space that indentation needs. These kinds of memory issues are hard to spot because the buffer is resized in powers of 2 meaning that a miscalculation would only show any symptoms if the required buffer size is estimated to be just below a 2 power but is actually just above. Add a debug mode which replaces the 2 power scheme with reserving only the memory explicitly requested and adds some overflow checks.

bwoodsend · 2022-02-12T21:58:38Z

still don't like that the code relies on the Buffer_Reserve call from Buffer_EscapeStringValidated or Buffer_EscapeStringUnvalidated reserving more buffer than the string encoding requires though.

I take it that you're referring to lines like this where we deliberately overestimate the size of the string to include the comma (or possibly the newline) that will come after it? Yeah, it's not pretty. I notice now that a couple of them are redundant but honestly I think that strengthening the test suite will keep the ugliness from becoming regressions.

JustAnotherArchivist · 2022-02-13T01:17:48Z

Yes, that's what I'm referring to. And yeah, also some of the other reservations are bigger than needed (like the + 4 here where + 2 would be enough).
It's safe to overestimate, sure, but it can also lead to confusion in the future and less maintainable code, which can then turn into buffer overflows or other security issues again. And there's not really a good reason to overestimate, is there? If the logic is solid and it's obvious from the code why the reservations are always sufficient, that's exactly as safe and much clearer for anyone not deeply familiar with the codebase.

bwoodsend · 2022-02-13T10:22:34Z

If I reduce that +4 to +3 then tests/test_ujson.py::test_issue_334[4] overflows. I guess that probably implies that there should be another reserve call elsewhere instead of overestimating that one.

hugovk

I've introduced a fuzz test but I haven't made it part of the test suite or CI. Any preference as to what I do with it? It crunches all permutations of about 150 random objects a second on my PC so you can get quite good coverage by giving it say a minute on CI? When it has failed it normally fails within the first few objects anyway.

Sounds good, let's put it on the CI in its own workflow. A single ubuntu-latest and 3.10 should be enough?

tests/fuzz.py

Unsetting it can lead to seg-faults. I don't think it's worth having to fix and then test this undocumented permutation.

JustAnotherArchivist · 2022-02-13T17:24:47Z

If I reduce that +4 to +3 then tests/test_ujson.py::test_issue_334[4] overflows. I guess that probably implies that there should be another reserve call elsewhere instead of overestimating that one.

Oh, I think I see why now. That +4 also covers a comma + newline on an outer array/object, like the one on strings did previously. So if you have input like [[..., "a"], "b"] with indentation, the +4 covers the newline after "a", the closing bracket for the inner array, the comma in the outer array, and the newline after it. I still think the only reasonable way of properly fixing this is an explicit Buffer_Reserve(enc, 2) reservation before appending the comma and newline (and just +2 at the end of arrays and objects).

By the way, the removal of the +1 on the inner string reservations again causes a buffer overrun on ujson.dumps(["aaaaa", "\x00" * 10920, ""], indent=1). (As before, I have to use "b"*10 instead of an empty string to get an actual segfault.) This would be fixed automatically with the explicit comma/newline reservation above, as the reservations from the string functions wouldn't be needed at all then.

hugovk · 2022-02-13T19:17:53Z

tests/fuzz.py

+    default=(0, 1),
+    action=ListOption,
+    help="Sets the ensure_ascii option to ujson.dumps(). "
+    "May be 0 or 1 or 0,1 to testboth.",


Suggested change

"May be 0 or 1 or 0,1 to testboth.",

"May be 0 or 1 or 0,1 to test both.",

bwoodsend · 2022-02-15T20:41:04Z

By the way, the removal of the +1 on the inner string reservations again causes a buffer overrun on ujson.dumps(["aaaaa", "\x00" * 10920, ""], indent=1). (As before, I have to use "b"*10 instead of an empty string to get an actual segfault.) This would be fixed automatically with the explicit comma/newline reservation above, as the reservations from the string functions wouldn't be needed at all then.

Ughh, I hate C memory

gsnedders · 2022-02-26T15:41:26Z

.github/workflows/fuzz.yml

+          python -m pip install -U pip
+          python -m pip install .
+        env:
+          CFLAGS: '-DDEBUG'


One thing we might want to do in testing this is to change the initial size of the buffer in objToJSON to something much, much smaller. This will slow it down due to the repeated resizing of the buffer, but means we'll spend much more time near the limit of the buffer.

gsnedders · 2022-02-26T15:46:09Z

tests/fuzz.py

+import ujson
+
+
+class FuzzGenerator:


Does it make sense to implement this all ourselves rather than using hypothesis?

konr4d · 2022-03-23T10:42:03Z

Is there anything blocking the merge?

This vulnerability ranks high on severity scale:
https://cwe.mitre.org/data/definitions/787.html
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-45958
https://www.first.org/cvss/calculator/3.1#CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
https://security.snyk.io/vuln/SNYK-PYTHON-UJSON-2359034

JustAnotherArchivist · 2022-03-23T15:47:27Z

Yeah, this PR doesn't fix the overflows yet, cf. my last comment.

bwoodsend · 2022-03-23T18:58:36Z

@JustAnotherArchivist Would you be willing to take over this one? I lack both the time and the knowledge of C to do this effectively.

JustAnotherArchivist · 2022-03-23T21:10:22Z

@bwoodsend Yeah, sure. Plenty of the changes here are fine, e.g. the testing and the extra whitespace removal. Shall I just start from your branch, make my changes in separate commits, and create a PR with everything in the end?

bwoodsend · 2022-03-23T22:08:40Z

Thanks. Just do whatever's easiest (which probably is to branch off my branch as you say). If my commit authorship gets clobbered in the process then that's fine.

CarolynWebster · 2022-04-01T16:46:44Z

Really appreciate all the work being put in here! All of our builds are currently blocked due to the security warning, and would love to avoid replacing ujson if we don't have to. Is there a timeline on this fix?

JustAnotherArchivist · 2022-04-01T18:56:32Z

I can't provide an ETA, though I hope to finish it soon. I have made the necessary changes locally, but verification that it is safe is still ongoing.

JustAnotherArchivist · 2022-04-05T02:56:09Z

Replacement PR is now up: #519

It includes all commits from this PR (although not exactly due to rebasing onto main).

bwoodsend marked this pull request as ready for review February 9, 2022 20:51

bwoodsend requested a review from hugovk February 9, 2022 20:52

JustAnotherArchivist reviewed Feb 10, 2022

View reviewed changes

bwoodsend marked this pull request as draft February 10, 2022 08:39

bwoodsend force-pushed the fix-buffer-overflows branch from 8450986 to f0e3aec Compare February 12, 2022 10:23

bwoodsend marked this pull request as ready for review February 12, 2022 10:28

JustAnotherArchivist mentioned this pull request Feb 12, 2022

Fix exceptions on encoding list or dict elements and non-overflow errors on int handling getting silenced #505

Merged

bwoodsend force-pushed the fix-buffer-overflows branch from f0e3aec to 86e5fa7 Compare February 12, 2022 21:58

hugovk reviewed Feb 13, 2022

View reviewed changes

tests/fuzz.py Outdated Show resolved Hide resolved

tests/fuzz.py Outdated Show resolved Hide resolved

bwoodsend added 3 commits February 13, 2022 14:02

Add a fuzzing test to search for segfaults in encoding.

9df3bc2

Remove the hidden JSON_NO_EXTRA_WHITESPACE compile knob.

e885025

Unsetting it can lead to seg-faults. I don't think it's worth having to fix and then test this undocumented permutation.

Fix some more seg-faults on encoding.

0f4e793

bwoodsend force-pushed the fix-buffer-overflows branch from 86e5fa7 to 3b31a77 Compare February 13, 2022 14:06

Add fuzz test to CI/CD.

6e7eeab

bwoodsend force-pushed the fix-buffer-overflows branch from 3b31a77 to 6e7eeab Compare February 13, 2022 14:09

hugovk reviewed Feb 13, 2022

View reviewed changes

hugovk added the changelog: Fixed For any bug fixes label Feb 19, 2022

cl0ne mentioned this pull request Feb 19, 2022

Bump ujson from 4.0.2 to 5.1.0 cl0ne/cryptopotato-bot#1

Closed

toolen mentioned this pull request Feb 21, 2022

Bump ujson from 4.0.2 to 5.1.0 bobuk/ubase#1

Open

gsnedders reviewed Feb 26, 2022

View reviewed changes

This was referenced Apr 5, 2022

Allow indent to be a str or None for compatibility with Python's json #517

Open

Fix buffer overflows (CVE-2021-45958) #519

Merged

bwoodsend closed this Apr 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix unchecked buffer overflows (CVE-2021-45958). #504

Fix unchecked buffer overflows (CVE-2021-45958). #504

bwoodsend commented Feb 9, 2022 •

edited by hugovk

bwoodsend commented Feb 9, 2022

codecov-commenter commented Feb 9, 2022 •

edited

bwoodsend commented Feb 9, 2022

bwoodsend commented Feb 9, 2022

JustAnotherArchivist commented Feb 10, 2022

JustAnotherArchivist Feb 10, 2022

bwoodsend commented Feb 10, 2022

hugovk commented Feb 10, 2022

hugovk commented Feb 10, 2022

JustAnotherArchivist commented Feb 10, 2022

bwoodsend commented Feb 12, 2022

bwoodsend commented Feb 12, 2022

JustAnotherArchivist commented Feb 12, 2022

bwoodsend commented Feb 12, 2022

JustAnotherArchivist commented Feb 13, 2022

bwoodsend commented Feb 13, 2022

hugovk left a comment

JustAnotherArchivist commented Feb 13, 2022

hugovk Feb 13, 2022

bwoodsend commented Feb 15, 2022

gsnedders Feb 26, 2022

gsnedders Feb 26, 2022

konr4d commented Mar 23, 2022

JustAnotherArchivist commented Mar 23, 2022

bwoodsend commented Mar 23, 2022

JustAnotherArchivist commented Mar 23, 2022

bwoodsend commented Mar 23, 2022

CarolynWebster commented Apr 1, 2022

JustAnotherArchivist commented Apr 1, 2022

JustAnotherArchivist commented Apr 5, 2022

	"May be 0 or 1 or 0,1 to testboth.",
	"May be 0 or 1 or 0,1 to test both.",

		import ujson


		class FuzzGenerator:

Fix unchecked buffer overflows (CVE-2021-45958). #504

Fix unchecked buffer overflows (CVE-2021-45958). #504

Conversation

bwoodsend commented Feb 9, 2022 • edited by hugovk

bwoodsend commented Feb 9, 2022

codecov-commenter commented Feb 9, 2022 • edited

Codecov Report

bwoodsend commented Feb 9, 2022

bwoodsend commented Feb 9, 2022

JustAnotherArchivist commented Feb 10, 2022

JustAnotherArchivist Feb 10, 2022

Choose a reason for hiding this comment

bwoodsend commented Feb 10, 2022

hugovk commented Feb 10, 2022

hugovk commented Feb 10, 2022

JustAnotherArchivist commented Feb 10, 2022

bwoodsend commented Feb 12, 2022

bwoodsend commented Feb 12, 2022

JustAnotherArchivist commented Feb 12, 2022

bwoodsend commented Feb 12, 2022

JustAnotherArchivist commented Feb 13, 2022

bwoodsend commented Feb 13, 2022

hugovk left a comment

Choose a reason for hiding this comment

JustAnotherArchivist commented Feb 13, 2022

hugovk Feb 13, 2022

Choose a reason for hiding this comment

bwoodsend commented Feb 15, 2022

gsnedders Feb 26, 2022

Choose a reason for hiding this comment

gsnedders Feb 26, 2022

Choose a reason for hiding this comment

konr4d commented Mar 23, 2022

JustAnotherArchivist commented Mar 23, 2022

bwoodsend commented Mar 23, 2022

JustAnotherArchivist commented Mar 23, 2022

bwoodsend commented Mar 23, 2022

CarolynWebster commented Apr 1, 2022

JustAnotherArchivist commented Apr 1, 2022

JustAnotherArchivist commented Apr 5, 2022

bwoodsend commented Feb 9, 2022 •

edited by hugovk

codecov-commenter commented Feb 9, 2022 •

edited