
A crash issue happens during fuzzing test #537

Closed
baltsers opened this issue May 12, 2022 · 6 comments · Fixed by #530
Labels
bug Something isn't working

Comments


baltsers commented May 12, 2022

What did you do?

We ran a fuzzing test on UltraJSON, and it caused a crash.

What did you expect to happen?

Python should not crash on any input.

What actually happened?

Segmentation fault.

Starting program: /root/anaconda3/bin/python ujson_poc.py input
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
PyBytes_Size () at /tmp/build/80754af9/python-split_1631797238431/work/Objects/bytesobject.c:1199
1199    /tmp/build/80754af9/python-split_1631797238431/work/Objects/bytesobject.c: No such file or directory.
(gdb) bt
#0  PyBytes_Size () at /tmp/build/80754af9/python-split_1631797238431/work/Objects/bytesobject.c:1199
#1  0x00007ffff7e25e99 in ?? () from /root/anaconda3/lib/python3.9/site-packages/ujson.cpython-39-x86_64-linux-gnu.so
#2  0x00007ffff7e24b94 in ?? () from /root/anaconda3/lib/python3.9/site-packages/ujson.cpython-39-x86_64-linux-gnu.so
#3  0x00007ffff7e253cf in JSON_EncodeObject () from /root/anaconda3/lib/python3.9/site-packages/ujson.cpython-39-x86_64-linux-gnu.so
#4  0x00007ffff7e26e93 in objToJSON () from /root/anaconda3/lib/python3.9/site-packages/ujson.cpython-39-x86_64-linux-gnu.so
#5  0x00005555556c8714 in cfunction_call () at /tmp/build/80754af9/python-split_1631797238431/work/Objects/methodobject.c:543
#6  0x00005555556989ef in _PyObject_MakeTpCall () at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:191
#7  0x0000555555722d89 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff7ed5758, callable=<optimized out>, tstate=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:116
#8  PyObject_Vectorcall () at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:127
#9  call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x555555914680)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:5075
#10 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:3487
#11 0x00005555556d68e2 in _PyEval_EvalFrame () at /tmp/build/80754af9/python-split_1631797238431/work/Include/internal/pycore_ceval.h:40
#12 _PyEval_EvalCode () at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:4327
#13 0x0000555555788bac in _PyEval_EvalCodeWithName (qualname=0x0, name=0x0, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwstep=2, kwcount=0, 
    kwargs=<optimized out>, kwnames=<optimized out>, argcount=<optimized out>, args=<optimized out>, locals=<optimized out>, globals=<optimized out>, 
    _co=<optimized out>) at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:4359
#14 PyEval_EvalCodeEx () at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:4375
#15 0x00005555556d79eb in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:826
#16 0x0000555555788c5b in run_eval_code_obj () at /tmp/build/80754af9/python-split_1631797238431/work/Python/pythonrun.c:1219
#17 0x00005555557bc705 in run_mod () at /tmp/build/80754af9/python-split_1631797238431/work/Python/pythonrun.c:1240
#18 0x000055555566160d in pyrun_file (fp=0x555555976b20, filename=0x7ffff6c7dc00, start=<optimized out>, globals=0x7ffff7f01f80, locals=0x7ffff7f01f80, closeit=1, 
    flags=0x7fffffffdfa8) at /tmp/build/80754af9/python-split_1631797238431/work/Python/pythonrun.c:1138
#19 0x00005555557c149f in pyrun_simple_file (flags=0x7fffffffdfa8, closeit=1, filename=0x7ffff6c7dc00, fp=0x555555976b20)
    at /tmp/build/80754af9/python-split_1631797238431/work/Python/pythonrun.c:449
#20 PyRun_SimpleFileExFlags () at /tmp/build/80754af9/python-split_1631797238431/work/Python/pythonrun.c:482
#21 0x00005555557c1c7f in pymain_run_file (cf=0x7fffffffdfa8, config=0x555555912f90) at /tmp/build/80754af9/python-split_1631797238431/work/Modules/main.c:379
#22 pymain_run_python (exitcode=0x7fffffffdfa0) at /tmp/build/80754af9/python-split_1631797238431/work/Modules/main.c:604
#23 Py_RunMain () at /tmp/build/80754af9/python-split_1631797238431/work/Modules/main.c:683
#24 0x00005555557c1d79 in Py_BytesMain () at /tmp/build/80754af9/python-split_1631797238431/work/Modules/main.c:1129
#25 0x00007ffff703fbf7 in __libc_start_main (main=0x555555669d80 <main>, argc=3, argv=0x7fffffffe198, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffe188) at ../csu/libc-start.c:310
#26 0x0000555555746bc3 in _start ()

What versions are you using?

  • OS: Ubuntu 18.04
  • Python: 3.9
  • UltraJSON: latest (the main branch)

Please include code that reproduces the issue.
Code and Input,

@hugovk added the bug label May 12, 2022
@JustAnotherArchivist
Collaborator

Equivalent reproducer as a single line:

ujson.dumps({'u26¶1\udddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd\x99000125d2522': 4})

I narrowed it down to this:

ujson.dumps({'\udddd': 1})

Dumping the string on its own, in a list, or as a dict value raises a surrogate-related UnicodeEncodeError. This suggests that the error isn't properly propagated when encoding dict keys.
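For comparison, the standard library's json module (whose behaviour #530 adopts, per the commit message below) serialises lone surrogates in both keys and values, escaping them as \uXXXX with the default ensure_ascii=True. A quick check:

```python
import json

# A lone surrogate as a dict key: the stdlib escapes it instead of crashing.
print(json.dumps({'\udddd': 1}))          # {"\udddd": 1}

# It also round-trips back to the original surrogate character.
print(json.loads(json.dumps({'\udddd': 1})) == {'\udddd': 1})  # True
```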

@JustAnotherArchivist
Collaborator

I think this happens because the return value of PyUnicode_AsUTF8String in Dict_iterNext is never checked when the key is converted to a C string. The call returns NULL because the surrogate can't be encoded, and later, in Dict_iterGetName, this leads to a NULL pointer dereference. The issue is effectively fixed by #530, since that allows surrogates in strings. However, that PR does not add error checks, so other scenarios could still trigger the same crash.

@JustAnotherArchivist
Collaborator

I looked a bit through the CPython code, and apart from running out of memory or similar generic things, I can't think of a way to trigger similar errors with #530 included. Neither private-use nor unallocated Unicode codepoints cause any error. So while this is easy enough to fix, I'm not sure how we can test that it's actually fixed after the inclusion of that PR. In the bigger picture of the entire library, that could be a pretty significant problem I think. If we can't find specific ways of triggering errors in every Python C API function call (except the ones guaranteed to always work, like PyUnicode_Check), perhaps it might be worth thinking about a fuzz tester that wraps those calls and randomly pretends they failed. Or maybe something like that already exists? Although I couldn't find anything in a quick search.

@bwoodsend
Collaborator

Just to confirm: you want an `if PyUnicode_AsUTF8String() returned NULL: abort` check in there, but once #530 is merged, we'll have no way to test that branch?

I suppose if you were really determined to get that test coverage you could do something ugly like (note I haven't tested this at all):

#include <stdbool.h>

#ifndef DEBUG
#define PyUnicode_AsUTF8String_ PyUnicode_AsUTF8String
#else

/* Set from the test suite to make the wrapper pretend the call failed. */
bool sabotage_PyUnicode_AsUTF8String = false;

PyObject *PyUnicode_AsUTF8String_(PyObject *unicode)
{
  if (sabotage_PyUnicode_AsUTF8String) {
    return NULL;
  }
  return PyUnicode_AsUTF8String(unicode);
}
#endif

Then replace PyUnicode_AsUTF8String() with PyUnicode_AsUTF8String_() throughout and have a test which temporarily sets sabotage_PyUnicode_AsUTF8String to true.

@JustAnotherArchivist
Collaborator

Yes, that's correct. At least no easy way to cause an error specifically on that call as far as I can tell. Causing errors randomly would probably be possible with ulimit to simulate running out of memory. But controlling that to a specific PyUnicode_AsUTF8String (or PyUnicode_AsEncodedString after #530) call would be ... challenging and likely prone to break.

That kind of code is more or less what I was thinking, yeah, albeit not just for this particular function; there are other calls that have the same general problem (cf. #505 (comment)). Also, as you wrote it, it probably wouldn't suffice. If the test code can only control the sabotage wholesale from Python, a call to e.g. ujson.dumps will be sabotaged entirely. So it would still be impossible to test an encoding error on a dict value because the key encoding would already fail. That's why I'm thinking more in the direction of fuzzing.

@JustAnotherArchivist
Collaborator

I looked into the C API error fuzzing idea over the past week and also asked in the Python community IRC about it. I couldn't find anything like it, nor could anyone point me to something, and I think many C extension projects could benefit from a generic tool for that; I'm sure there are numerous segfaults lurking out there. I have a PoC, but actually implementing it and getting it to a somewhat usable state will take a while.

I think we can consider this issue fixed by #530, though, since it resolves the problem of surrogates in dict keys causing a segfault.

JustAnotherArchivist added a commit to JustAnotherArchivist/ultrajson that referenced this issue May 30, 2022
This allows surrogates anywhere in the input, compatible with the json module from the standard library.

This also refactors two interfaces:
- The `PyUnicode` to `char*` conversion is moved into its own function, separated from the `JSONTypeContext` handling, so it can be reused for other things in the future (e.g. indentation and separators) which don't have a type context.
- Converting the `char*` output to a Python string with surrogates intact requires the string length for `PyUnicode_Decode` & Co. While `strlen` could be used, the length is already known inside the encoder, so the encoder function now also takes an extra `size_t` pointer argument to return that and no longer NUL-terminates the string. This also permits output that contains NUL bytes (even though that would be invalid JSON), e.g. if an object's `__json__` method return value were to contain them.

Fixes ultrajson#156
Fixes ultrajson#447
Fixes ultrajson#537
Supersedes ultrajson#284