
Add support for arbitrary size integers #548

Merged: 6 commits, Jun 18, 2022
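A quick illustration of the behaviour this PR enables, based on the new tests in tests/test_ujson.py below (both directions previously raised; encoding big ints remains skipped on PyPy, see the linked issue in the test):

```python
import ujson

big = 2**64  # does not fit in any C integer type ujson handled before
assert ujson.encode(big) == "18446744073709551616"           # was: OverflowError
assert ujson.decode("18446744073709551616") == big           # was: JSONDecodeError
assert ujson.decode("-9223372036854775809") == -(2**63) - 1  # below INT64_MIN
```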
1 change: 1 addition & 0 deletions lib/ultrajson.h
@@ -332,6 +332,7 @@ typedef struct __JSONObjectDecoder
JSOBJ (*newInt)(void *prv, JSINT32 value);
JSOBJ (*newLong)(void *prv, JSINT64 value);
JSOBJ (*newUnsignedLong)(void *prv, JSUINT64 value);
JSOBJ (*newIntegerFromString)(void *prv, char *value, size_t length);
JSOBJ (*newDouble)(void *prv, double value);
void (*releaseObject)(void *prv, JSOBJ obj);
JSPFN_MALLOC malloc;
5 changes: 4 additions & 1 deletion lib/ultrajsondec.c
@@ -173,7 +173,10 @@ static FASTCALL_ATTR JSOBJ FASTCALL_MSVC decode_numeric (struct DecoderState *ds
 {
   if (hasError)
   {
-    return SetError(ds, -1, intNeg == 1 ? "Value is too big" : "Value is too small");
+    char *strStart = ds->start;
+    ds->lastType = JT_INT;
+    ds->start = offset;
+    return ds->dec->newIntegerFromString(ds->prv, strStart, offset - strStart);
   }
   goto BREAK_INT_LOOP;
   break;
10 changes: 10 additions & 0 deletions python/JSONtoObj.c
@@ -119,6 +119,15 @@ static JSOBJ Object_newUnsignedLong(void *prv, JSUINT64 value)
return PyLong_FromUnsignedLongLong (value);
}

static JSOBJ Object_newIntegerFromString(void *prv, char *value, size_t length)
{
// PyLong_FromString requires a NUL-terminated string in CPython, contrary to the documentation: https://github.com/python/cpython/issues/59200
char *buf = PyObject_Malloc(length + 1);
memcpy(buf, value, length);
buf[length] = '\0';
return PyLong_FromString(buf, NULL, 10);
}

Comment on lines +122 to +130
Collaborator Author:

The memcpy here is certainly not ideal. It would probably be faster to save the byte after the number, replace it with a NUL, feed the char array through PyLong_FromString, and then restore it. I think that should always be possible/safe without buffer overruns, but I haven't given it a great deal of thought yet. Do you think it's worth exploring that sort of hack?

Collaborator @bwoodsend (Jun 3, 2022):

Ooh, now that's a disgusting but tantalising thing to do. Two ways I can imagine that causing damage:

  1. The integer we're parsing is the last part of the JSON, so the next byte would be out of bounds.
  2. Someone is reading the JSON string from multiple threads and the second thread happens to look at that byte while it's temporarily nullified.

1 is moot if the string we're parsing is NUL-terminated (which I'm fairly certain it already is). 2 feels like a very nasty way to derail something.

Collaborator Author:

The relevant part of the decoder boils down to:

  1. If the argument is a string, call PyUnicode_AsUTF8String to convert it to bytes.
  2. Call PyBytes_AsString, which returns a NUL-terminated char* pointing to the internal buffer of the bytes object.
  3. Iterate over that buffer.

So there's no out-of-bounds issue. As for multithreading, yes, that could be an issue (if the user passes bytes). But on the other hand, ujson already isn't thread-safe as far as I can tell; for example, modifying a dict from another thread while encoding it should crash the encoder.

Collaborator Author:

I ran some tests on this, parsing an array equivalent to [2**64 + i for i in range(1000)], i.e. 1000 ints all using this code path. On my machine, parsing such a string averages 128 μs with memcpy and 108 μs with the suggested hack.

Compared to the other code paths for smaller ints (note that the number of digits obviously has a direct impact on throughput as well):

| input | time |
| --- | --- |
| 2**64 + i, memcpy | 128 μs |
| 2**64 + i, buffer hack | 108 μs |
| 2**62 + i (newUnsignedLong) | 60 μs |
| -2**63 + i (newLong) | 64 μs |
| 2**30 + i (newInt) | 45 μs |
| -2**31 + i (newInt) | 45 μs |

Command, for reference: python3 -m timeit -s 'import ujson; s = "[" + ",".join(str(2**64 + i) for i in range(1000)) + "]"' 'ujson.loads(s)'
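For reference, the comparison can also be scripted with the timeit module. The sketch below is an approximation of the measurements above (labels reuse the callback names from the table), not the exact harness used:

```python
# Rough reproduction of the table above; absolute numbers depend on the machine.
import timeit

import ujson

cases = {
    "2**64 + i, memcpy (newIntegerFromString)": 2**64,
    "2**62 + i (newUnsignedLong)": 2**62,
    "-2**63 + i (newLong)": -(2**63),
    "2**30 + i (newInt)": 2**30,
    "-2**31 + i (newInt)": -(2**31),
}

for label, base in cases.items():
    # 1000 consecutive integers in one JSON array, as in the command above
    s = "[" + ",".join(str(base + i) for i in range(1000)) + "]"
    best = min(timeit.repeat(lambda: ujson.loads(s), repeat=5, number=1000)) / 1000
    print(f"{label}: {best * 1e6:.0f} μs")
```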

Implementation of the hack:

static JSOBJ Object_newIntegerFromString(void *prv, char *value, size_t length)
{
  char buf = value[length];     // save the byte following the digits
  value[length] = '\0';         // NUL-terminate in place for PyLong_FromString
  PyObject *out = PyLong_FromString(value, NULL, 10);
  value[length] = buf;          // restore the original byte
  return out;
}

Collaborator:

That's a pretty small speedup, really. On my machine it's even less (115 μs vs 110 μs). Definitely not worth the hack in my mind.

Collaborator Author:

Yeah, agree. It's basically only relevant if all you ever do is parse huge integers. I don't think that's common enough to warrant the ugly hack.

static JSOBJ Object_newDouble(void *prv, double value)
{
return PyFloat_FromDouble(value);
@@ -152,6 +161,7 @@ PyObject* JSONToObj(PyObject* self, PyObject *args, PyObject *kwargs)
Object_newInteger,
Object_newLong,
Object_newUnsignedLong,
Object_newIntegerFromString,
Object_newDouble,
Object_releaseObject,
PyObject_Malloc,
21 changes: 21 additions & 0 deletions python/objToJSON.c
@@ -100,6 +100,17 @@ static void *PyLongToUINT64(JSOBJ _obj, JSONTypeContext *tc, void *outValue, siz
return NULL;
}

static void *PyLongToINTSTR(JSOBJ _obj, JSONTypeContext *tc, void *outValue, size_t *_outLen)
{
PyObject *obj = PyNumber_ToBase(_obj, 10);
if (!obj)
{
return NULL;
}
*_outLen = PyUnicode_GET_LENGTH(obj);
return PyUnicode_1BYTE_DATA(obj);
}

static void *PyFloatToDOUBLE(JSOBJ _obj, JSONTypeContext *tc, void *outValue, size_t *_outLen)
{
PyObject *obj = (PyObject *) _obj;
@@ -508,6 +519,16 @@ static void Object_beginTypeContext (JSOBJ _obj, JSONTypeContext *tc, JSONObject
exc = PyErr_Occurred();
}

if (exc && PyErr_ExceptionMatches(PyExc_OverflowError))
{
PyErr_Clear();
pc->PyTypeToJSON = PyLongToINTSTR;
tc->type = JT_RAW;
// Overwritten by PyLong_* due to the union, which would lead to a DECREF in endTypeContext.
GET_TC(tc)->rawJSONValue = NULL;
return;
}

if (exc)
{
PRINTMARK();
39 changes: 26 additions & 13 deletions tests/test_ujson.py
@@ -600,6 +600,32 @@ def test_decode_numeric_int_exp(test_input):
     assert output == json.loads(test_input)
 
 
+@pytest.mark.parametrize(
+    "i",
+    [
+        -(10**25),  # very negative
+        -(2**64),  # too large in magnitude for a uint64
+        -(2**63) - 1,  # too small for an int64
+        2**64,  # too large for a uint64
+        10**25,  # very positive
+    ],
+)
+@pytest.mark.parametrize("mode", ["encode", "decode"])
+def test_encode_decode_big_int(i, mode):
+    # Test ints that are too large to be represented by a C integer type
+    for python_object in (i, [i], {"i": i}):
+        json_string = json.dumps(python_object, separators=(",", ":"))
+        if mode == "encode":
+            if hasattr(sys, "pypy_version_info"):
+                # https://foss.heptapod.net/pypy/pypy/-/issues/3765
+                pytest.skip("PyPy can't serialise big ints")
+            assert ujson.encode(python_object) == json_string
+            if isinstance(python_object, dict):
+                assert ujson.encode(python_object, sort_keys=True) == json_string
+        else:
+            assert ujson.decode(json_string) == python_object
+
+
 @pytest.mark.parametrize(
     "test_input, expected",
     [
@@ -636,15 +662,7 @@ def test_decode_range_raises(test_input, expected):
("[,31337]", ujson.JSONDecodeError), # array leading comma fail
("[,]", ujson.JSONDecodeError), # array only comma fail
("[]]", ujson.JSONDecodeError), # array unmatched bracket fail
("18446744073709551616", ujson.JSONDecodeError), # too big value
("-90223372036854775809", ujson.JSONDecodeError), # too small value
("-23058430092136939529", ujson.JSONDecodeError), # too small value
("-11529215046068469760", ujson.JSONDecodeError), # too small value
("18446744073709551616", ujson.JSONDecodeError), # very too big value
("23058430092136939529", ujson.JSONDecodeError), # too big value
("-90223372036854775809", ujson.JSONDecodeError), # very too small value
("{}\n\t a", ujson.JSONDecodeError), # with trailing non whitespaces
("[18446744073709551616]", ujson.JSONDecodeError), # array with big int
('{"age", 44}', ujson.JSONDecodeError), # read bad object syntax
],
)
@@ -718,11 +736,6 @@ def test_dumps_raises(test_input, expected_exception, expected_message):
(float("nan"), OverflowError),
(float("inf"), OverflowError),
(-float("inf"), OverflowError),
(12839128391289382193812939, OverflowError),
([12839128391289382193812939], OverflowError),
([12839128391289382193812939, 42], OverflowError),
({"a": 12839128391289382193812939}, OverflowError),
({"a": 12839128391289382193812939, "b": 42}, OverflowError),
],
)
def test_encode_raises_allow_nan(test_input, expected_exception):