
[WIP] Use orjson in Python zygote #9715

Draft · wants to merge 5 commits into master
Conversation

nwalters512 (Contributor)
Recent benchmarking showed that Python's native json module, specifically json.loads, was a pretty significant bottleneck for questions that handled megabytes of data in prepare/parse/grade. orjson appears to be about an order of magnitude faster in these cases, parsing in 10ms instead of 100ms. We also benefit from built-in handling of "too large" integers, which saves us another 100ms or so before parsing.
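The speedup claimed above can be sanity-checked with a micro-benchmark along these lines. This is a sketch with a synthetic payload, not the actual zygote data; orjson is a third-party package, so the comparison is skipped if it isn't installed.

```python
import json
import time

# Synthetic multi-megabyte payload, standing in for large prepare/parse/grade data.
payload = json.dumps({"rows": [[i, i * 0.5, str(i)] for i in range(100_000)]})

start = time.perf_counter()
parsed = json.loads(payload)
stdlib_ms = (time.perf_counter() - start) * 1000
print(f"json.loads:   {stdlib_ms:.1f} ms")

try:
    import orjson  # third-party; pip install orjson

    start = time.perf_counter()
    parsed_fast = orjson.loads(payload)
    orjson_ms = (time.perf_counter() - start) * 1000
    print(f"orjson.loads: {orjson_ms:.1f} ms")
    assert parsed_fast == parsed  # same result, different speed
except ImportError:
    print("orjson not installed; skipping comparison")
```

Exact numbers depend on the machine and payload shape, but the order-of-magnitude gap described above is consistent with orjson's published benchmarks.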

I temporarily added logs with timing information and modified the code caller to print them out instead of producing errors, which has streamlined testing of different options. This should be removed before merging.

@nwalters512 nwalters512 added the do not merge This pull request should not be merged yet label Apr 10, 2024
@@ -147,7 +151,9 @@ def worker_loop() -> None:
# wait for a single line of input
json_inp = sys.stdin.readline()
# unpack the input line as JSON
inp = json.loads(json_inp, parse_int=zu.safe_parse_int)
nwalters512 (Contributor Author):
orjson doesn't have customizable parsers that would allow us to retain our safe_parse_int behavior. Is this a dealbreaker?

If this is absolutely necessary, we could instead do this in a postprocessing step where we consider every value and if it's an integer that fails our is_int_json_serializable check, we replace it with float(value). However, a similar operation (assert_all_integers_within_limits) was actually a pretty large bottleneck, and it was great to be able to remove it.
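The postprocessing step described above could look roughly like this. This is a minimal sketch: `is_int_json_serializable` and the 2**53 bound are assumptions modeled on JavaScript's safe-integer range, not the actual zygote helpers.

```python
# JS Numbers are IEEE-754 doubles; integers outside this range lose precision.
MAX_SAFE_INT = 2**53 - 1


def is_int_json_serializable(value: int) -> bool:
    return -MAX_SAFE_INT <= value <= MAX_SAFE_INT


def coerce_big_ints(obj):
    """Recursively replace ints that JS cannot represent exactly with floats."""
    if isinstance(obj, bool):  # bool is a subclass of int; leave it alone
        return obj
    if isinstance(obj, int) and not is_int_json_serializable(obj):
        return float(obj)
    if isinstance(obj, list):
        return [coerce_big_ints(v) for v in obj]
    if isinstance(obj, dict):
        return {k: coerce_big_ints(v) for k, v in obj.items()}
    return obj


print(coerce_big_ints({"ok": 42, "big": 2**60}))
```

As the comment above notes, a full recursive walk like this over megabytes of data is exactly the kind of per-value pass that was a bottleneck before, so it would give back some of the parsing win.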

Collaborator:

It seems like orjson might throw an exception if it runs into an oversized int: https://github.com/ijl/orjson?tab=readme-ov-file#int

EDIT: So actually, it seems like this replicates the assert_all_integers_within_limits for free! You should even be able to throw it into old test cases and see if it works.


EDIT2: Jk, I think I misread this. But doing this pre or post processing might not be that bad. Is there a way to handle this on the serializing for the javascript side?

nwalters512 (Contributor Author):

Yeah, I'm already taking advantage of the "strict int" behavior for serialization, but it doesn't help us on deserialization where we want to coerce "too big" ints into floats.

I started thinking about doing it in the JS side, but that wouldn't help us because JS can't losslessly represent those ints, which is the root of the problem! This begs the question though: how would we ever find ourselves in a situation where we're trying to parse one of these too-large integers? The only way they could get into params and such is by coming out of the Python worker, but we already forbid that, right?
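The lossless-representation point is easy to demonstrate with the stdlib alone: above 2**53, consecutive integers collapse to the same IEEE-754 double, which is all a plain JS Number can hold.

```python
# 2**53 is the largest integer where every neighbor is still exactly
# representable as an IEEE-754 double (i.e., a JS Number).
n = 2**53
assert float(n) == float(n + 1)  # n + 1 silently rounds back down to n
assert float(n - 1) != float(n)  # below the limit, neighbors stay distinct
print(int(float(n + 1)))  # prints 9007199254740992, not ...993
```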

I suppose these JSON objects can also be modified via CSV uploads, but those also flow through JS. Maybe we just need some extra validation there?

I should look back at our original notes and PR to see why we felt we needed this special parsing behavior in the first place. Surely we thought about all this before?

Collaborator:

> I should look back at our original notes and PR to see why we felt we needed this special parsing behavior in the first place. Surely we thought about all this before?

Yeah, IIRC there's detailed discussion about these issues, and my recall is definitely flawed.

nwalters512 (Contributor Author):

As discussed in the dev meeting, we do need this behavior, and here's why: #8462 (comment)

I opened an issue to see if they'd consider making this opt-in native behavior: https://redirect.github.com/ijl/orjson/issues/473

try:
zu.assert_all_integers_within_limits(obj)
return json.dumps(obj, sort_keys=sort_keys, allow_nan=allow_nan)
nwalters512 (Contributor Author):

Change in behavior: orjson doesn't support throwing if NaN is encountered; instead it just serializes it to null. This is consistent with how JSON.stringify(NaN) behaves in JavaScript, so I think this is a reasonable change. Just calling this out though.
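The behavior change is easy to see with the stdlib alone. The orjson half is guarded since it is a third-party package; per the comment above, it emits null for NaN rather than raising.

```python
import json

# Stdlib json has no way to emit null for NaN: allow_nan=True (the default)
# produces the non-standard token NaN, and allow_nan=False raises instead.
print(json.dumps(float("nan")))  # prints NaN, which is not valid JSON
try:
    json.dumps(float("nan"), allow_nan=False)
except ValueError as exc:
    print("allow_nan=False raised:", exc)

try:
    import orjson  # third-party

    # orjson serializes NaN as null, matching JSON.stringify(NaN) in JS.
    print(orjson.dumps(float("nan")))
except ImportError:
    print("orjson not installed; skipping")
```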

try:
zu.assert_all_integers_within_limits(obj)
return json.dumps(obj, sort_keys=sort_keys, allow_nan=allow_nan)
dump_start = time.time()
Collaborator:

Suggested change
dump_start = time.time()
dump_start = time.perf_counter()
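The suggested change matters because `time.time()` follows the wall clock, which NTP or a manual adjustment can step backwards mid-measurement, while `time.perf_counter()` is monotonic and high-resolution. A minimal sketch of the timing pattern (`timed_ms` is an illustrative helper, not zygote code):

```python
import time


def timed_ms(fn):
    """Measure fn's duration using the monotonic high-resolution clock."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000


elapsed = timed_ms(lambda: sum(range(100_000)))
assert elapsed >= 0  # perf_counter never goes backwards
print(f"{elapsed:.3f} ms")
```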

github-actions bot (Contributor) commented Apr 11, 2024

All images

| Image | Platform | Old Size | New Size | Change |
| --- | --- | --- | --- | --- |
| prairielearn/executor:477a7fd2a12ff8aeb2ca0092642bcdb7bf0e2f9c | null | 1631.15 MB | 1630.82 MB | -0.02% |
| prairielearn/plbase:477a7fd2a12ff8aeb2ca0092642bcdb7bf0e2f9c | linux/amd64 | 1441.11 MB | 1440.77 MB | -0.02% |
| prairielearn/prairielearn:477a7fd2a12ff8aeb2ca0092642bcdb7bf0e2f9c | linux/amd64 | 1631.15 MB | 1630.81 MB | -0.02% |
