
[WIP] Use orjson in Python zygote #9715

Draft · wants to merge 5 commits into master
Conversation

nwalters512 (Contributor)
Recent benchmarking showed that Python's native json module, specifically json.loads, was a pretty significant bottleneck for questions that handled megabytes of data in prepare/parse/grade. orjson appears to be about an order of magnitude faster in these cases, parsing in 10ms instead of 100ms. We also benefit from built-in handling of "too large" integers, which saves us another 100ms or so before parsing.
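The speedup claimed above can be sanity-checked with a micro-benchmark along these lines. This is a sketch with a synthetic payload, not the actual zygote data; orjson is a third-party package, so the comparison is skipped if it isn't installed.

```python
import json
import time

# Synthetic multi-megabyte payload, standing in for large prepare/parse/grade data.
payload = json.dumps({"rows": [[i, i * 0.5, str(i)] for i in range(100_000)]})

start = time.perf_counter()
parsed = json.loads(payload)
stdlib_ms = (time.perf_counter() - start) * 1000
print(f"json.loads:   {stdlib_ms:.1f} ms")

try:
    import orjson  # third-party; pip install orjson

    start = time.perf_counter()
    parsed_fast = orjson.loads(payload)
    orjson_ms = (time.perf_counter() - start) * 1000
    print(f"orjson.loads: {orjson_ms:.1f} ms")
    assert parsed_fast == parsed  # same result, different speed
except ImportError:
    print("orjson not installed; skipping comparison")
```

Exact numbers depend on the machine and payload shape, but the order-of-magnitude gap described above is consistent with orjson's published benchmarks.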

I temporarily added logs with timing information and modified the code caller to print them out instead of producing errors, which has streamlined testing of different options. This should be removed before merging.

@nwalters512 nwalters512 added the do not merge This pull request should not be merged yet label Apr 10, 2024
@@ -147,7 +151,9 @@ def worker_loop() -> None:
# wait for a single line of input
json_inp = sys.stdin.readline()
# unpack the input line as JSON
inp = json.loads(json_inp, parse_int=zu.safe_parse_int)
nwalters512 (Contributor Author):
orjson doesn't have customizable parsers that would allow us to retain our safe_parse_int behavior. Is this a dealbreaker?

If this is absolutely necessary, we could instead do this in a postprocessing step where we consider every value and if it's an integer that fails our is_int_json_serializable check, we replace it with float(value). However, a similar operation (assert_all_integers_within_limits) was actually a pretty large bottleneck, and it was great to be able to remove it.
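The postprocessing step described above could look roughly like this. This is a minimal sketch: `is_int_json_serializable` and the 2**53 bound are assumptions modeled on JavaScript's safe-integer range, not the actual zygote helpers.

```python
# JS Numbers are IEEE-754 doubles; integers outside this range lose precision.
MAX_SAFE_INT = 2**53 - 1


def is_int_json_serializable(value: int) -> bool:
    return -MAX_SAFE_INT <= value <= MAX_SAFE_INT


def coerce_big_ints(obj):
    """Recursively replace ints that JS cannot represent exactly with floats."""
    if isinstance(obj, bool):  # bool is a subclass of int; leave it alone
        return obj
    if isinstance(obj, int) and not is_int_json_serializable(obj):
        return float(obj)
    if isinstance(obj, list):
        return [coerce_big_ints(v) for v in obj]
    if isinstance(obj, dict):
        return {k: coerce_big_ints(v) for k, v in obj.items()}
    return obj


print(coerce_big_ints({"ok": 42, "big": 2**60}))
```

As the comment above notes, a full recursive walk like this over megabytes of data is exactly the kind of per-value pass that was a bottleneck before, so it would give back some of the parsing win.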

Collaborator:

It seems like orjson might throw an exception if it runs into an oversized int: https://github.com/ijl/orjson?tab=readme-ov-file#int

EDIT: So actually, it seems like this replicates the assert_all_integers_within_limits for free! You should even be able to throw it into old test cases and see if it works.


EDIT2: Jk, I think I misread this. But doing this pre or post processing might not be that bad. Is there a way to handle this on the serializing for the javascript side?

nwalters512 (Contributor Author):

Yeah, I'm already taking advantage of the "strict int" behavior for serialization, but it doesn't help us on deserialization where we want to coerce "too big" ints into floats.

I started thinking about doing it in the JS side, but that wouldn't help us because JS can't losslessly represent those ints, which is the root of the problem! This begs the question though: how would we ever find ourselves in a situation where we're trying to parse one of these too-large integers? The only way they could get into params and such is by coming out of the Python worker, but we already forbid that, right?
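The lossless-representation point is easy to demonstrate with the stdlib alone: above 2**53, consecutive integers collapse to the same IEEE-754 double, which is all a plain JS Number can hold.

```python
# 2**53 is the largest integer where every neighbor is still exactly
# representable as an IEEE-754 double (i.e., a JS Number).
n = 2**53
assert float(n) == float(n + 1)  # n + 1 silently rounds back down to n
assert float(n - 1) != float(n)  # below the limit, neighbors stay distinct
print(int(float(n + 1)))  # prints 9007199254740992, not ...993
```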

I suppose these JSON objects can also be modified via CSV uploads, but those also flow through JS. Maybe we just need some extra validation there?

I should look back at our original notes and PR to see why we felt we needed this special parsing behavior in the first place. Surely we thought about all this before?

Collaborator:

> I should look back at our original notes and PR to see why we felt we needed this special parsing behavior in the first place. Surely we thought about all this before?

Yeah, IIRC there's detailed discussion about these issues, and my recall is definitely flawed.

nwalters512 (Contributor Author):

As discussed in the dev meeting, we do need this behavior, and here's why: #8462 (comment)

I opened an issue to see if they'd consider making this opt-in native behavior: https://redirect.github.com/ijl/orjson/issues/473

try:
zu.assert_all_integers_within_limits(obj)
return json.dumps(obj, sort_keys=sort_keys, allow_nan=allow_nan)
nwalters512 (Contributor Author):

Change in behavior: orjson doesn't support throwing if NaN is encountered; instead it just serializes it to null. This is consistent with how JSON.stringify(NaN) behaves in JavaScript, so I think this is a reasonable change. Just calling this out though.
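The behavior change is easy to see with the stdlib alone. The orjson half is guarded since it is a third-party package; per the comment above, it emits null for NaN rather than raising.

```python
import json

# Stdlib json has no way to emit null for NaN: allow_nan=True (the default)
# produces the non-standard token NaN, and allow_nan=False raises instead.
print(json.dumps(float("nan")))  # prints NaN, which is not valid JSON
try:
    json.dumps(float("nan"), allow_nan=False)
except ValueError as exc:
    print("allow_nan=False raised:", exc)

try:
    import orjson  # third-party

    # orjson serializes NaN as null, matching JSON.stringify(NaN) in JS.
    print(orjson.dumps(float("nan")))
except ImportError:
    print("orjson not installed; skipping")
```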

try:
zu.assert_all_integers_within_limits(obj)
return json.dumps(obj, sort_keys=sort_keys, allow_nan=allow_nan)
dump_start = time.time()
Collaborator:

Suggested change
dump_start = time.time()
dump_start = time.perf_counter()
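The suggested change matters because `time.time()` follows the wall clock, which NTP or a manual adjustment can step backwards mid-measurement, while `time.perf_counter()` is monotonic and high-resolution. A minimal sketch of the timing pattern (`timed_ms` is an illustrative helper, not zygote code):

```python
import time


def timed_ms(fn):
    """Measure fn's duration using the monotonic high-resolution clock."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000


elapsed = timed_ms(lambda: sum(range(100_000)))
assert elapsed >= 0  # perf_counter never goes backwards
print(f"{elapsed:.3f} ms")
```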

github-actions bot (Contributor) commented Apr 11, 2024

All images

| Image | Platform | Old Size | New Size | Change |
| --- | --- | --- | --- | --- |
| prairielearn/executor:477a7fd2a12ff8aeb2ca0092642bcdb7bf0e2f9c | null | 1631.15 MB | 1630.82 MB | -0.02% |
| prairielearn/plbase:477a7fd2a12ff8aeb2ca0092642bcdb7bf0e2f9c | linux/amd64 | 1441.11 MB | 1440.77 MB | -0.02% |
| prairielearn/prairielearn:477a7fd2a12ff8aeb2ca0092642bcdb7bf0e2f9c | linux/amd64 | 1631.15 MB | 1630.81 MB | -0.02% |
