
Optimize serde_json::from_reader #1007

Closed
wants to merge 7 commits

Conversation


@tgnottingham tgnottingham commented Apr 2, 2023

This improves serde_json::from_reader performance by adding internal buffering and various optimizations, partially resolving #160. The optimization that avoids copying into the scratch space was inspired by @bouk's work (#160 (comment)) -- thank you, @bouk.

Buffering is used regardless of whether the client provides a BufRead reader. This means there are two layers of buffering when the client does use a BufRead, but performance turns out to be better in that case anyway, so the main downside is the additional memory use. This approach also eliminates the foot-gun of calling from_reader without a BufRead.

I'm wondering if an approach like this would be accepted. If so, I'd improve the test coverage before submitting.

Using the example from #160 (comment) as a benchmark (taking the middle of five measurements):

Before:

  • Without BufReader: 13.460s
  • With BufReader: 0.213s

After:

  • Without BufReader, 128 byte internal buffer: 0.170s (98% improvement from baseline)
  • Without BufReader, 2048 byte internal buffer: 0.069s (99% improvement from baseline, 59% from 128 byte buffer)
  • Without BufReader, 4096 byte internal buffer: 0.066s (99.5% improvement from baseline, 61% from 128 byte buffer)
  • With BufReader, 128 byte internal buffer: 0.067s (69% improvement from baseline)
  • With BufReader, 2048 byte internal buffer: 0.062s (71% improvement from baseline, 7% from 128 byte buffer)
  • With BufReader, 4096 byte internal buffer: 0.063s (70% improvement from baseline, 6% from 128 byte buffer)

serde_json::from_vec, in comparison, takes 0.063s including reading the file into a Vec.

Benchmark code can be found here -- https://gist.github.com/tgnottingham/9eb4b84ea82a6578628a7b7ae2ffc191.
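
Roughly, each measurement has the following shape (a minimal sketch rather than the actual gist code; "large.json" stands in for the input file, and the middle of five runs was taken):

use std::fs::File;
use std::io::BufReader;
use std::time::Instant;

use serde_json::Value;

fn main() {
    // Unbuffered: serde_json pulls bytes straight from the File.
    let file = File::open("large.json").unwrap();
    let start = Instant::now();
    let _: Value = serde_json::from_reader(file).unwrap();
    println!("without BufReader: {:?}", start.elapsed());

    // Buffered: the caller wraps the File in a BufReader first.
    let file = BufReader::new(File::open("large.json").unwrap());
    let start = Instant::now();
    let _: Value = serde_json::from_reader(file).unwrap();
    println!("with BufReader: {:?}", start.elapsed());
}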

Using a different benchmark:

Before:

  • Without BufReader: 16.147s
  • With BufReader: 0.346s

After:

  • Without BufReader, 128 byte internal buffer: 0.317s (98% improvement from baseline)
  • Without BufReader, 2048 byte internal buffer: 0.192s (98.8% improvement from baseline, 39% from 128 byte buffer)
  • Without BufReader, 4096 byte internal buffer: 0.187s (98.8% improvement from baseline, 41% from 128 byte buffer)
  • With BufReader, 128 byte internal buffer: 0.188s (46% improvement from baseline)
  • With BufReader, 2048 byte internal buffer: 0.186s (46% improvement from baseline, 1% from 128 byte buffer)
  • With BufReader, 4096 byte internal buffer: 0.186s (46% improvement from baseline, 1% from 128 byte buffer)

serde_json::from_vec, in comparison, takes 0.170s including reading the file into a Vec.

Benchmark code can be found here -- https://gist.github.com/tgnottingham/950569d1fd85d58116b32614bddcd03a.

A 2048 byte buffer was the smallest power-of-two size at which performance without a BufReader started to compete with using one. A 4096 byte buffer improved performance slightly more, bringing the two nearly on par.

The PR currently uses a 128 byte buffer to be conservative with memory, but I'm interested in knowing whether a larger buffer size would be accepted, since it improves run time considerably when a BufReader isn't used.

@Logarithmus

Thank you very much for your work! Unfortunately, the author of this project (@dtolnay) lacks time for it, so be prepared to wait YEARS for your PR to be merged. Using a fork would be easier.

@dtolnay dtolnay left a comment

Thanks for the PR!

An internal buffer that applies to serde_json::Deserializer::from_reader is not going to be acceptable. The caller is allowed to read additional content that comes after the JSON data directly from the original stream. For example, this PR breaks the following use case:

use serde::Deserialize;
use serde_json::Value;

fn main() {
    let mut reader = br#"[1] {"...":true}"#.as_slice();

    let header = Value::deserialize(&mut serde_json::Deserializer::from_reader(&mut reader)).unwrap();
    println!("{:?}", header);

    let body = Value::deserialize(&mut serde_json::Deserializer::from_reader(&mut reader)).unwrap();
    println!("{:?}", body);
}

Before:

Array [Number(1)]
Object {"...": Bool(true)}

After:

Array [Number(1)]
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error("EOF while parsing a value", line: 1, column: 0)', src/main.rs:10:92
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

For serde_json::from_reader specifically, that concern does not apply because it enforces that the stream contains only a single JSON value. We could potentially buffer inside serde_json::from_reader without buffering serde_json::Deserializer::from_reader. However, I would be concerned about the wisdom of maintaining two different reader implementations (and forcing users to pay for that in serde_json compile time) when a caller could achieve the same effect by using BufReader.
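
To make that distinction concrete, a minimal sketch (not part of this PR): from_reader rejects any non-whitespace content after the value, so there is no supported way for a caller to consume bytes that an internal buffer might have read past the end of the JSON.

use serde_json::Value;

fn main() {
    // The same input as above: a JSON value followed by more data.
    let data = br#"[1] {"...":true}"#;

    // serde_json::from_reader requires the stream to contain exactly one
    // JSON value, so the trailing object is reported as a trailing-characters
    // error instead of being left on the stream for the caller.
    let result: Result<Value, _> = serde_json::from_reader(data.as_slice());
    println!("{}", result.unwrap_err());
}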

I haven't looked closely at what else is in the PR but you said there are "various optimizations" lumped in here aside from the internal buffering. If any of those can be decoupled from an internal buffer, I would be open to reviewing those in a separate PR.

@tgnottingham

Thanks for the response! Since the buffering approach won't work, I'll go ahead and close this.
