
Improve IpAddr/SocketAddr serialization by avoiding Display #2001

Merged (4 commits) on Mar 22, 2021

Conversation

@saethlin (Contributor) commented Mar 16, 2021

I'm opening a draft PR because I'd appreciate feedback, but this is definitely not done.

In some profiling, I've found that serializing Ipv4Addr to JSON is disproportionately slow. This (currently crude) patch is ~4x faster. I'm no expert on integer formatting, but I think I can improve the implementation by running cargo expand on itoa and pasting in the chunks I need. (Done: the resulting algorithm is faster.) As far as I can tell, going through itoa's public interface is not faster than this code, because LLVM keeps the itoa::Buffer in memory.

Is there any concern with not using std::fmt::Display? As far as I can tell from experimenting, there's basically no opportunity to make the standard library formatting code faster because it uses so much dynamic dispatch.
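The approach being described, writing each octet's decimal digits directly into a stack buffer instead of going through `std::fmt::Display`, can be sketched roughly like this (illustrative names only; this is not the code in the PR):

```rust
use std::net::Ipv4Addr;

// Write the decimal digits of one octet into `buf` starting at `pos`,
// returning the position just past the last digit (at most 3 digits).
fn format_u8(mut n: u8, buf: &mut [u8], mut pos: usize) -> usize {
    if n >= 100 {
        buf[pos] = b'0' + n / 100;
        pos += 1;
        n %= 100;
        buf[pos] = b'0' + n / 10;
        pos += 1;
    } else if n >= 10 {
        buf[pos] = b'0' + n / 10;
        pos += 1;
    }
    buf[pos] = b'0' + n % 10;
    pos + 1
}

// Format an Ipv4Addr as "a.b.c.d" without going through Display.
fn format_ipv4(addr: &Ipv4Addr) -> String {
    let mut buf = [0u8; 15]; // "255.255.255.255" is the longest form
    let mut pos = 0;
    for (i, octet) in addr.octets().iter().enumerate() {
        if i > 0 {
            buf[pos] = b'.';
            pos += 1;
        }
        pos = format_u8(*octet, &mut buf, pos);
    }
    // The buffer holds only ASCII digits and dots, so this cannot fail.
    std::str::from_utf8(&buf[..pos]).unwrap().to_owned()
}
```

The resulting string can then be handed to `serialize_str`, which avoids the dynamic dispatch that `Display`-based formatting goes through.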

@dtolnay (Member) left a comment

Would you be able to share your benchmark showing 4x faster serialization to JSON?

@saethlin (Contributor, Author)

🤦 Yes

```rust
use std::net::Ipv4Addr;

fn main() {
    let mut buf = [0u8; 15];
    for _ in 0..10_000_000 {
        ser_ip(&mut buf, &Ipv4Addr::new(127, 10, 0, 1));
    }
}

#[inline(never)]
fn ser_ip(buf: &mut [u8], thing: &Ipv4Addr) {
    serde_json::to_writer(buf, thing).unwrap();
}
```

```console
$ cargo build --release
$ perf stat -r10 ./target/release/scratch
```

I'm aware this is not a typical way to do benchmarking in the Rust community, but I find it much more stable than cargo bench or criterion, probably because it averages over a few process startups.

@dtolnay (Member) commented Mar 21, 2021

Would it be worth comparing against an unsafe version that removes the zero-initialization of the whole buffer and the UTF-8 validation? My expectation from benchmarking itoa in the past is that those unnecessary steps still account for 30-40% of this code's runtime. We could also consider removing all those bounds checks in format_u8, but that tends not to be measurable in a microbenchmark, since they're 100% correctly predicted; the real performance implication is in polluting the branch predictor.

@saethlin (Contributor, Author) commented Mar 22, 2021

> Would it be worth comparing against an unsafe version that removes the zero initialization of the whole buffer and the utf8 validation? My expectation from benchmarking itoa in the past is that those unnecessary things account for 30-40% of this code still.

In this case, the UTF-8 validation appears to be ~30% of the benchmark's runtime. Adding #[inline] to the function in the standard library causes it to be inlined, though not cleaned up very much; even so, that removes half the overhead. It still seems like a silly thing to be spending cycles and code size on.
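The trade-off under discussion can be sketched as follows (illustrative only, not the code in this PR): the buffer is known to hold only ASCII digits and dots, so the validation in `str::from_utf8` does no useful work.

```rust
// Checked conversion: walks every byte to confirm it is valid UTF-8,
// even though a digit buffer can only contain ASCII.
fn to_str_checked(buf: &[u8]) -> &str {
    std::str::from_utf8(buf).unwrap()
}

// SAFETY: the caller must guarantee `buf` is valid UTF-8; for a buffer
// of ASCII digits and dots this always holds.
unsafe fn to_str_unchecked(buf: &[u8]) -> &str {
    std::str::from_utf8_unchecked(buf)
}
```

The unchecked variant skips the validation pass entirely, which is where the ~30% figure above would come from.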

As far as I can tell, leaving the buffer uninitialized doesn't help at all. But initializing it with b'.' instead of 0u8 and skipping over writing each b'.' knocks a few percent off the runtime. The annotated assembly from perf suggests the initialization instructions are ~0.10% of total runtime. Maybe things look different in a real program but they can't be that much different. Maybe this will matter more for Ipv6 addresses.
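The b'.' prefill trick described above can be sketched like this (hypothetical names; not the patch itself): the separators are written once by the initializer, so the formatting loop only advances past them.

```rust
use std::net::Ipv4Addr;

// Prefill the buffer with b'.' so the separators never need to be
// written explicitly; each octet just skips the slot holding its dot.
fn format_ipv4_dots(addr: &Ipv4Addr) -> String {
    let mut buf = [b'.'; 15]; // "255.255.255.255" is the longest form
    let mut pos = 0;
    for (i, octet) in addr.octets().iter().enumerate() {
        if i > 0 {
            pos += 1; // skip over the pre-written b'.'
        }
        let mut n = *octet;
        if n >= 100 {
            buf[pos] = b'0' + n / 100;
            pos += 1;
            n %= 100;
            buf[pos] = b'0' + n / 10;
            pos += 1;
        } else if n >= 10 {
            buf[pos] = b'0' + n / 10;
            pos += 1;
        }
        buf[pos] = b'0' + n % 10;
        pos += 1;
    }
    String::from_utf8(buf[..pos].to_vec()).unwrap()
}
```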

> We can also consider removing all those bounds checks in format_u8

As far as I can tell, there are no bounds checks emitted. It looks like LLVM correctly figures out that the buffer is large enough.

After these improvements, the big hog in the benchmark is serialize_str, which accounts for ~57% of runtime; the time is spent in the JSON escaping loop and in the memset calls that LLVM emits to write the string delimiters (which I've reported here). It would be nifty to skip the escaping, but it's unclear to me whether that's possible.
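The escaping loop has nothing to find in these strings, since every byte is a digit or a dot. A simplified sketch of the condition JSON string escaping checks per byte (the characters JSON requires escaping are the quote, the backslash, and control characters below 0x20):

```rust
// Returns true if any byte in `s` would need escaping inside a JSON
// string literal. IP-address strings never trigger this.
fn needs_json_escape(s: &str) -> bool {
    s.bytes().any(|b| b == b'"' || b == b'\\' || b < 0x20)
}
```

In principle a serializer could bypass the escaping loop for strings it knows are escape-free, but serde_json does not expose a hook for that, which is what makes skipping it unclear.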

@dtolnay (Member) commented Mar 22, 2021

Thanks -- LGTM once tests pass.

@saethlin (Contributor, Author) commented Mar 22, 2021

I've made the tests pass. Do we want to give the same treatment to SocketAddrV4 and the Ipv6 types, or save that for another PR?

@dtolnay (Member) left a comment

Thanks, looks good.

Yes, I am open to optimizing SocketAddrV4 and the V6 types too, if it matters to anyone.
