
Experiment with using bytes::Bytes to back bytes and string fields #190

Closed
nrc wants to merge 4 commits

Conversation

nrc
Contributor

nrc commented May 30, 2019

This PR uses bytes::Bytes as the type for bytes and a new type BytesString as the type for string (rather than Vec<u8> and String). BytesString is a wrapper around Bytes in the same way that String is a wrapper around Vec<u8>. Obviously this is a severe breaking change.
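
For context, a minimal sketch of what such a UTF-8-enforcing wrapper over Bytes might look like (illustrative only; the names and API of the actual BytesString in this PR may differ):

```rust
// Illustrative sketch, not the actual BytesString from this PR: a wrapper
// over bytes::Bytes that upholds a UTF-8 invariant, analogous to how String
// wraps Vec<u8>.
use bytes::Bytes;

#[derive(Clone, Default, Debug, PartialEq)]
pub struct BytesString {
    // Invariant: `bytes` always holds valid UTF-8.
    bytes: Bytes,
}

impl BytesString {
    /// Validate once on construction; the invariant makes `as_str` cheap.
    pub fn from_bytes(bytes: Bytes) -> Result<Self, std::str::Utf8Error> {
        std::str::from_utf8(&bytes)?;
        Ok(BytesString { bytes })
    }

    pub fn as_str(&self) -> &str {
        // Safe: the constructor validated the bytes, and `Bytes` offers no
        // way to mutate the contents afterwards.
        unsafe { std::str::from_utf8_unchecked(&self.bytes) }
    }
}
```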

This idea has been discussed in #31 and #35.

This PR improves performance by 25% (!) on Mac OS, 9% on Ubuntu, and 2% on CentOS. This was tested using a fairly realistic benchmark derived from TiKV; however, the bytes and string values were all quite small, and I expect the performance improvements would be smaller for larger values. I also tested a multi-threaded version of the benchmark on Ubuntu: there the average improvement was only 1.5%, but the worst-performing thread improved by 14%, so the overall time to completion improved significantly.

If we do go forward with this PR, I think there are a few open questions, but I thought I'd post early to get feedback on whether it is possible to land at all, given the back-compat situation.

If we do land it, I think we can probably get rid of the BytesMutString type and just use BytesString; I didn't see any performance or usability benefit in the BytesMut version.

One option might be to land the change for bytes but not for string. That is a less invasive change that would still give a decent benefit for some users.

@danburkert what do you think?

nrc added 3 commits May 30, 2019 15:14
@mzabaluev
Contributor

Coincidentally, I have put some time into a yet-unpublished crate that provides a wrapper enforcing the UTF-8 invariant over Bytes. I intend to use it in a text encoding library for Tokio, but it could be reused here as well, rather than reinventing the string wrapper specifically for prost.

@danburkert
Collaborator

danburkert commented Jun 7, 2019

Hey @nrc thanks for the work on this. I’m definitely in support of this direction.

I'd prefer that this be made configurable as an option in prost-build, in the same way that HashMap/BTreeMap is. That way the default can stay the std types, and the Bytes-based types can be used if/when perf issues arise.

I agree re. not maintaining our own string wrapper over Bytes; it seems like we'll want to use whatever shakes out as the community choice.

Re. performance, I'm a bit surprised that this resulted in improvements. My understanding is that this PR isn't using ref-count sharing to clone into the fields, right? If not, what explains the perf difference?

@danburkert
Collaborator

My understanding is that this PR isn’t using ref count sharing to clone into the fields, right?

Ah, so I'd been thinking about this entirely from the perspective of decoding performance, but maybe the benchmark is measuring time to create & serialize a message?

// inserted into it the resulting vec are valid UTF-8. We check
// explicitly in order to ensure this is safe.
super::bytes::merge(wire_type, value.as_mut_vec(), buf)?;

super::bytes::merge(wire_type, value.as_bytes_mut(), buf)?;
str::from_utf8(value.as_bytes())
.map_err(|_| DecodeError::new("invalid string value: data is not UTF-8 encoded"))?;
Contributor

This looks wrong: value is left with broken UTF-8. The previous code was wrong too, FWIW.

String::as_mut_vec should be banned and replaced with a more principled scope-guard API that would sanitize the buffer on unwind. I've been too lazy to write an RFC on this.

Collaborator

@mzabaluev that's a great catch. Any interest in sending a PR? I imagine the fix will need to be something along the lines of swapping the string field with an empty string (to retain any allocated capacity), converting it to a Vec<u8>, copying the bytes in, then converting back to a String and storing it.
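
A sketch of that approach, purely for illustration (the real fix is in #194; this assumes the surrounding prost::encoding items -- WireType, DecodeError, the bytes merge helper, and a Buf-bounded reader -- shown in the snippets above):

```rust
// Illustrative only: take the String out of the field, decode into a plain
// Vec<u8>, and only store it back once it has been re-validated as UTF-8.
// On error (or a panic inside the Buf impl) the field is left as a valid,
// empty String instead of holding broken UTF-8.
pub fn merge<B: Buf>(
    wire_type: WireType,
    value: &mut String,
    buf: &mut B,
) -> Result<(), DecodeError> {
    // Move the string out (its allocation moves with it) and drop to bytes.
    let mut bytes = std::mem::replace(value, String::new()).into_bytes();
    super::bytes::merge(wire_type, &mut bytes, buf)?;
    // Re-validate before the data can ever be observed as a String again.
    *value = String::from_utf8(bytes)
        .map_err(|_| DecodeError::new("invalid string value: data is not UTF-8 encoded"))?;
    Ok(())
}
```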

IIUC the only way this could have been caught without a specific unit test is if we ran the fuzzer under MIRI, but I'm not sure if that's technically possible. In any case, it should be straightforward to check with a unit test.

Collaborator

It'd be worthwhile to audit for any other uses of as_mut_vec() as well; you are definitely right that it's a footgun of an API.

Contributor

#194 has a fix for the master branch. That's the only use of String::as_mut_vec in the repository.

Contributor Author

Just to make sure I understand the problem here: the issue is that if there is an error in merge, we implicitly return via the ?. Do we assume that value is valid in the error case? I suppose the user is able to do whatever they want with it, so we shouldn't give them invalid data.

Contributor

@nrc It is UB to expose invalid UTF-8 data inside a String. Both the error case, and any panic in the Buf implementation, may result in a malformed string being accessed.

value: &mut Vec<u8>,
buf: &mut B,
) -> Result<(), DecodeError>
pub fn merge<B>(wire_type: WireType, value: &mut Bytes, buf: &mut B) -> Result<(), DecodeError>
Contributor

In its current shape, this just replaces Vec::extend_from_slice with Bytes::extend_from_slice, so it does not really achieve the zero-copy optimization requested in #31 on the decoding side. It's hard to see how this code alone could, though, since the source buffer is behind the Buf abstraction. How about changing the merge method to extract data from Bytes, so that value representations backed by Bytes could be split off from the buffer?
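
For illustration, this is the kind of splitting being suggested, assuming the decode source is itself a Bytes (the helper name is hypothetical):

```rust
use bytes::Bytes;

// Hypothetical helper: when the whole input buffer is a `Bytes`, a
// length-delimited field can be carved out of it by reference counting
// instead of copying.
fn read_bytes_field(src: &mut Bytes, len: usize) -> Bytes {
    // `split_to` adjusts offsets and shares the underlying allocation;
    // no byte data is copied.
    src.split_to(len)
}

fn main() {
    let mut buf = Bytes::from_static(b"hello, world");
    let field = read_bytes_field(&mut buf, 5);
    assert_eq!(&field[..], b"hello");
    assert_eq!(&buf[..], b", world");
}
```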

Collaborator

Right, this is why I've been holding off on adding support for Bytes fields; as I mentioned in #31, it's not possible to get the ref-counted zero-copy support without specialization. Replacing Buf with Bytes in the Message API is not something I want to do.

Collaborator

To be clear, I don't have any issue landing a PR to add support for Bytes fields, and if it's technically possible I'm also in favor of supporting zero-copy decoding through a nightly feature flag + specialization. The only thing I'm not willing to do at this point is change the Message API to enable zero copy.

@nrc
Contributor Author

nrc commented Jun 9, 2019

Re. performance, I'm a bit surprised that this resulted in improvements. My understanding is that this PR isn't using ref-count sharing to clone into the fields, right? If not, what explains the perf difference?

The performance improvement is from the improved allocation patterns: 'small' Bytes values are stored inline rather than on the heap, which avoids a lot of small allocations. Since small allocations are slow with some allocators, this causes a big improvement. On platforms with better allocators (or when using a custom allocator) the difference is far less significant.

@nrc
Contributor Author

nrc commented Jun 9, 2019

I'd prefer that this be made configurable as an option in prost-build, in the same way that HashMap/BTreeMap is. That way the default can stay the std types, and the Bytes-based types can be used if/when perf issues arise.

I agree re. not maintaining our own string wrapper over Bytes; it seems like we'll want to use whatever shakes out as the community choice.

To clarify, how would you like to proceed? Use Bytes for Vec<u8> but do nothing with Strings for now, and with the Bytes stuff behind a Cargo feature?

@danburkert
Collaborator

To clarify, how would you like to proceed? Use Bytes for Vec<u8> but do nothing with Strings for now, and with the Bytes stuff behind a Cargo feature?

Good question; there are a couple of concerns here, so I'll try to break them out.

  • Whether to do only bytes in this PR or to support string as well: I don't have a strong preference; I'd defer to you as the implementer to do whatever is more straightforward.

  • What the default generated types will be: I have a strong preference that a protobuf string translate to a Rust String by default (the status quo). I don't have a strong preference for bytes, but w/o a compelling reason otherwise I'd err on the side of consistency, so it would default to Vec<u8>.

  • How users will opt in to changing the generated type: a Cargo feature isn't a good fit for the opt-in configuration here, due to the viral nature of Cargo features. Instead, I think it'd be better to follow the lead of prost_build::Config::btree_map and make it configurable at a finer-grained level (sketched below). This will make the feature harder to implement -- it's going to require effectively adding new 'primitive' field types to prost-derive and prost::encoding -- but I feel it's the 'right' choice WRT configurability and avoiding surprising Cargo feature side effects.
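
As a sketch of that opt-in shape (illustrative only: Config::btree_map is the existing option referenced above, the commented-out bytes call is the hypothetical analogue under discussion, and the .proto paths are placeholders):

```rust
// build.rs
fn main() {
    let mut config = prost_build::Config::new();
    // Existing opt-in: generate BTreeMap instead of HashMap for all map fields.
    config.btree_map(&["."]);
    // Hypothetical analogue: generate bytes::Bytes for matching bytes fields.
    // config.bytes(&["."]);
    config
        .compile_protos(&["src/messages.proto"], &["src/"])
        .expect("failed to compile protos");
}
```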

@danburkert
Collaborator

The performance improvement is from the improved allocation patterns: 'small' Bytes values are stored inline rather than on the heap, which avoids a lot of small allocations. Since small allocations are slow with some allocators, this causes a big improvement. On platforms with better allocators (or when using a custom allocator) the difference is far less significant.

I see, thanks. That makes sense, and is a compelling reason to support this feature even independent of zero-copy decoding.

@nrc
Contributor Author

nrc commented Jun 11, 2019

Good question; there are a couple of concerns here, so I'll try to break them out.

Cool, thanks for the pointers! I'll work on bringing this PR in line with that. I would like to get #186 and #187 landed before I work more on this (those two are blocking us from using Prost, whereas this one is not).

@mzabaluev
Contributor

Would it be hard or unnecessarily complicated to make the choice of representation types for bytes and string fields configurable in prost-build? What would be necessary to implement for the repr types, besides Default and the traits for encoding and decoding?

@mzabaluev
Contributor

I have published strchunk to crates.io now. StrChunk can be used for string fields instead of a wrapper type specific to this crate.

@quininer
Member

The latest bytes has removed SBO. Is there still a performance improvement?

@mzabaluev
Contributor

The latest bytes has removed SBO. Is there still a performance improvement?

The SBO was a dubious optimization to begin with: it caused additional branching, and it was not at all clear that much usage would fit within the inline buffers even inside a single application.
The zero-copy advantage of Bytes persists, though.

@nrc
Contributor Author

nrc commented Nov 27, 2019

We'd need to benchmark to know. SBO made a huge difference with the default allocator, but not much difference when using jemalloc. It would be interesting to investigate.

@rolftimmermans
Contributor

I am interested in using Bytes instead of Vec<u8> for fields of type bytes. We are dealing with messages where most of the payload is a medium-large blob. It would be great to be able to decode the message without copying the blobs. I'd be happy with any sort of opt-in mechanism (either per field, or as a Cargo feature).

Is this PR still on the radar?

@danburkert
Collaborator

I think the state of things is still the same as what I summarized in https://github.com/danburkert/prost/pull/190/files#r291820981. I'm open to a PR which adds Bytes field generation as an option for protobuf string/bytes fields, but it needs to be opt-in. There's already a pattern for this kind of opt-in codegen change in the repo with HashMap and BTreeMap.

As far as zero-copy deserialization goes, I think it's still going to require either changes to the bytes crate (an additional API such as fn Buf::take_bytes(usize) -> Bytes), or specialization. Landing support for Bytes fields will be a good first step; then we can prioritize zero-copy from there.
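
To make the shape of that concrete, here is a hypothetical extension in the spirit of the take_bytes suggestion (invented here for illustration, not an API the bytes crate provided):

```rust
use bytes::Bytes;

// Hypothetical extension trait sketching the `take_bytes` idea above.
trait TakeBytes {
    /// Detach the next `len` bytes as a `Bytes` value.
    fn take_bytes(&mut self, len: usize) -> Bytes;
}

impl TakeBytes for Bytes {
    fn take_bytes(&mut self, len: usize) -> Bytes {
        // Zero-copy: `split_to` shares the underlying allocation by
        // reference counting.
        self.split_to(len)
    }
}

impl TakeBytes for Vec<u8> {
    fn take_bytes(&mut self, len: usize) -> Bytes {
        // A plain Vec has no shared ownership, so this path hands over the
        // head by value and keeps the tail in place.
        let tail = self.split_off(len);
        let head = std::mem::replace(self, tail);
        Bytes::from(head)
    }
}
```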

@cbeck88
Contributor

cbeck88 commented May 12, 2020

@danburkert what would you think about using a trait-based approach instead of what is proposed in this PR? That is, if a type implements From<&[u8]> and AsRef<[u8]> (or possibly TryFrom instead of From?), then it can be used as bytes in prost-derive. Prost itself likely doesn't use any Vec-specific stuff, right? Or it could abstract what it does use as a trait.

Or maybe what I'm saying is orthogonal to this -- I'm describing a feature of prost-derive and they are describing a feature of prost-build.

@danburkert
Collaborator

@garbageslam not sure I fully understand, but that sounds orthogonal. prost definitely does 'know' about Vec; the derived code calls into Vec-specific serialization/deserialization routines defined in src/encoding.rs.

@cbeck88
Contributor

cbeck88 commented May 12, 2020

I see, I forgot about this; never mind. Thanks!

@rolftimmermans
Contributor

I think the state of things is still the same as what I summarized in https://github.com/danburkert/prost/pull/190/files#r291820981. I'm open to a PR which adds Bytes field generation as an option for protobuf string/bytes fields, but it needs to be opt-in. There's already a pattern for this kind of opt-in codegen change in the repo with HashMap and BTreeMap.

As far as zero-copy deserialization goes, I think it's still going to require either changes to the bytes crate (an additional API such as fn Buf::take_bytes(usize) -> Bytes), or specialization. Landing support for Bytes fields will be a good first step; then we can prioritize zero-copy from there.

Thanks for your suggestions. I implemented this in #337.

@danburkert
Collaborator

Thanks for your suggestions. I implemented this in #337.

Closing accordingly!

danburkert closed this Nov 15, 2020