Rethink the serde [u8; 16] binary representation #557

Closed
KodrAus opened this issue Nov 2, 2021 · 26 comments · Fixed by #564

@KodrAus
Member

KodrAus commented Nov 2, 2021

Blocking #553
Following #329

The currently proposed stable uuid API changes the serde implementation from &[u8] to [u8; 16]. This is a breaking change to existing serialized data.

It's been pointed out on Reddit that [u8; 16] isn't really a universal improvement over &[u8] for binary formats, because some can treat byte slices specially.

Before stabilizing, we should try to come to some consensus on this one way or another. If we can't, then I think we should fall back to the status quo of &[u8]. So I think the options are:

  • &[u8]: the current approach. Can be optimized by some binary formats compared to tuples/sequences. Produces a redundant length field. May be possible to do bytemucking to re-interpret serialized bits directly as a UUID.
  • [u8; 16]: what's on main. Is better for some binary formats that don't have an overhead per field on sequences, but is less optimal for some binary formats than &[u8]. Will cause breakage for existing data.
  • u128: what I'm currently proposing. Can be formatted optimally by any binary format. Isn't necessarily guaranteed to be supported by all serializers/deserializers. The encoded UUID won't look like a UUID. Will cause breakage for existing data.
  • Some kind of Serializer::serialize_byte_array<const N: usize>(bytes: &[u8; N]) in serde that forwards to &[u8]?
  • Anything else?

cc @smarnach @Marwes @mitsuhiko (as having originally provided input on the serde format)

Also, just to reiterate: this only affects non-human-readable formats like bincode. Other formats like serde-json are still going to use hyphenated strings by default.
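
For concreteness, here's a rough sketch of how each candidate reaches a serializer (illustrative only: Uuid is stubbed as a plain struct here, and the AsBytes/AsArray/AsU128 wrappers are just for comparison, not proposed API):

```rust
// Illustrative only: Uuid is stubbed so the three candidates can be compared side by side.
use serde::{Serialize, Serializer};

struct Uuid {
    bytes: [u8; 16],
}

// Option 1: &[u8] via serialize_bytes (the status quo). Binary formats typically
// emit a length prefix followed by the 16 bytes.
struct AsBytes<'a>(&'a Uuid);

impl Serialize for AsBytes<'_> {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        serializer.serialize_bytes(&self.0.bytes)
    }
}

// Option 2: [u8; 16] (what's on main). The array's own Serialize impl goes
// through serde's tuple machinery, one element at a time.
struct AsArray<'a>(&'a Uuid);

impl Serialize for AsArray<'_> {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        self.0.bytes.serialize(serializer)
    }
}

// Option 3: u128. Compact where supported, but serialize_u128 errors on some serializers.
struct AsU128<'a>(&'a Uuid);

impl Serialize for AsU128<'_> {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        serializer.serialize_u128(u128::from_be_bytes(self.0.bytes))
    }
}
```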

@KodrAus
Member Author

KodrAus commented Nov 2, 2021

From @mitsuhiko #553 (comment)

I think the serde change is not positive. It's untypical, generally serde hasn't been used like this and it's going to break a lot of formats. I'm also not convinced that u128 is a better solution here. Does the length information cause a problem in practice that it's worth considering going out of the way to do something untypical here?

@KodrAus
Member Author

KodrAus commented Nov 2, 2021

From @Icarium-Lifestealer on Reddit:

I don't understand the serde change. Which type of the serde data model does a uuid map to now, if not byte array [u8]? A tuple with 16 members? (that sounds very inefficient for formats where fields have overhead)

@KodrAus changed the title from "Rethink the [u8; 16] binary representation" to "Rethink the serde [u8; 16] binary representation" on Nov 3, 2021
@rakshith-ravi

Just a random idea, but how about a feature flag for the different formats? For example, a serde-simple feature could use the Simple format in serde, and accordingly we can have different feature flags for different formats, with the default ones being the existing implementations, to avoid breaking existing crates. Additionally, people can enable their own feature flags for whichever format they specifically want to use, in case they want something different.

@kinggoesgaming
Member

Just a random idea, but how about a feature flag for the different formats? For example, a serde-simple feature could use the Simple format in serde, and accordingly we can have different feature flags for different formats, with the default ones being the existing implementations, to avoid breaking existing crates. Additionally, people can enable their own feature flags for whichever format they specifically want to use, in case they want something different.

The problem with this is conflicting feature flags downstream, when different libraries decide to use one or other format by default.

@KodrAus
Member Author

KodrAus commented Nov 3, 2021

I think the serde change is not positive. It's untypical, generally serde hasn't been used like this and it's going to break a lot of formats.

Hmm, I wouldn't consider serializing as [u8; 16] particularly untypical. If we'd just dropped a #[derive(Serialize)] on Uuid, that's exactly what we'd get.

If we are going to change the format at all, I think u128 is probably a better bet than [u8; 16]. It gives any serializer a chance to do the optimal thing with a 128-bit value, rather than &[u8], where the length is redundant, or [u8; 16], where the serializer can't distinguish an array of bytes from an array of anything else. Not knowing that those 128 bits encoded in the format are actually a UUID doesn't seem like a major drawback. If you do want zero-copy deserialization, there are other tools besides serde (zerocopy and bytemuck) that facilitate it.

@rakshith-ravi

rakshith-ravi commented Nov 3, 2021

The problem with this is conflicting feature flags downstream, when different libraries decide to use one or other format by default.

So if someone wanted to use a specific format for their UUIDs in their application, how would they go about doing it without manually implementing Serialize or Deserialize?

@KodrAus
Member Author

KodrAus commented Nov 3, 2021

@rakshith-ravi They'd have to manually implement Serialize or Deserialize to tweak the way specific fields are serialized. There are attributes you can use with #[derive(Serialize, Deserialize)] to change things without needing to manually implement the traits, but Cargo features aren't really a tool for altering the public contract of a library in any way that isn't purely additive (and the serialization format is a part of the public contract).
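
For example, something along these lines, using serde's with attribute (the hex_string module here is hypothetical application code, not a uuid API, and to_simple assumes the uuid 0.8 naming):

```rust
// Illustrative only: hex_string is a module an application could write itself to
// override how one field is serialized, without hand-writing Serialize/Deserialize
// for the whole struct.
use serde::{Deserialize, Serialize};
use uuid::Uuid;

mod hex_string {
    use serde::{Deserialize, Deserializer, Serializer};
    use uuid::Uuid;

    pub fn serialize<S: Serializer>(id: &Uuid, serializer: S) -> Result<S::Ok, S::Error> {
        // to_simple is the uuid 0.8 spelling for the un-hyphenated form.
        serializer.serialize_str(&id.to_simple().to_string())
    }

    pub fn deserialize<'de, D: Deserializer<'de>>(deserializer: D) -> Result<Uuid, D::Error> {
        let s = String::deserialize(deserializer)?;
        s.parse().map_err(serde::de::Error::custom)
    }
}

#[derive(Serialize, Deserialize)]
struct Record {
    #[serde(with = "hex_string")]
    id: Uuid,
    name: String,
}
```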

@Marwes
Contributor

Marwes commented Nov 3, 2021

From @Icarium-Lifestealer on Reddit:

I don't understand the serde change. Which type of the serde data model does a uuid map to now, if not byte array [u8]? A tuple with 16 members? (that sounds very inefficient for formats where fields have overhead)

We could equally well state this problem as an issue with the format, rather than with serde/uuid. I suppose tuple serialization does not state up front that the fields all have the same type, so perhaps this indicates that serde should have a way to serialize/deserialize fixed-length arrays (currently you get either a fixed size with tuples or a fixed element type with sequences, but not both).

@KodrAus
Member Author

KodrAus commented Nov 3, 2021

perhaps this indicates that serde should have a way to serialize/deserialize fixed length arrays

If you peel the crossed-out sticker off that fourth point, I was wondering the same thing 🙂 Some kind of serialize_array<T: Serialize, const N: usize>(self, arr: &[T; N]) that forwards to serialize_seq (or possibly to serialize_tuple for compatibility) would be a nice fit. It could be something to propose to serde.
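
Purely as a sketch of the shape (this method doesn't exist in serde today, and the SerializerArrayExt extension trait is made up for illustration), a defaulted version could forward to the existing tuple machinery:

```rust
// Hypothetical: a defaulted serialize_array that behaves exactly like today's
// derived code for [T; N], so formats that don't care keep their current output.
use serde::ser::{Serialize, SerializeTuple, Serializer};

trait SerializerArrayExt: Serializer {
    fn serialize_array<T: Serialize, const N: usize>(
        self,
        arr: &[T; N],
    ) -> Result<Self::Ok, Self::Error> {
        // Default: element-by-element through the tuple machinery.
        let mut tuple = self.serialize_tuple(N)?;
        for element in arr {
            tuple.serialize_element(element)?;
        }
        tuple.end()
    }
}

// Blanket impl so every serializer picks up the default behaviour.
impl<S: Serializer> SerializerArrayExt for S {}
```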

@KodrAus
Member Author

KodrAus commented Nov 3, 2021

Unless there’s some clear desire to figure out [T; N] support in serde in the immediate future though I don’t think we should let that block us.

@mitsuhiko

I think the problem is not so much that serde does not have array serialization but that serde does not have a good strategy for bytes. In most formats there is a significant difference between bytes and arrays or vectors of other types. That's why serde_bytes is needed, for instance, to invoke the bytes interface. There is, however, no equivalent to "byte array of size N" or "byte tuple".

It's true that serde generates tuple serialization code for any fixed-size array, and I think this is absolutely valid when you're working with non-u8 element types. For u8, however, this has been shown to be problematic.
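
For reference, the existing opt-in looks like this; serde_bytes covers Vec<u8> and &[u8] fields, and the fixed-size "byte array of size N" case is the gap described above:

```rust
// serde_bytes routes the field through the serializer's bytes interface instead of
// the element-by-element sequence code that the derive would otherwise produce.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Packet {
    #[serde(with = "serde_bytes")]
    payload: Vec<u8>,
}
```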

@mitsuhiko

Unless there’s some clear desire to figure out [T; N] support in serde in the immediate future though I don’t think we should let that block us.

There is no way to support this in serde because of an unfortunate decision to have different trait bounds for [T; 0]. So we're going to be stuck with this for a long time.

@Marwes
Contributor

Marwes commented Nov 3, 2021

Serde wouldn't need to explicitly support [T; N], though; a method like fn serialize_fixed_array(&mut self, array: &[T]) would work just as well.

@sfackler
Contributor

sfackler commented Nov 3, 2021

IMO going through serialize_bytes explicitly for non-human-readable serializers would be a better approach than going through arrays. It both more closely matches the behavior of serializers in other languages (e.g. Jackson), and should be more efficient since the serializer gets the whole buffer up-front rather than being fed each byte in sequence.
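
Sketched out (illustrative only, not the crate's implementation; Wrapper is a stand-in and to_hyphenated assumes the uuid 0.8 naming), that would look something like:

```rust
// Branch on is_human_readable so compact formats hit serialize_bytes while text
// formats keep the hyphenated string.
use serde::{Serialize, Serializer};
use uuid::Uuid;

struct Wrapper(Uuid);

impl Serialize for Wrapper {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        if serializer.is_human_readable() {
            serializer.serialize_str(&self.0.to_hyphenated().to_string())
        } else {
            serializer.serialize_bytes(self.0.as_bytes())
        }
    }
}
```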

@smarnach

smarnach commented Nov 3, 2021

I still think the most natural approach is to serialize a Uuid as a [u8; 16], because it is a [u8;16].

[…] should be more efficient since the serializer gets the whole buffer up-front rather than being fed each byte in sequence.

At least for the bincode serde format, it's less efficient to serialize it as a byte array, since that requires a redundant length tag, so it ends up serializing to 24 bytes rather than 16. This was the motivation for changing the serialization format.

What binary serde formats would be adversely affected by serializing as a [u8; 16]?
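
For reference, the bincode difference is easy to reproduce (assuming bincode 1.x defaults, where variable-length sequences carry a u64 length prefix and fixed-size arrays, serialized as tuples, don't):

```rust
fn main() {
    let bytes = [0u8; 16];

    // Slice: 8-byte length prefix + 16 payload bytes = 24.
    let as_slice = bincode::serialize(&bytes[..]).unwrap();
    assert_eq!(as_slice.len(), 24);

    // Fixed-size array: no prefix, just the 16 bytes.
    let as_array = bincode::serialize(&bytes).unwrap();
    assert_eq!(as_array.len(), 16);
}
```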

@sfackler
Contributor

sfackler commented Nov 3, 2021

What binary serde formats would be adversely affected by serializing as a [u8; 16]?

Any binary JSON-style format: serde-cbor, serde-smile, serde-bson, etc.

@KodrAus
Member Author

KodrAus commented Nov 3, 2021

It’s sounding like [u8; 16] is probably not the way forward then for various reasons. So our remaining options are:

  • Go back to serialize_bytes and continue to wear the redundant length.
  • See if there’s any interest in a serialize_array, or an even more narrowly scoped serialize_byte_array method, in serde that would let us tell a format that the length of the byte array is fixed. I could imagine this would need some coordinated support in various formats to be usable.
  • Use u128.

Are there any thoughts on u128?

@sfackler
Contributor

sfackler commented Nov 3, 2021

I'd be worried about the portability of u128 across formats - for example, serde-cbor can't encode them.

@KodrAus
Member Author

KodrAus commented Nov 3, 2021

@sfackler Ah ok, is it that it simply can’t encode them? Or that it needs to use a fallback? I’ve seen some formats route 128-bit numbers through serialize_bytes.

I’d kind of expected 128-bit number support to be pretty good by now (at least not hitting the error path in the majority of formats), but maybe it’s just not there yet.

@KodrAus
Member Author

KodrAus commented Nov 3, 2021

Ok, I've had a look at a couple of binary formats to see how the serialized size compares between &[u8], [u8; 16], and u128. These results are for a struct containing a single field, where that field is one of the proposed representations:

bincode (bytes): Ok(24)
bincode (tuple): Ok(16)
bincode (u128) : Ok(16)

msgpack (bytes): Ok(19)
msgpack (tuple): Ok(20)
msgpack (u128) : Ok(19)

flexbuf (bytes): Ok(33)
flexbuf (tuple): Ok(33)
flexbuf (u128) : Err(Serde("u128 is not supported"))

postcrd (bytes): Ok(17)
postcrd (tuple): Ok(16)
postcrd (u128) : Ok(16)

pickle  (bytes): Ok(34)
pickle  (tuple): Ok(50)
pickle  (u128) : Err(Syntax(Structure("u128 is not supported")))

cbor    (bytes): Ok(24)
cbor    (tuple): Ok(24)
cbor    (u128) : Ok(25)

It looks like 128-bit numbers are indeed rather spottily supported, so I think that rules u128 out as an option.

But [u8; 16] doesn't actually look too bad on this sample:

  • pickle has a very bad time with it: the output becomes ~1.5x larger (34 → 50 bytes), roughly the inverse of the size reduction bincode gets going from bytes to tuples.
  • msgpack requires an extra byte.

I'd be keen to add other non-human-readable formats to this list and evaluate them, but on balance [u8; 16] actually looks ok for UUIDs.
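
For anyone who wants to extend the table, here's a rough reconstruction of the kind of harness involved (not the exact code used above; only bincode is shown, and the as_bytes helper just forces the serialize_bytes codepath):

```rust
use serde::{Serialize, Serializer};

// Force serialize_bytes for the "bytes" row.
fn as_bytes<S: Serializer>(v: &[u8; 16], serializer: S) -> Result<S::Ok, S::Error> {
    serializer.serialize_bytes(v)
}

#[derive(Serialize)]
struct AsBytes {
    #[serde(serialize_with = "as_bytes")]
    id: [u8; 16],
}

#[derive(Serialize)]
struct AsTuple {
    id: [u8; 16],
}

#[derive(Serialize)]
struct AsU128 {
    id: u128,
}

fn main() {
    let raw = [0u8; 16];

    // Swap in other formats' serialize functions to fill out the rest of the table.
    println!("bincode (bytes): {:?}", bincode::serialize(&AsBytes { id: raw }).map(|v| v.len()));
    println!("bincode (tuple): {:?}", bincode::serialize(&AsTuple { id: raw }).map(|v| v.len()));
    println!("bincode (u128) : {:?}", bincode::serialize(&AsU128 { id: u128::from_le_bytes(raw) }).map(|v| v.len()));
}
```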

@KodrAus
Member Author

KodrAus commented Nov 4, 2021

I think @sfackler's original concern about the performance of feeding the serializer each byte in sequence through the tuple path is also justified though.

Serializing a struct with a single field holding each of the various representations on my M1 MacBook with maximum optimizations results in something like:

running 18 tests
test bincode_bytes     ... bench:          25 ns/iter (+/- 0)
test bincode_tuple     ... bench:          37 ns/iter (+/- 1)
test bincode_u128      ... bench:          25 ns/iter (+/- 0)

test cbor_bytes        ... bench:          89 ns/iter (+/- 0)
test cbor_tuple        ... bench:         161 ns/iter (+/- 23)
test cbor_u128         ... bench:         122 ns/iter (+/- 1)

test flexbuffers_bytes ... bench:         376 ns/iter (+/- 18)
test flexbuffers_tuple ... bench:         747 ns/iter (+/- 24)
test flexbuffers_u128  ... bench:         283 ns/iter (+/- 1)

test msgpack_bytes     ... bench:          31 ns/iter (+/- 0)
test msgpack_tuple     ... bench:          84 ns/iter (+/- 0)
test msgpack_u128      ... bench:          31 ns/iter (+/- 0)

test pickle_bytes      ... bench:          43 ns/iter (+/- 0)
test pickle_tuple      ... bench:         112 ns/iter (+/- 10)
test pickle_u128       ... bench:         104 ns/iter (+/- 0)

test postcard_bytes    ... bench:          97 ns/iter (+/- 12)
test postcard_tuple    ... bench:          63 ns/iter (+/- 0)
test postcard_u128     ... bench:          29 ns/iter (+/- 0)

The tuple representation is more expensive in all cases except for postcard.

This still seems to be the case even when we amortize the cost of re-allocating the Vec multiple times:

test bincode_bytes      ... bench:          25 ns/iter (+/- 0)
test bincode_bytes_into ... bench:           4 ns/iter (+/- 0)
test bincode_tuple      ... bench:          38 ns/iter (+/- 0)
test bincode_tuple_into ... bench:          33 ns/iter (+/- 0)
test bincode_u128       ... bench:          25 ns/iter (+/- 0)
test bincode_u128_into  ... bench:           0 ns/iter (+/- 0)

@KodrAus
Member Author

KodrAus commented Nov 4, 2021

I started writing up a proposal for supporting fixed-size byte arrays in serde and hit a roadblock: I don't think it's possible to avoid the trade-offs of &[u8] vs [u8; N] without explicit Deserializer support. Edit: when it's all mapped out it looks a bit uncompelling, but I've opened serde-rs/serde#2120

Since serde has made the decision that [T; N] is the same as (T, ..N) I don't think any proposal for general-purpose array support beyond tuples is likely to get any traction.

I think that leaves us picking between the trade-offs of &[u8] and [u8; 16]. So now it's a question of:

  • &[u8]: potential space savings for some formats and overhead for others.
  • [u8; 16]: potential space savings for some formats and overhead for others, plus potential performance overhead and the need for an ecosystem migration.

@KodrAus
Member Author

KodrAus commented Nov 4, 2021

Maybe we can do a bit of a reaction poll to gauge the current temperature. If it's not possible to get the best of both through explicit fixed-size byte-array support, should we:

  • 🚀 to revert back to &[u8]
  • 🎉 to stick to [u8; 16]

(I couldn't think of which option should get the 👍 vs the 👎 so we get rockets and party poppers)

@sfackler
Contributor

sfackler commented Nov 4, 2021

If we want to hit the serialize_bytes codepath, self.as_bytes().serialize(ser) won't work - all slices use serialize_seq instead afaik.

@KodrAus
Member Author

KodrAus commented Nov 4, 2021

If we want to hit the serialize_bytes codepath, self.as_bytes().serialize(ser) won't work

Yeh, the old implementation would call serializer.serialize_bytes itself.
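
In other words, the distinction is roughly this (a sketch, not the exact old implementation; the function names are just for illustration):

```rust
use serde::{Serialize, Serializer};

fn via_slice<S: Serializer>(bytes: &[u8], serializer: S) -> Result<S::Ok, S::Error> {
    // Forwarding to the slice's own Serialize impl goes through serialize_seq:
    // length up front, then each element.
    bytes.serialize(serializer)
}

fn via_bytes<S: Serializer>(bytes: &[u8], serializer: S) -> Result<S::Ok, S::Error> {
    // What the previous implementation did: call the bytes interface itself.
    serializer.serialize_bytes(bytes)
}
```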

@KodrAus
Member Author

KodrAus commented Nov 8, 2021

So, based on everything we've learned about the trade-offs of &[u8] vs [u8; 16] vs u128, I think we're stuck without a single ideal approach that's going to suit everybody's needs. Uuid needs to pick one of them, though. When we weigh up each option's potential space savings, performance implications, compatibility, and potential for churn and breakage, the most appropriate choice is to revert back to serialize_bytes. I'm sorry to the folks who wanted [u8; 16]. I can appreciate that for your needs it's the best choice, and the 8 redundant bytes in bincode are definitely unfortunate, but we don't have the luxury of an option that appeals to even a majority of users, so in this case I think the status quo wins.

It's been great to really dig into this. I appreciate everyone's involvement, and I'll still give serde-rs/serde#2120 some time to play out, but realistically I expect we'll be back at serialize_bytes.
