Deserializing lone surrogate to ByteBuf fails when it's nested in an enum #1089

Open
helixbass opened this issue Dec 10, 2023 · 1 comment

@helixbass

Hi, from what I can tell this may be a bug here (rather than a misunderstanding on my part).

I'm trying to deserialize JSON that contains strings with unpaired ("lone") surrogates. I saw e.g. #828 and so tried making the target field type a serde_bytes::ByteBuf, but deserialization still fails with an "unexpected end of hex escape" error.
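For context on why the target type matters: a Rust String must be valid UTF-8, so it cannot hold a lone surrogate at all. The following is a minimal std-only sketch of that constraint; `decode_strict` is a hypothetical helper name and none of this is serde_json code.

```rust
/// Hypothetical helper: decode UTF-16 code units into a String,
/// returning None if the input contains an unpaired surrogate.
fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

fn main() {
    // "a" followed by the lone high surrogate U+D83D.
    let lone: [u16; 2] = [0x0061, 0xD83D];
    // A properly paired surrogate sequence encoding U+1F600.
    let paired: [u16; 2] = [0xD83D, 0xDE00];

    // A surrogate code point is not a valid `char`.
    assert!(char::from_u32(0xD83D).is_none());

    // Strict decoding rejects the lone surrogate but accepts the pair.
    assert_eq!(decode_strict(&lone), None);
    assert_eq!(decode_strict(&paired), Some("\u{1F600}".to_string()));

    // Lossy decoding succeeds, but replaces the surrogate with U+FFFD,
    // so the original data is not preserved.
    assert_eq!(String::from_utf16_lossy(&lone), "a\u{FFFD}");
}
```

This is why a byte-oriented target such as ByteBuf is attractive here: it is not bound by the UTF-8 validity requirement.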

I made a test repo at https://github.com/helixbass/test-serde-bytes-unpaired-surrogate-deserialization. It seems to demonstrate that deserializing lone surrogates into a ByteBuf works as expected when the field is at the top level or nested inside an outer struct, but fails when the field is nested inside e.g. a tagged or untagged enum type. I don't understand the inner machinery of serde/serde_json, so I don't really have a guess as to why.

@helixbass
Author

Looking at e.g. the output of cargo expand, the basic reason seems to be that the derived Deserialize implementations for untagged and internally tagged enums use .deserialize_any() (plus Content?). That call has to decide "blindly" what to do with a JSON string it encounters, so it understandably defaults to deserializing it as a string rather than as bytes.
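The difference between decoding with and without a type hint can be sketched with a toy model. Everything below (`Hint`, `Decoded`, `LoneSurrogate`, `decode_json_string`) is hypothetical and std-only; it is not serde's or serde_json's actual API. The three-byte encoding used for the lone surrogate in the bytes branch is also an assumption chosen for illustration (a WTF-8-style form), not a statement of what serde_json emits.

```rust
// Toy model only: these are not serde's types or traits.

#[derive(Debug, PartialEq)]
struct LoneSurrogate(u16);

/// What the target type asked the format for.
enum Hint {
    Any,   // no expectation, as with a buffered enum body
    Bytes, // the target explicitly wants bytes
}

#[derive(Debug, PartialEq)]
enum Decoded {
    Str(String),
    Bytes(Vec<u8>),
}

/// `units` stands for a JSON string after unescaping, as UTF-16 code units.
fn decode_json_string(units: &[u16], hint: Hint) -> Result<Decoded, LoneSurrogate> {
    match hint {
        // No hint: a JSON string is assumed to be text, so it must be valid Unicode.
        Hint::Any => {
            let mut s = String::new();
            for r in std::char::decode_utf16(units.iter().copied()) {
                s.push(r.map_err(|e| LoneSurrogate(e.unpaired_surrogate()))?);
            }
            Ok(Decoded::Str(s))
        }
        // Bytes requested: a lone surrogate can be kept in a three-byte form.
        Hint::Bytes => {
            let mut out = Vec::new();
            for r in std::char::decode_utf16(units.iter().copied()) {
                match r {
                    Ok(c) => out.extend_from_slice(c.encode_utf8(&mut [0u8; 4]).as_bytes()),
                    Err(e) => {
                        let cp = u32::from(e.unpaired_surrogate());
                        out.extend_from_slice(&[
                            0xE0 | (cp >> 12) as u8,
                            0x80 | ((cp >> 6) & 0x3F) as u8,
                            0x80 | (cp & 0x3F) as u8,
                        ]);
                    }
                }
            }
            Ok(Decoded::Bytes(out))
        }
    }
}

fn main() {
    let input = [0x0061, 0xD83D]; // "a" + lone high surrogate
    // Without a hint the decode fails before any enum variant could be tried.
    assert_eq!(decode_json_string(&input, Hint::Any), Err(LoneSurrogate(0xD83D)));
    // With the hint, the same input is representable.
    assert_eq!(
        decode_json_string(&input, Hint::Bytes),
        Ok(Decoded::Bytes(vec![0x61, 0xED, 0xA0, 0xBD]))
    );
}
```

In this model the failure happens at the buffering step, which would match the observation that the ByteBuf field type never gets a chance to influence how the string is read.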

So I'm assuming this should not be considered a "bug" of any kind, but I'm curious whether there are known strategies for using serde/serde_json to deserialize e.g. internally tagged enums whose fields may contain unpaired surrogates.

The two general ideas that I can picture are:

  1. Have an alternate version of serde_json whose .deserialize_any() defaults to treating JSON strings as bytes, not as strings.
  2. If you know that your "tag" field comes first in the JSON (which I think is a safe assumption/invariant in my use case), defer deserializing the rest of the object's keys/values until after the tag field has been deserialized and recognized. At that point you know which enum variant you are deserializing, so you can avoid .deserialize_any(), similar to how I assume externally tagged enums avoid this issue.
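The second idea can be sketched as a toy, std-only model. All names here (`Raw`, `Node`, `decode_tag_first`, the "type" and "value" keys) are hypothetical, and the text payload is kept as UTF-16 code units rather than a ByteBuf to keep the sketch short. With serde itself, this would presumably be a hand-written Deserialize impl whose Visitor reads the first map key before choosing how to read the rest; that part is not shown.

```rust
// Toy model of "read the tag first, then decode with a known type".

/// A raw field value before any type expectation is applied.
enum Raw {
    Str(Vec<u16>), // unescaped UTF-16 code units, possibly with lone surrogates
    Int(i64),
}

#[derive(Debug, PartialEq)]
enum Node {
    Text { value: Vec<u16> }, // kept lossless instead of forcing a String
    Count { value: i64 },
}

fn decode_tag_first(fields: &[(&str, Raw)]) -> Result<Node, String> {
    // Step 1: the tag must be the first key and must be an ordinary string.
    let tag = match fields.first() {
        Some((key, Raw::Str(units))) if *key == "type" => {
            String::from_utf16(units).map_err(|e| e.to_string())?
        }
        _ => return Err("expected \"type\" as the first key".to_string()),
    };
    // Step 2: the variant is now known, so the remaining field is decoded
    // against a concrete expectation instead of a guess.
    let value = fields
        .iter()
        .skip(1)
        .find(|(k, _)| *k == "value")
        .map(|(_, v)| v);
    match (tag.as_str(), value) {
        ("text", Some(Raw::Str(units))) => Ok(Node::Text { value: units.clone() }),
        ("count", Some(Raw::Int(n))) => Ok(Node::Count { value: *n }),
        (other, _) => Err(format!("unknown tag or bad field for {:?}", other)),
    }
}

fn main() {
    let lone = vec![0x0061u16, 0xD83D]; // "a" + lone high surrogate
    let text = [
        ("type", Raw::Str("text".encode_utf16().collect())),
        ("value", Raw::Str(lone.clone())),
    ];
    assert_eq!(decode_tag_first(&text), Ok(Node::Text { value: lone }));

    let count = [
        ("type", Raw::Str("count".encode_utf16().collect())),
        ("value", Raw::Int(3)),
    ];
    assert_eq!(decode_tag_first(&count), Ok(Node::Count { value: 3 }));
}
```

The trade-off is that input with the tag in any other position is rejected, which is only acceptable if tag-first ordering really is an invariant of the data.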
