New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deserialize invalid UTF-8 into byte bufs as WTF-8 #877
base: master
Are you sure you want to change the base?
Conversation
// TODO: the error message is wrong, this is a lone | ||
// _trailing_ surrogate | ||
error(read, ErrorCode::LoneLeadingSurrogateInHexEscape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth adding another error code? Would doing so even be semver compatible? (unrelated to this PR)
} | ||
|
||
#[test] | ||
fn test_byte_buf_de_surrogate_pair() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was no test for parsing valid surrogate pairs into byte bufs that I could find, so I added one.
c3bfa51
to
fbd1d68
Compare
Previously serde-rs#828 added support for deserializing lone leading and trailing surrogates into WTF-8 encoded bytes when deserializing a string as bytes. This commit extends this to cover the case of a leading surrogate followed by code units that are not trailing surrogates. This allows for deserialization of "\ud83c\ud83c" (two leading surrogates), or "\ud83c\u0061" (a leading surrogate followed by "a"). The docs also now make it clear that we are serializing the invalid code points as WTF-8. This reference to WTF-8 signals to the user that they can use a WTF-8 parser on the bytes to construct a valid UTF-8 string.
fbd1d68
to
f50e296
Compare
@dtolnay Have you had a chance to look into this? It'd be great to get your review. |
Previously #828 added support for deserializing lone leading and
trailing surrogates into WTF-8 encoded bytes when deserializing a string
as bytes. This commit extends this to cover the case of a leading
surrogate followed by code units that are not trailing surrogates. This
allows for deserialization of "\ud83c\ud83c" (two leading surrogates),
or "\ud83c\u0061" (a leading surrogate followed by "a").
The docs also now make it clear that we are serializing the invalid code
points as WTF-8. This reference to WTF-8 signals to the user that they
can use a WTF-8 parser on the bytes to construct a valid UTF-8 string.
Follow up to #830 (review).