Deserialize invalid UTF-8 into byte bufs as WTF-8 #877

Open
wants to merge 1 commit into
base: master

@lucacasonato lucacasonato commented Apr 12, 2022

Previously, #828 added support for deserializing lone leading and
trailing surrogates into WTF-8 encoded bytes when deserializing a string
as bytes. This commit extends that support to a leading surrogate
followed by a code unit that is not a trailing surrogate. This allows
deserialization of "\ud83c\ud83c" (two leading surrogates) and of
"\ud83c\u0061" (a leading surrogate followed by "a").
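
To illustrate the intended byte output, here is a minimal std-only sketch of the decoding rule described above. It is not the code from this PR: `push_wtf8` and `units_to_wtf8` are hypothetical helper names, and the input is assumed to be the UTF-16 code units already parsed from the `\u` escapes.

```rust
/// Encode one code point with generalized UTF-8, the scheme WTF-8 uses.
/// Unlike strict UTF-8, this accepts surrogates (U+D800..=U+DFFF).
fn push_wtf8(out: &mut Vec<u8>, cp: u32) {
    match cp {
        0x0000..=0x007F => out.push(cp as u8),
        0x0080..=0x07FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x0800..=0xFFFF => {
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}

/// Hypothetical helper: turn UTF-16 code units into WTF-8 bytes.
/// A leading surrogate followed by a trailing surrogate is combined into
/// one supplementary code point; any other surrogate is encoded on its own.
fn units_to_wtf8(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < units.len() {
        let unit = units[i] as u32;
        let is_leading = (0xD800..=0xDBFF).contains(&unit);
        let next_is_trailing = units
            .get(i + 1)
            .map_or(false, |&n| (0xDC00..=0xDFFF).contains(&n));
        if is_leading && next_is_trailing {
            let low = units[i + 1] as u32;
            let cp = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
            push_wtf8(&mut out, cp);
            i += 2;
        } else {
            // Lone surrogate or ordinary BMP code unit: encode as-is.
            push_wtf8(&mut out, unit);
            i += 1;
        }
    }
    out
}
```

Under this sketch, the code units for "\ud83c\u0061" become the bytes `ED A0 BC 61`: the lone leading surrogate takes three bytes and the "a" stays a single byte.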

The docs also now make it clear that lone surrogates are encoded as
WTF-8. This reference to WTF-8 signals to the user that they can run a
WTF-8 parser over the bytes to construct a valid UTF-8 string.
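
As a rough sketch of what such a consumer-side step could look like, the function below converts WTF-8 bytes into a valid `String` by replacing each encoded surrogate with U+FFFD. `wtf8_to_string_lossy` is a hypothetical name, not part of serde_json, and the input is assumed to be well-formed WTF-8.

```rust
/// Hypothetical helper: lossy conversion from WTF-8 bytes to a UTF-8 String.
/// In WTF-8 a surrogate is the three-byte sequence ED, A0..=BF, 80..=BF.
/// Strict UTF-8 never allows A0..=BF after ED, so the pattern is unambiguous.
fn wtf8_to_string_lossy(bytes: &[u8]) -> String {
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        let is_surrogate = bytes[i] == 0xED
            && i + 2 < bytes.len()
            && (0xA0..=0xBF).contains(&bytes[i + 1])
            && (0x80..=0xBF).contains(&bytes[i + 2]);
        if is_surrogate {
            // One replacement character per encoded surrogate.
            out.extend_from_slice("\u{FFFD}".as_bytes());
            i += 3;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    // Any remaining malformed bytes are handled by the std lossy decoder.
    String::from_utf8_lossy(&out).into_owned()
}
```

A real application might prefer a dedicated WTF-8 library, or might want to keep the surrogates rather than replace them; the replacement policy here is only one possible choice.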

Follow up to #830 (review).

Comment on lines +886 to 888
// TODO: the error message is wrong, this is a lone
// _trailing_ surrogate
error(read, ErrorCode::LoneLeadingSurrogateInHexEscape)

Worth adding another error code? Would doing so even be semver compatible? (unrelated to this PR)

}

#[test]
fn test_byte_buf_de_surrogate_pair() {

There was no test for parsing valid surrogate pairs into byte bufs that I could find, so I added one.

lucacasonato commented May 18, 2022

@dtolnay Have you had a chance to look into this? It'd be great to get your review.
