Deserialize invalid UTF-8 into byte bufs as WTF-8 #877

lucacasonato · 2022-04-12T00:10:49Z

Previously #828 added support for deserializing lone leading and
trailing surrogates into WTF-8 encoded bytes when deserializing a string
as bytes. This commit extends this to cover the case of a leading
surrogate followed by code units that are not trailing surrogates. This
allows for deserialization of "\ud83c\ud83c" (two leading surrogates),
or "\ud83c\u0061" (a leading surrogate followed by "a").

The docs also now make it clear that the invalid code points are encoded
as WTF-8. Naming WTF-8 explicitly signals to users that they can run a
WTF-8 parser over the bytes to construct a valid UTF-8 string.
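
To make the byte-level behavior concrete, here is a minimal standalone sketch of WTF-8 encoding for a sequence of UTF-16 code units. It is not the code from this PR or from `src/read.rs`; the names `push_code_point` and `utf16_units_to_wtf8` are hypothetical and exist only for illustration. It assumes the usual WTF-8 rule: a leading surrogate immediately followed by a trailing surrogate is combined into one supplementary code point, and any other surrogate is written out on its own using the three-byte UTF-8 bit layout.

```rust
/// Append one code point using the UTF-8 bit layout.
/// Unlike strict UTF-8, surrogates (U+D800..=U+DFFF) are accepted
/// and come out as three bytes, which is what WTF-8 allows.
fn push_code_point(out: &mut Vec<u8>, cp: u32) {
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x800..=0xFFFF => {
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}

/// Hypothetical helper: convert UTF-16 code units (as produced by
/// JSON `\uXXXX` escapes) into WTF-8 bytes.
fn utf16_units_to_wtf8(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < units.len() {
        let unit = units[i] as u32;
        let is_lead = (0xD800..=0xDBFF).contains(&unit);
        // Look ahead: is the next unit a trailing surrogate?
        let next_trail = units
            .get(i + 1)
            .map(|&u| u as u32)
            .filter(|u| (0xDC00..=0xDFFF).contains(u));
        match (is_lead, next_trail) {
            // Valid pair: combine into one supplementary code point.
            (true, Some(trail)) => {
                let cp = 0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00);
                push_code_point(&mut out, cp);
                i += 2;
            }
            // Anything else, including a lone or mismatched surrogate,
            // is encoded on its own; the next unit is handled separately.
            _ => {
                push_code_point(&mut out, unit);
                i += 1;
            }
        }
    }
    out
}
```

Under these assumptions, the two inputs from the description, `[0xD83C, 0xD83C]` and `[0xD83C, 0x0061]`, yield `ED A0 BC ED A0 BC` and `ED A0 BC 61` respectively, while a valid pair such as `[0xD83C, 0xDF95]` yields ordinary four-byte UTF-8.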

Follow up to #830 (review).

lucacasonato · 2022-04-12T00:12:20Z

src/read.rs

+                        // TODO: the error message is wrong, this is a lone
+                        // _trailing_ surrogate
                        error(read, ErrorCode::LoneLeadingSurrogateInHexEscape)


Worth adding another error code? Would doing so even be semver compatible? (unrelated to this PR)

lucacasonato · 2022-04-12T00:13:53Z

tests/test.rs

+}
+
+#[test]
+fn test_byte_buf_de_surrogate_pair() {


There was no test for parsing valid surrogate pairs into byte bufs that I could find, so I added one.

lucacasonato · 2022-05-18T19:28:28Z

@dtolnay Have you had a chance to look into this? It'd be great to get your review.

lucacasonato force-pushed the wtf8_encoding branch from c3bfa51 to fbd1d68 Compare April 12, 2022 00:14

lucacasonato force-pushed the wtf8_encoding branch from fbd1d68 to f50e296 Compare May 18, 2022 19:28
