Allow lone surrogates in raw values #830

lucacasonato · 2021-11-25T13:41:29Z

See https://github.com/serde-rs/json/issues/827#issuecomment-979208987

lucacasonato · 2021-11-25T16:41:18Z

This allows for a proper LossyString type for serde_json: https://gist.github.com/lucacasonato/772e3bbed29db7835d45ea1aa1a6f1c7.

dtolnay

I don't follow why #828 is not suitable for your use case. How come all the string processing logic has to be duplicated instead of deserializing to bytebuf and patching out all the leftover surrogates with fffd?

lucacasonato · 2021-11-25T20:02:18Z

fn main() {
    println!("{}", String::from_utf8_lossy(&[b'a', 237, 160, 188, b'b']));
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=9e9722459d4d8eaa4989ca88e85deef3

Because UTF-8 decoders will treat this sequence that is meant to encode the lone surrogate as a sequence of three invalid codepoints, thus emitting 3 U+FFFD chars instead of just one. This is the case for both the UTF8 parser in Rust's stdlib, and the one in encoding_rs, and the one specified by https://encoding.spec.whatwg.org (browser ones). As such, I don't think this is a valid way of representing lone surrogates.

The PR is useful even if no lone surrogates are used, because right now RawValue can not represent strings with lone surrogates, even though ByteBuf can decode them.

Ideally LossyString would be a serde_json native magic type, like RawValue, that will handle the replacement chars internally. I was not sure how in favor you'd be of those, which is why I prototyped it out of tree.

dtolnay · 2021-11-25T20:26:23Z

You're fully in control of the interpretation of bytes to utf-8 though. It doesn't really seem relevant how other logic that interprets bytes to utf-8 does it if those aren't the logic you need.

My expectation would be something like:

let mut bytes: ByteBuf = /* ... */;
for i in 2..bytes.len() {
    // surrogates are codepoints 1101_1xxxxx_xxxxxx
    if bytes[i - 2] == 0b1110_1101 && bytes[i - 1] >= 0b1010_0000 {
        bytes[i - 2..=i].copy_from_slice("\u{fffd}".as_bytes());
    }
}

dtolnay · 2021-11-25T21:01:57Z

To elaborate a bit in response to:

I don't think this is a valid way of representing lone surrogates.

If "\ud83c" gets assigned some meaning by serde_json (as required by #827, since JSON.stringify can apparently emit such a string in JS), then there needs to be some way we pick to represent it. Whatever we do is a relatively arbitrary choice but the reason that the representation I suggested in #827 is a good one is it's lossless — so for example one could take split pieces of a string from JS, like JSON.stringify(s.substr(0, i)) + JSON.stringify(s.substr(i)), deserialize them in Rust and reconstruct the original unsplit string even when the split turned out to fall in the middle of a utf-16 surrogate pair. Among all the possibilities for a lossless representation I believe this is the most straightforward one, but I would be open to one that you find more straightforward.

lucacasonato · 2021-11-25T21:06:08Z

That would work, but it still seems pretty unfortunate that the ByteBuf contains a byte sequence that looks mostly like UTF-8 encoded text, but isn't really, and a UTF-8 decoder will choke up on it. I'd much rather prefer this:

A ByteBuf deserialized from a JSON string is always a byte sequence of a valid UTF-8 encoded string (error on lone surrogate / invalid surrogate pairs).
- Most people likely expect that they can put the result into either String::from_utf8 or String::from_utf8_lossy and get a reasonable result.
- This means reverting my previous patch.
If you want the behaviour where of lone surrogates and invalid surrogate pairs are replaced with U+FFFD, use a LossyString (replace on lone surrogate / invalid surrogate pairs).
- This could be built in to serde_json, or implemented elsewhere with copied escape parsing code
- This is likely what people want for many JSON messages coming from JavaScript, as that is what the built in UTF-8 encoder does - it never emits [237, 160, 188] (emitting U+FFFD instead)
- The current ByteBuf case only handles lone surrogates, but what about invalid surrogate pairs? I would want those to be replaced too, but serde_json just errors on them right now. The JSON string r#""\ud83c\uaaaa""# should emit a single U+FFFD, not two. Currently it always errors (both when decoding a ByteBuf and regular String). If ByteBuf would emit this invalid surrogate pair, how would one replace the pair with a single U+FFFD (not two!)
If you want full control over the string itself, you can parse it yourself using RawValue.
- RawValue should also be used if you are partially deserializing a message that could contain lone surrogates or invalid surrogate pairs. Value can not encode lone surrogates, or invalid surrogate pairs.
- RawValue is also useful if you are forwarding some valid JSON (currently not possible if that JSON contains lone surrogates or invalid surrogate pairs in \u escape sequences.

EDIT: looks like we were both typing at once :-) - this comment does not incorporate a response to your previous comment yet.

lucacasonato · 2021-11-25T21:28:11Z

so for example one could take split pieces of a string from JS, like JSON.stringify(s.substr(0, i)) + JSON.stringify(s.substr(i)), deserialize them in Rust and reconstruct the original unsplit string even when the split turned out to fall in the middle of a utf-16 surrogate pair

I don't really see how this is feasible to do with the current implementation. Let's say you have the string 🌎 and you split this at the first codepoint: JSON.stringify(["🌎".substring(0, 1), "🌎".substring(1, 2)]);. You will get two JSON strings: ["\\ud83c","\\udf0e"]. If I were to parse both of these strings using byte buffers, I'd get two byte buffers: 0xED 0xA0 0xBC and 0xED 0xBC 0x8E. Concatenated to a single byte buffer this would be 0xED 0xA0 0xBC 0xED 0xBC 0x8E. I could not send this through a UTF8 decoder directly. It would either error, or give me back a string with 6 replacement chars. I could go in and check if the last byte is a UTF-8 surrogate, and if the next byte is a matching pair for it and re-combine them, but this seems annoying. With this patch, it would be possible to do this using RawValue much easier:

// NOTE: implementation below likely not optimal.

fn main() {
    let str = r#"["\ud83c","\udf0e"]"#;
    let b1: Vec<Box<serde_json::value::RawValue>> = serde_json::from_str(str).unwrap();
    
    let mut intermediary = String::from('"');
    for buf in b1 {
        let str = buf.get();
        intermediary.push_str(&str[1..str.len() - 1]);
    }
    intermediary.push('"');
    
    let recombined: String = serde_json::from_str(&intermediary).unwrap();
    assert_eq!(recombined, "🌎");
}

lucacasonato · 2021-11-25T21:31:05Z

I just want to reiterate: whatever we settle on for ByteBuf (whatever representation it might or might not use for lone surrogates), this patch is still useful as it allows for partial deserialization of JSON containing strings that contain lone surrogates. I need that so the ByteBuf patch can actually be useful for the use case I had intended it for.

dtolnay · 2021-11-25T22:43:14Z

I don't really see how this is feasible to do with the current implementation.

Is this because we're not in agreement that the choice of representation is lossless, or because the algorithm for performing the stitching would be infeasible to implement in practice?

I could go in and check if the last byte is a UTF-8 surrogate, and if the next byte is a matching pair for it and re-combine them

Yep that's what I had in mind:

fn stitch(mut a: ByteBuf, b: ByteBuf) -> ByteBuf {
    let a_len = a.len();
    if let Some(&[0b1110_1101, a1 @ 0b1010_0000..=0b1010_1111, a2]) = a.get(a_len - 3..) {
        if let Some(&[0b1110_1101, b1 @ 0b1011_0000..=0b1011_1111, b2]) = b.get(..3) {
            let n = 1 + ((a1 & 0b0000_1111) as u32) << 16
                | ((a2 & 0b0011_1111) as u32) << 10
                | ((b1 & 0b0000_1111) as u32) << 6
                | (b2 & 0b0011_1111) as u32;
            if let Some(ch) = char::from_u32(n) {
                a.truncate(a_len - 3);
                a.reserve(ch.len_utf8() + b.len() - 3);
                a.extend_from_slice(ch.encode_utf8(&mut [0; 4]).as_bytes());
                a.extend_from_slice(&b[3..]);
                return a;
            }
        }
    }
    a.extend_from_slice(&b);
    a
}

Forcing this kind of manipulation to be done using RawValue instead is not a compelling solution because enabling RawValue anywhere in a dependency graph adds performance overhead into all JSON deserializations that don't even end up deserializing a RawValue, so it's often not an option for codebases, or mean that there are dependencies they won't be able to use. So whatever RawValue does or doesn't do, I feel it's important for there to be an approach that does not involve RawValue.

I'm also not planning on building lossy strings into serde_json, and not keen about large pieces of our string processing code being duplicated downstream.

For these reasons I'm pushing back because it makes sense to keep RawValue the least useful possible, subject to addressing all the use cases for which it actually is necessary.

lucacasonato · 2021-11-26T01:05:48Z

Forcing this kind of manipulation to be done using RawValue instead is not a compelling solution because enabling RawValue anywhere in a dependency graph adds performance overhead into all JSON deserializations that don't even end up deserializing a RawValue, so it's often not an option for codebases, or mean that there are dependencies they won't be able to use. So whatever RawValue does or doesn't do, I feel it's important for there to be an approach that does not involve RawValue.

I was not aware of the performance impact. Currently there is no way to do a partial deserialization of JSON that contains unpaired surrogates. I don't really see how that could be made to work with serde_json::Value, so I think RawValue is the only way that can be made to work. Do you have other ideas?

Is this because we're not in agreement that the choice of representation is lossless, or because the algorithm for performing the stitching would be infeasible to implement in practice?

I am not generally against a lossless representation, but the current choice of lossless representation is not great. I did figure out that this representation is not completely bespoke - it is called WTF-8. There is a Rust crate for it but it, but since WTF-8 must not be used for interchange there is no way to use that crate to get a WTF-8 view on an existing byte buffer containing WTF-8 bytes (this can be hacked around with some unsafe mem::transmutery).

Let's properly document the use of WTF-8 here though: a ByteBuf deserialized from from JSON strings contain WTF-8 encoded bytes, not UTF-8.

I'm also not planning on building lossy strings into serde_json, and not keen about large pieces of our string processing code being duplicated downstream.

Yeah - I understand, I can use the wtf8 crate + some mem::transmute to avoid this.

If the decision is made that the encoding we are going for is really WTF-8, we should follow the encoding spec properly: https://simonsapin.github.io/wtf-8/#encode-to-wtf-8. The case \ud83c\uaaaa is not handled correctly right now for example.

lucacasonato · 2022-01-26T23:04:47Z

@dtolnay Hey, quick poke - have you had a chance to look into / think about my comment above yet?

dtolnay

This looks good to me. Thanks!

The case \ud83c\uaaaa is not handled correctly right now for example.

I would accept a PR to address this case.

lucacasonato mentioned this pull request Nov 25, 2021

Provide an option to replace lone surrogates in strings with replacement characters #827

Closed

lucacasonato mentioned this pull request Nov 25, 2021

fix(repl): don't panic on lone surrogates denoland/deno#12899

Closed

dtolnay reviewed Nov 25, 2021

View reviewed changes

Allow lone surrogates in raw values

51e9616

lucacasonato force-pushed the support_lone_surrogates_in_raw_value branch from 5ec8141 to 51e9616 Compare November 26, 2021 01:08

dtolnay approved these changes Feb 12, 2022

View reviewed changes

dtolnay merged commit 7e56a40 into serde-rs:master Feb 12, 2022

lucacasonato deleted the support_lone_surrogates_in_raw_value branch April 11, 2022 19:57

lucacasonato mentioned this pull request Apr 12, 2022

Deserialize invalid UTF-8 into byte bufs as WTF-8 #877

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow lone surrogates in raw values #830

Allow lone surrogates in raw values #830

lucacasonato commented Nov 25, 2021

lucacasonato commented Nov 25, 2021

dtolnay left a comment

lucacasonato commented Nov 25, 2021

dtolnay commented Nov 25, 2021

dtolnay commented Nov 25, 2021

lucacasonato commented Nov 25, 2021 •

edited

lucacasonato commented Nov 25, 2021

lucacasonato commented Nov 25, 2021

dtolnay commented Nov 25, 2021 •

edited

lucacasonato commented Nov 26, 2021 •

edited

lucacasonato commented Jan 26, 2022

dtolnay left a comment

Allow lone surrogates in raw values #830

Allow lone surrogates in raw values #830

Conversation

lucacasonato commented Nov 25, 2021

lucacasonato commented Nov 25, 2021

dtolnay left a comment

Choose a reason for hiding this comment

lucacasonato commented Nov 25, 2021

dtolnay commented Nov 25, 2021

dtolnay commented Nov 25, 2021

lucacasonato commented Nov 25, 2021 • edited

lucacasonato commented Nov 25, 2021

lucacasonato commented Nov 25, 2021

dtolnay commented Nov 25, 2021 • edited

lucacasonato commented Nov 26, 2021 • edited

lucacasonato commented Jan 26, 2022

dtolnay left a comment

Choose a reason for hiding this comment

lucacasonato commented Nov 25, 2021 •

edited

dtolnay commented Nov 25, 2021 •

edited

lucacasonato commented Nov 26, 2021 •

edited