Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow lone surrogates in raw values #830

Merged

Conversation

lucacasonato
Copy link
Contributor

@lucacasonato
Copy link
Contributor Author

This allows for a proper LossyString type for serde_json: https://gist.github.com/lucacasonato/772e3bbed29db7835d45ea1aa1a6f1c7.

Copy link
Member

@dtolnay dtolnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow why #828 is not suitable for your use case. How come all the string processing logic has to be duplicated instead of deserializing to bytebuf and patching out all the leftover surrogates with fffd?

@lucacasonato
Copy link
Contributor Author

fn main() {
    println!("{}", String::from_utf8_lossy(&[b'a', 237, 160, 188, b'b']));
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=9e9722459d4d8eaa4989ca88e85deef3

Because UTF-8 decoders will treat this sequence that is meant to encode the lone surrogate as a sequence of three invalid codepoints, thus emitting 3 U+FFFD chars instead of just one. This is the case for both the UTF8 parser in Rust's stdlib, and the one in encoding_rs, and the one specified by https://encoding.spec.whatwg.org (browser ones). As such, I don't think this is a valid way of representing lone surrogates.

The PR is useful even if no lone surrogates are used, because right now RawValue can not represent strings with lone surrogates, even though ByteBuf can decode them.

Ideally LossyString would be a serde_json native magic type, like RawValue, that will handle the replacement chars internally. I was not sure how in favor you'd be of those, which is why I prototyped it out of tree.

@dtolnay
Copy link
Member

dtolnay commented Nov 25, 2021

You're fully in control of the interpretation of bytes to utf-8 though. It doesn't really seem relevant how other logic that interprets bytes to utf-8 does it if those aren't the logic you need.

My expectation would be something like:

let mut bytes: ByteBuf = /* ... */;
for i in 2..bytes.len() {
    // surrogates are codepoints 1101_1xxxxx_xxxxxx
    if bytes[i - 2] == 0b1110_1101 && bytes[i - 1] >= 0b1010_0000 {
        bytes[i - 2..=i].copy_from_slice("\u{fffd}".as_bytes());
    }
}

@dtolnay
Copy link
Member

dtolnay commented Nov 25, 2021

To elaborate a bit in response to:

I don't think this is a valid way of representing lone surrogates.

If "\ud83c" gets assigned some meaning by serde_json (as required by #827, since JSON.stringify can apparently emit such a string in JS), then there needs to be some way we pick to represent it. Whatever we do is a relatively arbitrary choice but the reason that the representation I suggested in #827 is a good one is it's lossless — so for example one could take split pieces of a string from JS, like JSON.stringify(s.substr(0, i)) + JSON.stringify(s.substr(i)), deserialize them in Rust and reconstruct the original unsplit string even when the split turned out to fall in the middle of a utf-16 surrogate pair. Among all the possibilities for a lossless representation I believe this is the most straightforward one, but I would be open to one that you find more straightforward.

@lucacasonato
Copy link
Contributor Author

lucacasonato commented Nov 25, 2021

That would work, but it still seems pretty unfortunate that the ByteBuf contains a byte sequence that looks mostly like UTF-8 encoded text, but isn't really, and a UTF-8 decoder will choke up on it. I'd much rather prefer this:

  • A ByteBuf deserialized from a JSON string is always a byte sequence of a valid UTF-8 encoded string (error on lone surrogate / invalid surrogate pairs).
    • Most people likely expect that they can put the result into either String::from_utf8 or String::from_utf8_lossy and get a reasonable result.
    • This means reverting my previous patch.
  • If you want the behaviour where of lone surrogates and invalid surrogate pairs are replaced with U+FFFD, use a LossyString (replace on lone surrogate / invalid surrogate pairs).
    • This could be built in to serde_json, or implemented elsewhere with copied escape parsing code
    • This is likely what people want for many JSON messages coming from JavaScript, as that is what the built in UTF-8 encoder does - it never emits [237, 160, 188] (emitting U+FFFD instead)
    • The current ByteBuf case only handles lone surrogates, but what about invalid surrogate pairs? I would want those to be replaced too, but serde_json just errors on them right now. The JSON string r#""\ud83c\uaaaa""# should emit a single U+FFFD, not two. Currently it always errors (both when decoding a ByteBuf and regular String). If ByteBuf would emit this invalid surrogate pair, how would one replace the pair with a single U+FFFD (not two!)
  • If you want full control over the string itself, you can parse it yourself using RawValue.
    • RawValue should also be used if you are partially deserializing a message that could contain lone surrogates or invalid surrogate pairs. Value can not encode lone surrogates, or invalid surrogate pairs.
    • RawValue is also useful if you are forwarding some valid JSON (currently not possible if that JSON contains lone surrogates or invalid surrogate pairs in \u escape sequences.

EDIT: looks like we were both typing at once :-) - this comment does not incorporate a response to your previous comment yet.

@lucacasonato
Copy link
Contributor Author

so for example one could take split pieces of a string from JS, like JSON.stringify(s.substr(0, i)) + JSON.stringify(s.substr(i)), deserialize them in Rust and reconstruct the original unsplit string even when the split turned out to fall in the middle of a utf-16 surrogate pair

I don't really see how this is feasible to do with the current implementation. Let's say you have the string 🌎 and you split this at the first codepoint: JSON.stringify(["🌎".substring(0, 1), "🌎".substring(1, 2)]);. You will get two JSON strings: ["\\ud83c","\\udf0e"]. If I were to parse both of these strings using byte buffers, I'd get two byte buffers: 0xED 0xA0 0xBC and 0xED 0xBC 0x8E. Concatenated to a single byte buffer this would be 0xED 0xA0 0xBC 0xED 0xBC 0x8E. I could not send this through a UTF8 decoder directly. It would either error, or give me back a string with 6 replacement chars. I could go in and check if the last byte is a UTF-8 surrogate, and if the next byte is a matching pair for it and re-combine them, but this seems annoying. With this patch, it would be possible to do this using RawValue much easier:

// NOTE: implementation below likely not optimal.

fn main() {
    let str = r#"["\ud83c","\udf0e"]"#;
    let b1: Vec<Box<serde_json::value::RawValue>> = serde_json::from_str(str).unwrap();
    
    let mut intermediary = String::from('"');
    for buf in b1 {
        let str = buf.get();
        intermediary.push_str(&str[1..str.len() - 1]);
    }
    intermediary.push('"');
    
    let recombined: String = serde_json::from_str(&intermediary).unwrap();
    assert_eq!(recombined, "🌎");
}

@lucacasonato
Copy link
Contributor Author

I just want to reiterate: whatever we settle on for ByteBuf (whatever representation it might or might not use for lone surrogates), this patch is still useful as it allows for partial deserialization of JSON containing strings that contain lone surrogates. I need that so the ByteBuf patch can actually be useful for the use case I had intended it for.

@dtolnay
Copy link
Member

dtolnay commented Nov 25, 2021

I don't really see how this is feasible to do with the current implementation.

Is this because we're not in agreement that the choice of representation is lossless, or because the algorithm for performing the stitching would be infeasible to implement in practice?

I could go in and check if the last byte is a UTF-8 surrogate, and if the next byte is a matching pair for it and re-combine them

Yep that's what I had in mind:

fn stitch(mut a: ByteBuf, b: ByteBuf) -> ByteBuf {
    let a_len = a.len();
    if let Some(&[0b1110_1101, a1 @ 0b1010_0000..=0b1010_1111, a2]) = a.get(a_len - 3..) {
        if let Some(&[0b1110_1101, b1 @ 0b1011_0000..=0b1011_1111, b2]) = b.get(..3) {
            let n = 1 + ((a1 & 0b0000_1111) as u32) << 16
                | ((a2 & 0b0011_1111) as u32) << 10
                | ((b1 & 0b0000_1111) as u32) << 6
                | (b2 & 0b0011_1111) as u32;
            if let Some(ch) = char::from_u32(n) {
                a.truncate(a_len - 3);
                a.reserve(ch.len_utf8() + b.len() - 3);
                a.extend_from_slice(ch.encode_utf8(&mut [0; 4]).as_bytes());
                a.extend_from_slice(&b[3..]);
                return a;
            }
        }
    }
    a.extend_from_slice(&b);
    a
}

Forcing this kind of manipulation to be done using RawValue instead is not a compelling solution because enabling RawValue anywhere in a dependency graph adds performance overhead into all JSON deserializations that don't even end up deserializing a RawValue, so it's often not an option for codebases, or mean that there are dependencies they won't be able to use. So whatever RawValue does or doesn't do, I feel it's important for there to be an approach that does not involve RawValue.

I'm also not planning on building lossy strings into serde_json, and not keen about large pieces of our string processing code being duplicated downstream.

For these reasons I'm pushing back because it makes sense to keep RawValue the least useful possible, subject to addressing all the use cases for which it actually is necessary.

@lucacasonato
Copy link
Contributor Author

lucacasonato commented Nov 26, 2021

Forcing this kind of manipulation to be done using RawValue instead is not a compelling solution because enabling RawValue anywhere in a dependency graph adds performance overhead into all JSON deserializations that don't even end up deserializing a RawValue, so it's often not an option for codebases, or mean that there are dependencies they won't be able to use. So whatever RawValue does or doesn't do, I feel it's important for there to be an approach that does not involve RawValue.

I was not aware of the performance impact. Currently there is no way to do a partial deserialization of JSON that contains unpaired surrogates. I don't really see how that could be made to work with serde_json::Value, so I think RawValue is the only way that can be made to work. Do you have other ideas?

Is this because we're not in agreement that the choice of representation is lossless, or because the algorithm for performing the stitching would be infeasible to implement in practice?

I am not generally against a lossless representation, but the current choice of lossless representation is not great. I did figure out that this representation is not completely bespoke - it is called WTF-8. There is a Rust crate for it but it, but since WTF-8 must not be used for interchange there is no way to use that crate to get a WTF-8 view on an existing byte buffer containing WTF-8 bytes (this can be hacked around with some unsafe mem::transmutery).

Let's properly document the use of WTF-8 here though: a ByteBuf deserialized from from JSON strings contain WTF-8 encoded bytes, not UTF-8.

I'm also not planning on building lossy strings into serde_json, and not keen about large pieces of our string processing code being duplicated downstream.

Yeah - I understand, I can use the wtf8 crate + some mem::transmute to avoid this.


If the decision is made that the encoding we are going for is really WTF-8, we should follow the encoding spec properly: https://simonsapin.github.io/wtf-8/#encode-to-wtf-8. The case \ud83c\uaaaa is not handled correctly right now for example.

@lucacasonato lucacasonato force-pushed the support_lone_surrogates_in_raw_value branch from 5ec8141 to 51e9616 Compare November 26, 2021 01:08
@lucacasonato
Copy link
Contributor Author

@dtolnay Hey, quick poke - have you had a chance to look into / think about my comment above yet?

Copy link
Member

@dtolnay dtolnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Thanks!

The case \ud83c\uaaaa is not handled correctly right now for example.

I would accept a PR to address this case.

@dtolnay dtolnay merged commit 7e56a40 into serde-rs:master Feb 12, 2022
@lucacasonato lucacasonato deleted the support_lone_surrogates_in_raw_value branch April 11, 2022 19:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants